Sunday, August 29, 2010

'Human Metadata' Shapes Experiences

I haven't had a chance to read Guy Deutscher's book “Through the Language Glass: Why the World Looks Different in Other Languages,” being published this month by Metropolitan Books, but I read a fascinating excerpt in today's New York Times Sunday Magazine.

Deutscher, an honorary research fellow at the School of Languages, Linguistics and Cultures at the University of Manchester, looks at different languages to examine whether they impose different contexts on their speakers. He cites, for example, how English speakers can be inexact about gender in a sentence such as "A friend came over last night," while in most other languages we would know whether that "friend" was male or female. (Totally unscholarly aside: might the English language's ability to obfuscate gender allow us to create nonsensical laws like "Don't Ask, Don't Tell" and influence perceptions of homosexuality around the globe?)

Most languages require the individual to store gender with objects (although not uniformly: the German bridge is feminine, the Spanish masculine), whereas others use a mind-expanding geospatial method. Native Americans and indigenous Islanders apparently use N, S, E, W coordinates to relay location, whereas English and other European languages use an egocentric method -- in front, behind, etc. More interestingly, that "geospatial data" is actually stored in human memory!

Language apparently forces us to store a type of human metadata with every memory, and that metadata shapes experiences, connotations and denotations. It is easy to infer how publishers providing mashups that rely on multiple types of metadata could enrich the experience for those who think differently from the way the media originally assumed they would.

Sunday, August 22, 2010

6 Things About Unstructured Content You Need to Know

Unstructured data. As a writer I hate that term. I remember the first time I heard reference to it: I was sitting in a meeting where technical people were talking about all the unstructured content that publishers produce. How could they be speaking about articles as unstructured? Anyone who has made it through kindergarten has learned that they must follow a linguistic pattern or structure in order to communicate effectively. But in the lexicon of geekdom, any article, picture, PowerPoint, video, song, user-generated content -- your kids' text messages -- is unstructured. Any "data" that does not fit nicely into a column or a row is considered unstructured.

Now, as much as I hate the term, there is a logic to calling all content that doesn't fall neatly into a table unstructured rather than structured. It was evident when classifieds first went online. The ads were dumped from mainframes, and because customers had paid by the character, people had created their own shorthand to say 4BR House 4 Sale. Turns out, though, that all that unstructure (which I prefer to call free-form) makes content very hard to search. Don't believe me? Go to Craigslist -- which tries to impose structure on advertisements by putting them under broad categories and locations. Other than that -- it is pretty freestyle. Nannies are caregivers are sitters (baby or otherwise) -- and plural or otherwise. And search on one of those terms at the peril of not finding the ad filed under another.
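To make the nanny/caregiver/sitter problem concrete, here is a minimal Python sketch. The ads, the synonym group and the function names are all made up for illustration; a real site would need a far richer thesaurus, or semantic metadata, rather than a hand-built set.

```python
import re

# Hypothetical free-form ads, in the spirit of the classifieds described above.
ADS = [
    "Experienced nanny available weekdays",
    "Babysitters wanted for two toddlers",
    "Caregiver needed for elderly parent",
    "4BR House 4 Sale",
]

# Hand-built synonym group: an assumption for illustration, not a real thesaurus.
SYNONYMS = [
    {"nanny", "nannies", "caregiver", "caregivers",
     "sitter", "sitters", "babysitter", "babysitters"},
]

def tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def naive_search(term, ads):
    """Exact word match only: what a plain keyword search does."""
    return [ad for ad in ads if term.lower() in tokens(ad)]

def expanded_search(term, ads):
    """Match the term plus every word in its synonym group."""
    term = term.lower()
    wanted = {term}
    for group in SYNONYMS:
        if term in group:
            wanted |= group
    return [ad for ad in ads if tokens(ad) & wanted]

print(naive_search("nanny", ADS))     # only the ad that literally says "nanny"
print(expanded_search("nanny", ADS))  # also finds the babysitter and caregiver ads
```

The point of the sketch is that the extra work of relating terms has to live somewhere: either in the search code, as here, or in metadata attached to the content itself.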

On the flip side are the sites that allow only structure: information fits neatly in a row or a table and has descriptive titles like Type of residence, # of Bedrooms, Siding, Price, MLS # etc. Having that structure makes it easy to query or search on that information. You probably have heard of SQL -- which stands for Structured Query Language. Relational databases use SQL to find content -- by looking in the appropriate fields. Which is all well and good for content like financial information, inventories, human resources stuff. But what about the rest of the content that floats around a corporation? The memos, sales presentations, business plans, schematics, Web sites -- the stuff we sometimes refer to as Knowledge, and which in geekdom is called unstructured content -- are the digital assets we need to carefully manage.
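Here is a small sketch of what "looking in the appropriate fields" means, using Python's built-in sqlite3 module. The table name, column names and listings are hypothetical, loosely modeled on the field titles mentioned above.

```python
import sqlite3

# In-memory database; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE listings (
           mls_number     TEXT PRIMARY KEY,
           residence_type TEXT,
           bedrooms       INTEGER,
           siding         TEXT,
           price          INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?, ?)",
    [
        ("MLS-001", "House", 4, "Vinyl", 350000),
        ("MLS-002", "Condo", 2, "Brick", 210000),
        ("MLS-003", "House", 4, "Cedar", 425000),
        ("MLS-004", "House", 4, "Brick", 399000),
    ],
)

# Because every fact sits in a named column, the question
# "four-bedroom homes at or under $400,000" is a single query.
rows = conn.execute(
    "SELECT mls_number, price FROM listings "
    "WHERE bedrooms = ? AND price <= ? ORDER BY price",
    (4, 400000),
).fetchall()

for mls_number, price in rows:
    print(mls_number, price)
```

No guessing at shorthand like "4BR" is needed, because the number of bedrooms was captured as its own field when the listing was entered.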

So here are 6 Things You Need to Know About Unstructured Content.
  1. It's everywhere. Analysts, pundits and people in the know estimate that more than 80% of content produced in an enterprise is unstructured.
  2. Content is containerized. Unstructured content resides in containers like .doc, .ppt, .tiff -- and you must have the right software application to read or edit it.
  3. Managing unstructured content is hard. Because content resides in containers, it is hard to know what is in each one.
  4. XML is crucial for reuse and sharing. Sometimes called atomic or neutral format, XML is a language used to transmit content -- without the burden of the container. Neutral content can then be "poured" into any template (Word, Web, PDF -- mobile apps!!!) for easier repurposing. If you have unstructured content (and most likely you have lots of it) it should be stored in an XML format.
  5. Good metadata is essential. Once content is in an XML format, enrich it with semantic metadata. This is critical to letting knowledge workers find out what each asset is about.
  6. Native XML databases provide agility & efficiencies. Relational databases (RDBMS) are great for organizing and querying structured data -- while XML databases rock for unstructured content. You can make an RDBMS work with XML, but you will lose a lot in database performance (upwards of 30%, Forrester estimates). Heck, I will use a knife to tighten a screw -- but sometimes I need to go and get the Phillips head.

Now let's look briefly at what all this means. You have tons of content that doesn't fit into tables and rows. Storing content in its original containers of PowerPoints, Word docs and PDFs makes it extraordinarily hard to share and repurpose -- since it's hard to search -- and cutting and pasting becomes the only alternative. And when we are talking about sharing and repurposing, remember all the great mash-up apps that your content might be perfect for -- if only it were in a neutral format.
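As a rough sketch of what "neutral format" buys you, the Python below parses one small XML article and pours it into two different layouts. The element names (article, metadata, subject, title, para) are invented for this example and are not any particular publishing schema.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up article in a made-up vocabulary; real schemas are much richer.
ARTICLE_XML = """<article>
  <metadata>
    <subject>real estate</subject>
    <subject>classifieds</subject>
  </metadata>
  <title>Four-Bedroom House for Sale</title>
  <para>Bright colonial near schools &amp; parks.</para>
  <para>Open house this Sunday.</para>
</article>"""

def parse_article(xml_text):
    """Pull the content and its metadata out of the XML, free of any layout."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "paras": [p.text for p in root.findall("para")],
        "subjects": [s.text for s in root.findall("metadata/subject")],
    }

def to_html(article):
    """Pour the content into a Web layout."""
    body = "".join(f"<p>{escape(p)}</p>" for p in article["paras"])
    return f"<h1>{escape(article['title'])}</h1>{body}"

def to_text(article):
    """Pour the same content into a plain-text layout."""
    return article["title"].upper() + "\n\n" + "\n".join(article["paras"])

article = parse_article(ARTICLE_XML)
print(article["subjects"])  # the metadata says what the asset is about
print(to_html(article))
print(to_text(article))
```

The same source feeds both outputs with no cutting and pasting, and the subject metadata travels with the content instead of being locked inside a .doc or .ppt container.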

Look, I don't like the term Unstructured Content -- but the acronym of CTDFWICAR (Content that doesn't fit well in columns and rows) is hardly memorable -- and way too long. Semi-structured content (because XML actually follows schemas -- which makes it semi-structured, but that's for another day) is only half as horrible as unstructured. In any event, while it may be free-form -- this type of content is the lifeblood of any organization -- and deserves its own special database to help keep it valuable.

Wednesday, August 11, 2010

No Net Neutrality? The Ultimate Tax on Businesses & Society

Wanna watch conservative and libertarian eyes twitch? Tell 'em we need more government oversight. No doubt they would point to the all-but-impotent Federal Communications Commission (FCC) as yet another example of how government mucks things up. And they would be right, with only themselves (and Congress) to blame.

Set up in 1934 to "ensure that the American people have available -- at reasonable costs and without discrimination -- rapid, efficient, nation- and world-wide communications services," the FCC has been loaded with commissioners more concerned with areola than meaningful access. And the laws they made (1996), deregulating the telcos but not dealing with the last mile until nearly a decade later, increased competition -- but only to a point. The "last mile" is the euphemism for the customer, and every new long-distance competitor that emerged in the 1990s had to pay the incumbent Bells a "toll" to bring the long-distance wire to the customer. Fast forward another 10 years, and through roll-ups, the FCC managed to pave the way to a new unregulated monopoly -- Verizon.

Which brings us to today's topic: Net Neutrality -- or rather, Verizuhn's Net, or as my favorite curmudgeon Jeff Jarvis decries: The Schminternet. Google and Verizon cobbled together a self-serving policy. Per the FCC edict: there would be no discrimination -- on wired services. But wireless and managed services? New ballgame. In fact, the business model already exists: it's called cable -- and you can only get your HBO if you pay for the mega movie package. Telco and Internet Law Professor Susan Crawford calls this a "science-fiction-quality loophole."

Google has decided to cross the line and work with Verizon to end the Net Neutrality stalemate because the FCC created a black hole instead of policies -- not that the courts have helped. None of the adults decided to man up -- so the corporations are stepping in. Funny, the libertarians and conservatives don't trust government -- but they want to trust the folks most likely to profit from any policy changes. Fox. Henhouse. Anyone?

So while the FCC is impotent and will remain so as long as the commissioners -- and the members of Congress who pass insipid bills -- are too busy fundraising to study up on the subjects on which they are making laws, perhaps we can rouse that other, slightly older federal agency, the FTC, which is supposed to worry about trade and commerce, consumer protections and anti-competitive monopolies, to see what this mumbo-jumbo really is: a tax on small businesses and individuals. If they can't afford to pay, they won't be able to play. This regressive tax would unlevel the playing field -- stifling job creation and innovation by blocking those too small to ante from Schminternet services.

Oh, and just in case there is any, ahem, (no doubt) unintended discrimination, according to the GooVer proposal, one can file a complaint. A $2m penalty would be incurred by anyone in violation (see loopholes above to see if you have a snowball's chance of winning).

And of course if the FTC won't step in, we can always pray to the folks at Facebook to do the opposite of whatever Google wants.

Monday, August 9, 2010

Australian sells 8500 iPad apps first month

Interesting news tidbit: The Australian, the Aussie national paper owned by News Limited, part of the Rupert Murdoch news dynasty, launched its iPad app this week. And while the paper pales in circulation next to the two metro papers of Sydney and Melbourne, with only 135,000 for its Monday-Friday edition and double that for Saturday, it sold 8,500 apps in its inaugural month.

According to Nick Leeder, deputy chief executive officer of The Australian, sales of the app, which costs $3.99 a month, far exceeded expectations -- and the app has proven to be a useful albeit "brutal" feedback tool. "The first thing they said was 'we want more content.' There are certain things The Australian stands for that we didn't have in the original app, more Opinion and Media," explained Leeder. Readers also said they wanted content and functionalities that would be found on an iPad app -- "and that's progressively being rolled out."

Not too shabby to have a focus group that pays you to participate. But if there is one thing we all should have learned by now, it is that audiences will give you some latitude at ramp-up -- but you have a very small window to give them what they want before they go ... er, walk-about.

Monday, August 2, 2010

Mixing, Mashing XML into Content Derivatives

After much deliberation, I have changed jobs, moving from Nstein (now OpenText) to MarkLogic, a provider of “purpose-built databases for unstructured content,” which means we handle all that data that doesn’t fit nicely into rows and columns -- you know, content like documents, articles, books, graphics -- which is often (and best) represented in XML. Over the years I have written about the importance of semantic metadata, but it is only a scintilla of the types of metadata that can be appended to content -- as long as that all-important infrastructure is in place. It was so last year to be absorbed in knowing what content you created -- now what is important is to complement that content with information from other resources.

What does this mean to information providers, publishers and other types of media (and if you read my blogs, you know that I believe all of us are publishers!)? Well, consider the iPad. Selling at a head-shaking rate of one every three seconds (despite the recession), more than 13 million will be in consumers' hands by Christmas. Add to it the hordes of other mobile devices: iPhones, Blackberries, Androids, eReaders ... and you have 12 percent of the market looking for content for their gizmos. Which in itself can be a challenge -- since content doesn't just magically play nicely on every device. Most of that content will need to flow into its own native application to really exploit the devices' features, which means that content needs to first be available in a neutral format.

And while you are exploiting the gizmos' features ... remember that if gizmos are everywhere their owners are -- there will be an increased desire for what is known in the military as situational awareness: deriving additional, contextual information that relates to the user -- usually time (temporal data) and space (geo mapping). Think of a soldier needing to know what threats are in the area where he currently is -- or where he is going. Publishers too should think in terms of creating new derivative, situational content. For example, what types of information might a business person want while on Maple and Elm at 8am in the summer -- versus at 8pm in the winter? Or what location-based weather patterns does a commodities trader want when looking at crop futures?

This need for situational awareness provides a great opportunity for publishers to take their knowledge bases and mix them with external resources -- such as public information from NOAA, Google Maps, LinkedIn, or proprietary information from partners. The key to mixing and mashing is having content in a mutable format -- and a database that can handle it. Extensible Markup Language, or XML, is a highly flexible text format, a W3C standard that is sometimes called atomic or a neutral format. It is designed to be easily stored and retrieved -- void of any display format. Which means it "pours" nicely into any layout. MarkLogic's database, in my mind, is akin to a gourmet mixing bowl that takes in XML and allows it to be stored and retrieved into any application.
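A toy version of that mixing bowl, sketched in plain Python rather than in MarkLogic itself: a story in XML is enriched with weather data looked up by the story's location. The element names and the weather dictionary are assumptions for illustration; the dictionary merely stands in for an external feed and is not a real NOAA response.

```python
import xml.etree.ElementTree as ET

# A made-up story in a made-up XML vocabulary.
STORY_XML = """<story>
  <headline>Corn futures edge higher</headline>
  <location>Des Moines, IA</location>
</story>"""

# Stand-in for an external resource, keyed by location (hypothetical data).
EXTERNAL_WEATHER = {
    "Des Moines, IA": {"forecast": "dry", "high_f": "91"},
}

def mash_up(story_xml, weather_by_location):
    """Return the story with a <context> element added when weather is known."""
    story = ET.fromstring(story_xml)
    place = story.findtext("location")
    weather = weather_by_location.get(place)
    if weather is not None:
        context = ET.SubElement(story, "context", {"source": "external-weather"})
        for name, value in weather.items():
            ET.SubElement(context, name).text = value
    return story

enriched = mash_up(STORY_XML, EXTERNAL_WEATHER)
print(ET.tostring(enriched, encoding="unicode"))
```

Because the story is just elements, adding situational context is a matter of appending more elements; nothing about the original headline or any display layout has to change.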

XML is hardly new: it was designed for large-scale publishing, although awareness of it has grown thanks to the Web and blog feeds, and it is a great way to describe unstructured data. Unstructured data can reside in regular relational databases (RDBMS) -- but they tend to get bogged down. By storing this unstructured data in a database built specifically to handle these datatypes -- you can search and retrieve much more quickly. Forrester Analyst Noel Yuhanna told me he estimated that unburdening an RDBMS of unstructured data brings a 30% lift in database performance, which is huge.

In any event, the real advantage to having content in XML -- and residing in a database built to handle it -- is that you can easily mix and mash it up into new types of content, ready it for new delivery platforms, or ease syndication. And you can do all of this in a matter of weeks not months -- which is terrific since we don't yet know what other new gizmo might be in readers' hands -- least of all by Christmas.