Sunday, August 22, 2010

6 Things About Unstructured Content You Need to Know

Unstructured data. As a writer I hate that term. I remember the first time I heard it: I was sitting in a meeting where technical people were talking about all the unstructured content that publishers produce. How could they be speaking about articles as unstructured? Anyone who has made it through kindergarten has learned that you must follow a linguistic pattern, or structure, in order to communicate effectively. But in the lexicon of geekdom, any article, picture, PowerPoint, video, song, or piece of user-generated content -- your kids' text messages included -- is unstructured. Any "data" that does not fit nicely into a column or a row is considered unstructured.

Now, as much as I hate the term, there is a logic to calling content that doesn't fall neatly into a table unstructured rather than structured. It was evident when classifieds first went online. The ads were dumped from mainframes, where customers had paid by the character, so people had created their own shorthand to say 4BR House 4 Sale. It turns out, though, that all that lack of structure (which I prefer to call free-form) makes the ads very hard to search. Don't believe me? Go to Craigslist -- which tries to impose structure on advertisements by putting them under broad categories and locations. Other than that, it is pretty freestyle. Nannies are caregivers are sitters (baby or otherwise), singular or plural. Search on one of those terms and you risk missing the ads that used one of the others.
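
To make the nanny problem concrete, here is a minimal Python sketch. The ads, the synonym list, and the function names are all made up for illustration; this is not how Craigslist or any real site searches, just a small demonstration of why an exact-term search misses free-form wording.

```python
# Hypothetical free-form ads, written the way people actually type them.
ADS = [
    "Experienced nanny available weekdays",
    "Babysitter needed for two kids",
    "Caregivers wanted, flexible hours",
    "4BR House 4 Sale",
]

# Assumed synonym group; a real site would need a much larger, curated list.
SYNONYMS = {
    "childcare": {
        "nanny", "nannies", "caregiver", "caregivers",
        "sitter", "sitters", "babysitter", "babysitters",
    },
}


def naive_search(term, ads):
    """Return ads that contain the exact term (case-insensitive)."""
    term = term.lower()
    return [ad for ad in ads if term in ad.lower()]


def expanded_search(term, ads):
    """Return ads that contain the term or any word in its synonym group."""
    terms = {term.lower()}
    for group in SYNONYMS.values():
        if term.lower() in group:
            terms |= group
    return [
        ad for ad in ads
        if any(word.strip(",.").lower() in terms for word in ad.split())
    ]


print(naive_search("nanny", ADS))     # only the ad that literally says "nanny"
print(expanded_search("nanny", ADS))  # also the babysitter and caregivers ads
```

The second search only works because someone sat down and wrote the synonym list by hand, which is exactly the kind of structure free-form content lacks.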

On the flip side are the sites that allow only structure: information fits neatly into rows and columns with descriptive titles like Type of residence, # of Bedrooms, Siding, Price, MLS # etc. Having that structure makes it easy to query or search on that information. You probably have heard of SQL, which stands for Structured Query Language. Relational databases use SQL to find content by looking in the appropriate fields. Which is all well and good for content like financial information, inventories, human resources stuff. But what about the rest of the content that floats around a corporation? The memos, sales presentations, business plans, schematics, Web sites -- the stuff we sometimes refer to as knowledge, and which geekdom calls unstructured content -- are the digital assets we need to manage carefully.
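
Here is a rough sketch of what "looking in the appropriate fields" means, using Python's built-in sqlite3 module. The table name, column names, and listings are invented for the example; they simply mirror the kinds of fields mentioned above.

```python
import sqlite3

# In-memory database with a hypothetical listings table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE listings (
        mls_number     TEXT PRIMARY KEY,
        residence_type TEXT,
        bedrooms       INTEGER,
        siding         TEXT,
        price          INTEGER
    )
    """
)

# Made-up sample rows, one value per column.
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?, ?)",
    [
        ("MLS-1001", "House", 3, "Vinyl", 210000),
        ("MLS-1002", "House", 4, "Brick", 289000),
        ("MLS-1003", "Condo", 2, "Stucco", 175000),
        ("MLS-1004", "House", 5, "Wood", 420000),
    ],
)


def find_listings(conn, min_bedrooms, max_price):
    """Query by field: at least min_bedrooms, at most max_price."""
    cursor = conn.execute(
        "SELECT mls_number, bedrooms, price FROM listings "
        "WHERE bedrooms >= ? AND price <= ? ORDER BY price",
        (min_bedrooms, max_price),
    )
    return cursor.fetchall()


print(find_listings(conn, min_bedrooms=4, max_price=300000))
```

Because every listing puts its bedroom count in the same column, nobody has to guess whether the seller typed "4BR", "four bedrooms" or "4 bed".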

So here are 6 Things You Need to Know About Unstructured Content.
  1. It's everywhere. Analysts, pundits and people in the know estimate that more than 80% of content produced in an enterprise is unstructured.
  2. Content is containerized. Unstructured content resides in containers like .doc, .ppt, .tiff -- and you must have the right software application to read or edit it.
  3. Managing unstructured content is hard. Because content resides in containers, it is hard to know what is in each one.
  4. XML is crucial for reuse and sharing. Sometimes called an atomic or neutral format, XML is a markup language used to store and transmit content without the burden of the container. Neutral content can then be "poured" into any template (Word, Web, PDF -- mobile apps!!!) for easier repurposing. If you have unstructured content (and most likely you have lots of it), it should be stored in an XML format.
  5. Good metadata is essential. Once content is in an XML format, enrich it with semantic metadata. This is critical to letting knowledge workers find out what each asset is about.
  6. Native XML databases provide agility & efficiencies. Relational databases (RDBMS) are great for organizing and querying structured data, while XML databases rock for unstructured content. You can make an RDBMS work with XML, but you will lose a lot in database performance (upwards of 30% is estimated by Forrester). Heck, I will use a knife to tighten a screw -- but sometimes I need to go and get the Phillips head.
Now let's look briefly at what all this means. You have tons of content that doesn't fit into tables and rows. Storing that content in its original containers of PowerPoints, Word docs and PDFs makes it extraordinarily hard to share and repurpose, since it's hard to search, and cutting and pasting becomes the only alternative. And when we are talking about sharing and repurposing, remember all the great mash-up apps that your content might be perfect for -- if only it were in a neutral format.
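
For the curious, points 4 and 5 can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. The element names (article, metadata, subject, para) and the sample text are my own invention for the example, not any standard schema, and a real system would more likely hand this job to an XML database or a templating engine.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical neutral-format document: content plus semantic metadata.
DOC = """<article>
  <metadata>
    <subject>real estate</subject>
    <subject>pricing</subject>
  </metadata>
  <title>Spring Housing Outlook</title>
  <body>
    <para>Listings are up this quarter.</para>
    <para>Prices are holding steady.</para>
  </body>
</article>"""


def subjects(article):
    """Read the metadata that says what this asset is about."""
    return [s.text for s in article.findall("./metadata/subject")]


def to_html(article):
    """Pour the same content into a simple Web template."""
    title = escape(article.findtext("title"))
    paras = "".join(
        f"<p>{escape(p.text)}</p>" for p in article.iterfind("./body/para")
    )
    return f"<h1>{title}</h1>{paras}"


def to_plain_text(article):
    """Pour the same content into a plain-text template."""
    title = article.findtext("title").upper()
    paras = [p.text for p in article.iterfind("./body/para")]
    return "\n\n".join([title] + paras)


article = ET.fromstring(DOC)
print(subjects(article))
print(to_html(article))
print(to_plain_text(article))
```

The content is written once, the metadata says what it is about, and each output format is just another small function. No cutting and pasting required.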

Look, I don't like the term Unstructured Content -- but the acronym CTDFWICAR (Content that doesn't fit well in columns and rows) is hardly memorable, and way too long. Semi-structured content (XML actually follows schemas, which makes it semi-structured, but that's for another day) is only half as horrible a term as unstructured. In any event, while it may be free-form, this type of content is the lifeblood of any organization -- and it deserves its own special database to help keep it valuable.

1 comment:

Unknown said...

Great post, Diane. I really like your analysis of unstructured content and what it really means. We have a community for IM professionals (www.openmethodology.org) and have bookmarked this post for our users. Look forward to reading your work in the future.