Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call-centers, news, user groups, chats, reports, web-pages, presentations, image-files, video-files, and marketing material. According to Merrill Lynch, more than 85% of all business information exists in these forms. These information types are called either semi-structured data or unstructured data. However, organizations often use these documents only once.
The management of semi-structured data is recognized as a major unsolved problem in the information technology industry. According to projections from Gartner, white collar workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. Business Intelligence uses both structured and unstructured data: the former is easy to search, while the latter is harder to search yet contains a large quantity of the information needed for analysis and decision making.
Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.
Therefore, when designing a Business Intelligence/Data Warehouse solution, the specific problems associated with semi-structured and unstructured data must be addressed alongside those of structured data.
Unstructured data vs. semi-structured data
Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data cannot be stored in predictably ordered columns and rows. One type of unstructured data is typically stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems. Unstructured data may also refer to irregularly or randomly repeated column patterns that vary from row to row within each file or document.
Many of these data types, however, such as e-mails, word processing text files, PPTs, image-files, and video-files, conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore it may be more accurate to describe these as semi-structured documents or data, but no specific consensus seems to have been reached.
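The split between an opaque document body and its queryable metadata can be illustrated with a small sketch. It assumes SQLite as the relational database, and the table name, column names and sample documents are hypothetical, chosen only for illustration:

```python
import sqlite3

# Hypothetical schema: the raw document goes into a BLOB column,
# while the metadata exposed by the file format gets ordinary columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id         INTEGER PRIMARY KEY,
        author     TEXT,
        created_at TEXT,   -- ISO 8601 timestamp
        mime_type  TEXT,
        content    BLOB    -- opaque to the database
    )
    """
)


def store_document(conn, author, created_at, mime_type, raw_bytes):
    """Insert one document and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (author, created_at, mime_type, content) "
        "VALUES (?, ?, ?, ?)",
        (author, created_at, mime_type, raw_bytes),
    )
    conn.commit()
    return cur.lastrowid


def find_by_author(conn, author):
    """Query the structured metadata; the BLOB itself is never inspected."""
    rows = conn.execute(
        "SELECT id, created_at, mime_type, length(content) "
        "FROM documents WHERE author = ? ORDER BY created_at",
        (author,),
    )
    return rows.fetchall()


# Sample rows (made-up data).
store_document(conn, "alice", "2008-03-01T09:30:00", "message/rfc822",
               b"Subject: Q1 forecast\n\nSee attached.")
store_document(conn, "bob", "2008-03-02T14:00:00", "image/png", b"\x89PNG...")
```

Queries such as `find_by_author(conn, "alice")` work entirely on the metadata columns, which is why documents of this kind are arguably semi-structured rather than unstructured.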
Unstructured data can also simply be the knowledge that business users have about future business trends. Business forecasting naturally aligns with the BI system because business users think of their business in aggregate terms. Capturing the business knowledge that may only exist in the minds of business users provides some of the most important data points for a complete BI solution.
Problems with semi-structured or unstructured data
There are several challenges to developing Business Intelligence with semi-structured data. According to Inmon & Nesavich, some of those are:
- Physically accessing unstructured textual data – unstructured data is stored in a huge variety of formats.
- Terminology – Among researchers and analysts, there is a need to develop a standardized terminology.
- Volume of data – As stated earlier, more than 85% of all business information exists as semi-structured or unstructured data. That volume is compounded by the need for word-to-word and semantic analysis.
- Searchability of unstructured textual data – A simple search on some data, e.g. “apple”, returns only links that contain that precise search term. Inmon & Nesavich (2008) give an example: “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies.”
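The searchability problem in the last item can be made concrete with a short sketch. The documents and the one-entry taxonomy below are made up for illustration, and term expansion through a hand-written taxonomy is only one possible way to go beyond a simple search, not a method prescribed by Inmon & Nesavich:

```python
# Hypothetical mini-corpus, for illustration only.
DOCUMENTS = {
    "doc1": "The defendant was convicted of a felony last year.",
    "doc2": "Police are investigating an arson at the warehouse.",
    "doc3": "The company reported embezzlement by a former officer.",
    "doc4": "Quarterly sales rose by four percent.",
}

# Maps a broad term to related terms that should also count as hits
# (the list follows the terms named in the quoted example).
TAXONOMY = {
    "felony": ["crime", "arson", "murder", "embezzlement", "vehicular homicide"],
}


def simple_search(term, documents):
    """Return ids of documents containing the literal term (case-insensitive)."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, text in documents.items() if term in text.lower()
    )


def expanded_search(term, documents, taxonomy):
    """Search for the term plus every related term listed in the taxonomy."""
    terms = [term] + taxonomy.get(term.lower(), [])
    hits = set()
    for t in terms:
        hits.update(simple_search(t, documents))
    return sorted(hits)


print(simple_search("felony", DOCUMENTS))              # ['doc1']
print(expanded_search("felony", DOCUMENTS, TAXONOMY))  # ['doc1', 'doc2', 'doc3']
```

The simple search misses the arson and embezzlement documents because neither contains the literal word "felony"; the expanded search finds them, at the cost of maintaining the taxonomy.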
The use of metadata
To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata. Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content – e.g. summaries, topics, people or companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.
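As a rough illustration of content-level metadata, the sketch below derives topics and mentioned e-mail addresses from raw text. It is a deliberately naive stand-in for automatic categorization and information extraction: the keyword lists, the regular expression and the function names are assumptions made for this example, and production systems typically use statistical or linguistic models instead.

```python
import re
from collections import Counter

# Hypothetical topic vocabulary; a real system would learn or curate this.
TOPIC_KEYWORDS = {
    "finance": {"invoice", "payment", "budget", "revenue"},
    "support": {"complaint", "refund", "ticket", "outage"},
}

# Simplified e-mail pattern, adequate for the example but not RFC-complete.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def categorize(text):
    """Return matching topics, ordered by number of keyword occurrences."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        topic: sum(words[k] for k in keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, score in ranked if score > 0]


def extract_metadata(text):
    """Build a small content-metadata record for one document."""
    return {
        "topics": categorize(text),
        "emails_mentioned": sorted(set(EMAIL_PATTERN.findall(text))),
        "word_count": len(text.split()),
    }


sample = (
    "Customer jane.doe@example.com filed a complaint about a late refund. "
    "The invoice is attached."
)
print(extract_metadata(sample))
```

A record like this can be stored next to the system-level metadata (filename, author, size) so that documents become searchable by what they are about, not only by where they came from.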
The term metadata refers to “data about data”, that is, data providing information about one or more aspects of the data, such as:
- Means of creation of the data
- Purpose of the data
- Time and date of creation
- Creator or author of the data
- Location on a computer network where the data were created
- Standards used
The term metadata is ambiguous, as it is used for two fundamentally different concepts (types):
- Structural metadata is about the design and specification of data structures and is more properly called “data about the containers of data”.
- Descriptive metadata, on the other hand, is about individual instances of application data, the data content. In this case, a useful description would be “data about data content” or “content about content”, hence metacontent. Descriptive, Guide and the National Information Standards Organization concept of administrative metadata are all subtypes of metacontent.
For example, a digital image may include metadata that describe how large the picture is, the color depth, the image resolution, when the image was created, and other data. A text document’s metadata may contain information about how long the document is, who the author is, when the document was written, and a short summary of the document.
Metadata are data. As such, metadata can be stored and managed in a database, often called a metadata registry or metadata repository. However, without context and a point of reference, it might be impossible to identify metadata just by looking at them. For example, a database containing several numbers, all 13 digits long, could by itself be the results of calculations or a list of numbers to plug into an equation; without any other context, the numbers themselves can be perceived as the data. But given the context that this database is a log of a book collection, those 13-digit numbers may now be identified as ISBNs: information that refers to the book, but is not itself the information within the book.
The term “metadata” was coined in 1968 by Philip Bagley in his book “Extension of programming language concepts”, where it is clear that he uses the term in the ISO 11179 “traditional” sense, which is “structural metadata”, i.e. “data about the containers of data”, rather than the alternate sense “content about individual instances of data content” or metacontent, the type of data usually found in library catalogues. Since then the fields of information management, information science, information technology, librarianship and GIS have widely adopted the term.
In these fields the word metadata is defined as “data about data”. While this is the generally accepted definition of metadata, various disciplines have adopted their own more specific explanations and uses of the term.