BIG DATA - BRINGING STRUCTURE TO UNSTRUCTURED DATA

In its native format, a large pile of unstructured data has little value. It is burdensome in the typical enterprise, especially one that has not adopted Big Data practices to extract the value.

However, extracting value can be akin to finding a needle in a haystack, and if that haystack is spread across several farms and the needle is in pieces, it becomes even more difficult. One of the primary jobs of Big Data analytics is to piece that needle back together and organize the haystack into a single entity to speed up the search. That can be a tall order with unstructured data, a type of data that is growing in volume as well as complexity.

Unstructured (or uncatalogued) data can take many forms, such as historical photograph collections, audio clips, research notes, genealogy materials, and other riches hidden in various data libraries. The Big Data movement has driven methodologies to create dynamic and meaningful links among these currently unstructured information sources.

For the most part, that has resulted in the creation of metadata and methods to bring structure to unstructured data. Currently, two dominant technical and structural approaches have emerged: (1) a reliance on search technologies, and (2) a trend toward automated data categorization. Many data categorization techniques are being applied across the landscape, including taxonomies, semantics, natural language recognition, auto-categorization, “what’s related” functionality, data visualization, and personalization. The idea is to supply the descriptive information that an analytics function needs in order to process the data.

The importance of integrating structured and unstructured data cannot be overstated in the world of Big Data analytics. A few enabling technical strategies make it possible to sort the wheat from the chaff. One is SQL-NoSQL integration. Those using MapReduce and other schemaless frameworks have been struggling with structured data and analytics coming from the relational database management system (RDBMS) side. However, the integration of the relational and nonrelational paradigms provides the most powerful analytics by bringing together the best of both worlds.

There are several technologies that enable this integration; some of them take advantage of the processing power of MapReduce frameworks like Hadoop to perform data transformation in place, rather than doing it in a separate middle tier. Some tools combine this capability with in-place transformation at the target database as well, taking advantage of the computing capabilities of engineered machines and using change data capture to synchronize source and target, again without the overhead of a middle tier. In both cases, the overarching principle is real-time data integration: data changes are reflected instantly in the data warehouse, whether they originate from a MapReduce job or from a transactional system, so that downstream analytics have an accurate, timely view of reality. Others are turning to linked data and semantics, where data sets are created using linking methodologies that focus on the semantics of the data.
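To make the change data capture idea concrete, here is a minimal sketch in Python. It assumes a change is a small record with an operation, a key, and an optional row, and it models the target warehouse table as a dictionary. The `Change` class, the `apply_changes` function, and the sample rows are all hypothetical names invented for illustration; real CDC tools read changes from database logs rather than from a Python list.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One captured change (hypothetical shape, for illustration only)."""
    op: str                               # "insert", "update", or "delete"
    key: str                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row contents, if any


def apply_changes(target: Dict[str, Dict[str, Any]], changes: List[Change]) -> None:
    """Replay captured changes, in order, directly against the target table."""
    for change in changes:
        if change.op in ("insert", "update"):
            # Copy the row so the target does not share state with the log.
            target[change.key] = dict(change.row or {})
        elif change.op == "delete":
            target.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")


# The warehouse copy starts out in sync with the source.
warehouse = {"c1": {"name": "Ada", "city": "London"}}

# Changes captured at the source since the last synchronization.
change_log = [
    Change("update", "c1", {"name": "Ada", "city": "Paris"}),
    Change("insert", "c2", {"name": "Grace", "city": "New York"}),
    Change("delete", "c1"),
]

apply_changes(warehouse, change_log)
print(warehouse)
```

Only the changes travel from source to target, and they are applied at the target itself, which is the point of avoiding a separate middle tier.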

This fits well into the broader notion of pointing at external sources from within a data set, which has been around for quite a long time. The ability to point to unstructured data (whether it resides in the file system or in some external source) merely becomes an extension of existing capabilities: storing and processing XML natively within an RDBMS, and querying it with XQuery, makes it possible to combine different degrees of structure while searching and analyzing the underlying data.

Newer semantic technologies can take this further by providing a set of formalized XML-based standards for storage, querying, and manipulation of data. Since these technologies have been focused on the Web, many businesses have not associated them with Big Data solutions.

Most NoSQL technologies fall into the categories of key-value stores, graph databases, or document databases; the semantic Resource Description Framework (RDF) triple store offers an alternative. It is not relational in the traditional sense, but it still maintains relationships between data elements, including external ones, and does so in a flexible, extensible fashion.

A record in an RDF store is a triple, consisting of a subject, a predicate, and an object. This does not impose a relational schema on the data, so new elements can be added without structural modifications to the store. In addition, the underlying system can resolve references by inferring new triples from the existing records using a rule set. This is a powerful alternative to joining relational tables to resolve references in a typical RDBMS, while also offering a more expressive way to model data than a key-value store.
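The triple model and rule-based inference can be sketched in a few lines of Python. This is a toy illustration, not an RDF implementation: triples are plain tuples of strings, the store is a set, and the single rule applied is an assumed one (that the `locatedIn` predicate is transitive). The function names and the sample data are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infer_transitive(triples: Set[Triple], predicate: str) -> Set[Triple]:
    """Return the triples plus every triple implied by treating
    `predicate` as transitive (a -> b and b -> c implies a -> c)."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges so the set can grow during the loop.
        edges = [(s, o) for (s, p, o) in closed if p == predicate]
        for s1, o1 in edges:
            for s2, o2 in edges:
                if o1 == s2 and (s1, predicate, o2) not in closed:
                    closed.add((s1, predicate, o2))
                    changed = True
    return closed


def objects(triples: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """All objects linked to `subject` through `predicate`."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}


store: Set[Triple] = {
    ("photo42", "depicts", "Alice"),
    ("photo42", "takenIn", "Springfield"),
    ("Springfield", "locatedIn", "Illinois"),
    ("Illinois", "locatedIn", "USA"),
}

# A brand-new predicate needs no schema change: it is just another triple.
store.add(("photo42", "photographer", "Bob"))

inferred = infer_transitive(store, "locatedIn")
print(sorted(objects(inferred, "Springfield", "locatedIn")))
```

The triple `("Springfield", "locatedIn", "USA")` is never stored explicitly; it is derived by the rule, which is the role a join would play in a relational design.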

One of the most powerful aspects of semantic technology comes from the world of linguistics and natural language processing: entity extraction. This is a powerful mechanism to extract information from unstructured data and combine it with transactional data, enabling deep analytics by bringing these worlds closer together.

Another method that brings structure to the unstructured is text analytics, which keeps improving as researchers come up with new ways of making algorithms understand written text more accurately. Today’s algorithms can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend is to move toward recognition of further useful entities, such as product names, brands, events, and skills.
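As a rough illustration of detecting entities from context, the Python sketch below uses a handful of hand-written context cues: an honorific before a capitalized name suggests a person, a corporate suffix suggests an organization, and a phrase such as "based in" suggests a location. The cue lists and the `extract_entities` function are assumptions made for this example; production text analytics tools generally rely on trained statistical models rather than fixed patterns like these.

```python
import re
from typing import List, Tuple

# One or more capitalized words, e.g. "Jane Smith" or "Chicago".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Each pattern captures the entity in group 1; the surrounding text is the
# context cue that decides the label. The cue lists are illustrative only.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. (" + NAME + r")")),
    ("ORGANIZATION", re.compile(r"\b(" + NAME + r" (?:Inc|Corp|Ltd))\b")),
    ("LOCATION", re.compile(r"\b(?:based in|located in) (" + NAME + r")")),
]


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Return (label, entity) pairs found through the context cues above."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group(1)))
    return found


sample = "Dr. Jane Smith joined Acme Widgets Inc, which is based in Chicago."
for label, entity in extract_entities(sample):
    print(label, entity)
```

The output rows (label plus entity) are already structured data, so they can be loaded into a table and joined with transactional records.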

Entity relation extraction is another important tool. A relation that consistently connects two entities across many documents is important information in science and enterprise alike, and entity relation extraction can surface such relations as new knowledge in Big Data. Other unstructured data tools are detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos is growing steadily, and transcripts are even more unstructured than written text because there is no punctuation.
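The notion of a relation that "consistently connects two entities in many documents" can be sketched as follows. This minimal Python example assumes entities are single capitalized words and relations are drawn from a small fixed verb list; both are simplifications chosen for illustration, and the `consistent_relations` function and the sample documents are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # (entity, relation verb, entity)

ENTITY = r"[A-Z][a-z]+"
# Toy pattern: "<Entity> <verb> <Entity>" with a fixed, assumed verb list.
RELATION = re.compile(
    r"\b(" + ENTITY + r") (acquired|treats|inhibits) (" + ENTITY + r")\b"
)


def consistent_relations(documents: List[str], min_documents: int = 2) -> List[Relation]:
    """Return relations that appear in at least `min_documents` documents."""
    seen_in: Dict[Relation, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for subj, verb, obj in RELATION.findall(text):
            # Track distinct documents, so repeats in one text count once.
            seen_in[(subj, verb, obj)].add(doc_id)
    return sorted(rel for rel, docs in seen_in.items() if len(docs) >= min_documents)


documents = [
    "Analysts confirmed that Acme acquired Globex last year.",
    "After Acme acquired Globex, prices rose. Acme acquired Globex quickly.",
    "Rumours say Initech acquired Hooli.",
]

print(consistent_relations(documents))
```

Counting documents rather than raw mentions keeps a single repetitive text from making a relation look well supported.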

Taken from: Big Data Analytics: Turning Big Data into Big Money
