THE BIG DATA PIPELINE IN DEPTH

Big Data does not arise from a vacuum (except, of course, when studying deep space). Data are recorded from a data-generating source. Gathering data is akin to sensing and observing the world around us, from the heart rate of a hospital patient to the contents of an air sample to the number of Web page queries to scientific experiments that can easily produce petabytes of data.

However, much of the data collected is of little interest and can be filtered and compressed by many orders of magnitude, which creates a bigger challenge: the definition of filters that do not discard useful information. For example, suppose one data sensor reading differs substantially from the rest. Can that be attributed to a faulty sensor, or are the data real and worth inclusion?
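
As a minimal sketch of how that judgment might be deferred rather than hard-coded into a filter (an illustration, not drawn from the original text), the routine below flags readings that deviate sharply from a rolling median instead of silently discarding them; the window size and threshold are arbitrary assumptions.

    import statistics
    from collections import deque

    def flag_outliers(readings, window=20, threshold=4.0):
        # Yield (value, is_suspect) pairs; suspect values are flagged rather than
        # dropped, so an analyst can later decide whether the sensor was faulty
        # or the reading is real and worth keeping.
        history = deque(maxlen=window)
        for value in readings:
            if len(history) >= 5:
                median = statistics.median(history)
                spread = statistics.pstdev(history) or 1.0  # guard against zero spread
                is_suspect = abs(value - median) > threshold * spread
            else:
                is_suspect = False  # too little context yet to judge
            history.append(value)
            yield value, is_suspect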

Further complicating the filtering process is how the sensors gather data. Are they based on time, transactions, or other variables? Are the sensors affected by environment or other activities? Are the sensors tied to spatial and temporal events such as traffic movement or rainfall?

Before the data are filtered, these considerations and others must be addressed. That may require new techniques and methodologies to process the raw data intelligently and deliver a data set in manageable chunks without throwing away the needle in the haystack. Further filtering complications come with real-time processing, in which the data are in motion and streaming on the fly, and one does not have the luxury of being able to store the data first and process them later for reduction.
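
A rough sketch of such on-the-fly reduction, assuming a stream of (timestamp, value) pairs and an arbitrary 60-second window, appears below; only per-window aggregates are retained, never the raw readings.

    from dataclasses import dataclass

    @dataclass
    class WindowSummary:
        count: int = 0
        total: float = 0.0
        minimum: float = float("inf")
        maximum: float = float("-inf")

        def add(self, value):
            # Update the running aggregates; the raw reading itself is never stored.
            self.count += 1
            self.total += value
            self.minimum = min(self.minimum, value)
            self.maximum = max(self.maximum, value)

    def reduce_stream(timestamped_values, window_seconds=60):
        # Collapse a stream of (timestamp, value) pairs into one summary per window.
        summaries = {}
        for ts, value in timestamped_values:
            bucket = int(ts // window_seconds)   # which time window this reading falls in
            summaries.setdefault(bucket, WindowSummary()).add(value)
        return summaries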

Another challenge comes in the form of automatically generating the right metadata to describe what data are recorded and how they are recorded and measured. For example, in scientific experiments, considerable detail on specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with observational data.

When implemented properly, automated metadata acquisition systems can minimize the need for manual processing, greatly reducing the human burden of recording metadata. Those who are gathering data also have to be concerned with data provenance. Recording information about the data at their time of creation becomes important as the data move through the data analysis process. Accurate provenance can prevent processing errors from rendering the subsequent analysis useless. With suitable provenance, the processing steps responsible for an error can be quickly identified. Proving the accuracy of the data is accomplished by generating suitable metadata that also carry the provenance of the data through the data analysis process.
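
The sketch below suggests one way automated metadata and provenance capture might look; the field names and the use of a content hash as a provenance anchor are illustrative assumptions, not a standard schema.

    import hashlib
    import json
    from datetime import datetime, timezone

    def record_observation(value, sensor_id, units, conditions):
        # Package a raw reading together with the metadata needed to interpret it
        # later. The field names here are illustrative, not a standard schema.
        observation = {
            "value": value,
            "sensor_id": sensor_id,
            "units": units,                 # e.g. "beats/min"
            "conditions": conditions,       # experimental or environmental context
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        # A content hash gives every later processing step a stable reference to
        # this exact observation: the first link in its provenance chain.
        observation["checksum"] = hashlib.sha256(
            json.dumps(observation, sort_keys=True).encode()).hexdigest()
        return observation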

Another step in the process consists of extracting and cleaning the data. The information collected will frequently not be in a format ready for analysis. For example, consider electronic health records in a medical facility that consist of transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated anomalous data), and image data such as scans. Data in this form cannot be effectively analyzed. What is needed is an information extraction process that draws out the required information from the underlying sources and expresses it in a structured form suitable for analysis.
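
As a toy example of such extraction, the sketch below pulls two structured fields out of free-text dictation with regular expressions; the patterns and field names are invented placeholders, and real clinical extraction requires far more robust techniques.

    import re

    def extract_vitals(note):
        # Pull a few structured fields out of free-text dictation. These patterns
        # are simplistic placeholders, not production-grade clinical NLP.
        extracted = {}
        bp = re.search(r"\b(\d{2,3})\s*/\s*(\d{2,3})\b", note)   # e.g. "120/80"
        if bp:
            extracted["systolic"], extracted["diastolic"] = map(int, bp.groups())
        hr = re.search(r"(?:heart rate|HR)\D*(\d{2,3})", note, re.IGNORECASE)
        if hr:
            extracted["heart_rate"] = int(hr.group(1))
        return extracted

    # extract_vitals("HR 72, blood pressure 120/80, no acute distress")
    # -> {'systolic': 120, 'diastolic': 80, 'heart_rate': 72}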

Accomplishing that correctly is an ongoing technical challenge, especially when the data include images (and, in the future, video). Such extraction is highly application dependent; the information in an MRI, for instance, is very different from what you would draw out of a surveillance photo. The ubiquity of surveillance cameras and the popularity of GPS-enabled mobile phones, cameras, and other portable devices mean that rich and high-fidelity location and trajectory (i.e., movement in space) data can also be extracted.

Another issue is the honesty of the data. For the most part, data are assumed to be truthful and accurate; however, in some cases those who report the data may choose to hide or falsify information. For example, patients may choose to hide risky behavior, or potential borrowers filling out loan applications may inflate income or hide expenses. The ways in which data can be misreported or misinterpreted are endless. Cleaning data before analysis therefore requires well-recognized constraints on valid data or well-understood error models, both of which may be lacking in Big Data platforms.
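
The sketch below shows what such explicit constraints and a simple cross-field plausibility check might look like; the fields, ranges, and the tenfold ratio are arbitrary assumptions chosen for illustration.

    # Illustrative validity constraints; the fields and ranges are assumptions,
    # not rules taken from any particular system.
    CONSTRAINTS = {
        "age": lambda v: 0 <= v <= 120,
        "stated_income": lambda v: v >= 0,
        "monthly_expenses": lambda v: v >= 0,
    }

    def validate_record(record):
        # Return the list of violations rather than silently "fixing" the record,
        # so the error model stays explicit and auditable.
        violations = [f"{field}={record[field]!r} fails its validity check"
                      for field, check in CONSTRAINTS.items()
                      if field in record and not check(record[field])]
        # A cross-field plausibility rule: expenses wildly exceeding stated income
        # are flagged for review rather than rejected outright.
        if record.get("monthly_expenses", 0) > 10 * record.get("stated_income", float("inf")):
            violations.append("expenses implausibly high relative to stated income")
        return violations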

Moving data through the process requires concentration on integration, aggregation, and representation of the data—all of which are process-oriented steps that address the heterogeneity of the flood of data. Here the challenge is to record the data and then place them into some type of repository.

Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are machine readable and then computer resolvable. It may take a significant amount of work to achieve automated error-free difference resolution.

The data preparation challenge even extends to analysis that uses only a single data set. Here there is still the issue of suitable database design, further complicated by the many alternative ways in which to store the information. Particular database designs may have certain advantages over others for analytical purposes. A case in point is the variety in the structure of bioinformatics databases, in which information on substantially similar entities, such as genes, is inherently different but is represented with the same data elements.

Examples like these clearly indicate that database design is an artistic endeavor that has to be carefully executed in the enterprise context by professionals. When creating effective database designs, professionals such as data scientists must have the tools to assist them in the design process, and more important, they must develop techniques so that databases can be used effectively in the absence of intelligent database design.
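
To make the point about alternative representations concrete, the sketch below stores the same made-up gene record two ways in SQLite: a wide table with one column per attribute and a flexible entity-attribute-value layout. Neither schema nor the values are drawn from an actual bioinformatics database.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Design A: a wide table with one column per attribute. Simple to query, but
    # every new attribute means a schema change.
    conn.execute("CREATE TABLE gene_wide (gene_id TEXT PRIMARY KEY, symbol TEXT, "
                 "chromosome TEXT, length_bp INTEGER)")
    conn.execute("INSERT INTO gene_wide VALUES ('G1', 'BRCA1', '17', 81000)")

    # Design B: an entity-attribute-value layout. Flexible for heterogeneous
    # entities, but analytical queries become harder to write and optimize.
    conn.execute("CREATE TABLE gene_eav (gene_id TEXT, attribute TEXT, value TEXT)")
    conn.executemany("INSERT INTO gene_eav VALUES (?, ?, ?)",
                     [("G1", "symbol", "BRCA1"),
                      ("G1", "chromosome", "17"),
                      ("G1", "length_bp", "81000")])

    # The same question requires very different queries under the two designs.
    print(conn.execute("SELECT symbol FROM gene_wide WHERE gene_id = 'G1'").fetchone())
    print(conn.execute("SELECT value FROM gene_eav "
                       "WHERE gene_id = 'G1' AND attribute = 'symbol'").fetchone())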

As the data move through the process, the next step is querying the data and then modeling them for analysis. Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis. Big Data is often noisy, dynamic, heterogeneous, interrelated, and untrustworthy—a very different informational source from small data sets used for traditional statistical analysis.

Even so, noisy Big Data can be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. In addition, interconnected Big Data creates large heterogeneous information networks with which information redundancy can be explored to compensate for missing data, cross-check conflicting cases, and validate trustworthy relationships. Interconnected Big Data resources can disclose inherent clusters and uncover hidden relationships and models.
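
A minimal sketch of that idea, assuming overlapping records for the same entity arriving from several sources, appears below: missing fields are filled from whichever source has them, and conflicting values are resolved by majority vote but kept visible for review.

    from collections import Counter

    def reconcile(records):
        # Merge partial, overlapping records describing the same entity. Missing
        # fields are filled from any source that has them; conflicting values are
        # resolved by majority vote and kept visible for later review.
        merged, conflicts = {}, {}
        fields = {f for record in records for f in record}
        for f in fields:
            values = [r[f] for r in records if r.get(f) is not None]
            if not values:
                continue
            counts = Counter(values)
            merged[f] = counts.most_common(1)[0][0]
            if len(counts) > 1:
                conflicts[f] = dict(counts)
        return merged, conflicts

    # reconcile([{"dob": "1970-01-01", "city": None},
    #            {"dob": "1970-01-01", "city": "Oslo"},
    #            {"dob": "1971-01-01", "city": "Oslo"}])
    # fills in the missing city, settles on "1970-01-01", and reports the dob conflict.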

Mining the data therefore requires integrated, cleaned, trustworthy, and efficiently accessible data, backed by declarative query and mining interfaces that feature scalable mining algorithms. All of this relies on Big Data computing environments that are able to handle the load. Furthermore, data mining can be used concurrently to improve the quality and trustworthiness of the data, expose the semantics behind the data, and provide intelligent querying functions.

Prevalent examples of introduced data errors can be readily found in the health care industry. As noted previously, it is not uncommon for real-world medical records to contain errors. Further complicating the situation is the fact that medical records are heterogeneous and are usually distributed across multiple systems. The result is a complex analytics environment that lacks any standard nomenclature to define its respective elements.

The value of Big Data analysis can be realized only if it can be applied robustly under those challenging conditions. However, the knowledge developed from that data can be used to correct errors and remove ambiguity. An example of the use of that corrective analysis is when a physician writes “DVT” as the diagnosis for a patient. This abbreviation is commonly used for both deep vein thrombosis and diverticulitis, two very different medical conditions. A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant.
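
A toy version of that disambiguation step is sketched below; the associated-term lists are invented for illustration and are not clinical guidance.

    # Score each expansion of an ambiguous abbreviation by how many of its
    # associated symptoms and medications appear in the patient's record.
    KNOWLEDGE_BASE = {
        "deep vein thrombosis": {"leg swelling", "warfarin", "heparin", "d-dimer"},
        "diverticulitis": {"abdominal pain", "fever", "ciprofloxacin", "metronidazole"},
    }

    def disambiguate(terms_in_record):
        # Pick the expansion whose associated terms best overlap the record.
        scores = {expansion: len(associated & terms_in_record)
                  for expansion, associated in KNOWLEDGE_BASE.items()}
        return max(scores, key=scores.get)

    # disambiguate({"leg swelling", "heparin"}) -> "deep vein thrombosis"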

It is easy to see how Big Data can enable the next generation of interactive data analysis, which, through automation, can deliver real-time answers. This means that machine intelligence can be used in the future to direct automatically generated queries toward Big Data, a key capability that will extend the value of data by automatically creating content for Web sites, populating hot lists or recommendations, and providing ad hoc analyses of a data set's value to decide whether to store or discard it.

Achieving that goal will require scaling complex query-processing techniques to terabytes of data while maintaining interactive response times, which remains a major challenge and an open research problem. Nevertheless, advances are made on a regular basis, and what is a problem today will undoubtedly be solved in the near future as processing power increases and data become more coherent.

Solving that problem will require eliminating the current lack of coordination between the database systems that host the data and provide SQL querying and the analytics packages that perform non-SQL processing such as data mining and statistical analysis. Today's analysts are impeded by a tedious cycle of exporting data from the database, performing a non-SQL process, and bringing the data back in. This is a major obstacle to achieving the kind of interactive automation delivered by the first generation of SQL-based OLAP systems. What is needed is a tight coupling between declarative query languages and the functions of Big Data analytics packages, which would benefit both the expressiveness and the performance of the analysis.
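
The sketch below illustrates the contrast with SQLite as a stand-in: first the export-and-process-outside round trip, then a user-defined aggregate registered with the engine so the analytic routine can be invoked from the declarative query itself. The table and values are invented for illustration.

    import sqlite3
    import statistics

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (patient_id TEXT, heart_rate REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)",
                     [("p1", 62), ("p1", 70), ("p2", 101), ("p2", 97)])

    # Loose coupling: export the data, analyze outside the database, bring results back.
    rows = conn.execute("SELECT heart_rate FROM readings WHERE patient_id = 'p1'").fetchall()
    external_mean = statistics.mean(r[0] for r in rows)

    # Tighter coupling: register the analytic routine with the engine so it can be
    # invoked directly from the declarative query (a SQLite user-defined aggregate).
    class StdDev:
        def __init__(self):
            self.values = []
        def step(self, value):
            self.values.append(value)
        def finalize(self):
            return statistics.pstdev(self.values) if self.values else None

    conn.create_aggregate("stdev", 1, StdDev)
    in_db = conn.execute(
        "SELECT patient_id, AVG(heart_rate), stdev(heart_rate) "
        "FROM readings GROUP BY patient_id"
    ).fetchall()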

One of the most important steps in processing Big Data is the interpretation of the data analyzed. That is where business decisions can be formed based on the contents of the data as they relate to a business process. The ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision maker, provided with the results of an analysis, has to interpret those results. Data interpretation cannot happen in a vacuum. For most scenarios, interpretation requires examining all of the assumptions and retracing the analysis process.

An important element of interpretation comes from the understanding that there are many possible sources of error, ranging from processing bugs to improper analysis assumptions to results based on erroneous data, a situation that prevents users from ceding authority entirely to an automated process run solely by the computer system. Proper interpretation requires that the user understand and verify the results produced by the computer. Nevertheless, the analytics platform should make that easy to do, which currently remains a challenge with Big Data because of its inherent complexity.

In most cases, crucial assumptions lie behind the recorded data that can taint the overall analysis, and those analyzing the data need to be aware of them. Since the analytical process involves multiple steps, assumptions can creep in at any point, making documentation and explanation of the process especially important to those interpreting the data. Ultimately that will lead to improved results and will introduce self-correction into the data process as those interpreting the data inform those writing the algorithms of their needs.

It is rarely enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived and what inputs it was based on. Such supplementary information is called the provenance of the data. By studying how best to acquire, store, and query provenance, in conjunction with using techniques to accumulate adequate metadata, we can create an infrastructure that provides users with the ability to interpret the analytical results and to repeat the analysis with different assumptions, parameters, or data sets.
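
One possible shape for such result provenance is sketched below: every derived result carries the step that produced it, the inputs it was based on, and the parameters used, so the analysis can be retraced or repeated under different assumptions. The structure is an assumption for illustration, not an established standard.

    from dataclasses import dataclass, field
    from typing import Any, Callable

    @dataclass
    class DerivedResult:
        value: Any
        step: str                                      # processing step that produced it
        inputs: list = field(default_factory=list)     # what the result was based on
        parameters: dict = field(default_factory=dict)

    def run_step(name, func, inputs, **parameters):
        # Execute one analysis step and attach its provenance to the result, so the
        # derivation can be retraced or repeated with different parameters.
        args = [i.value if isinstance(i, DerivedResult) else i for i in inputs]
        return DerivedResult(value=func(*args, **parameters),
                             step=name, inputs=inputs, parameters=parameters)

    # Re-running with a different assumption only requires changing the recorded
    # parameters, because every result knows exactly which inputs produced it.
    raw = DerivedResult(value=[3, 9, 4, 12], step="ingest")
    filtered = run_step("filter",
                        lambda xs, threshold: [x for x in xs if x < threshold],
                        [raw], threshold=10)
    # filtered.value == [3, 9, 4]; filtered.inputs[0] is raw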

Taken from: Big Data Analytics: Turning Big Data into Big Money
