Today, Big Data refers to massive amounts of structured, semi-structured, and unstructured data, commonly characterized by the four V's: Volume, Velocity, Variability, and Variety. Volume is the easiest to grasp: data volumes so massive that they exceed the physical limits of vertical scalability. Velocity describes the speed at which data arrives in an organization; an RSS feed, for example, can accumulate massive amounts of data very quickly. Variability means that the same data can take on different meanings depending on the context in which it was captured. Finally, Variety refers to the many different data formats in the industry and the challenge of handling those formats and making meaning out of the data.
How do organizations handle Big Data?
With massive amounts of data being generated in today's society, organizations have concerns about ingesting the data, storing the large volumes, building analytics, and providing visualization, because few organizations are used to handling data at this scale. The days of measuring volumes only in terabytes and petabytes are passing; we are heading toward the yottabyte, a unit of information equal to one septillion bytes (one quadrillion gigabytes), per Wikipedia. Traditional data warehousing methods cannot efficiently move and store these large volumes, nor process unstructured data. This is where the MapReduce programming model comes into play, handling the processing of large amounts of data.
MapReduce is a programming model and library that lets a user write code that can easily be split up and processed across many machines. Each job is divided into two parts: a Map and a Reduce. The Map job takes an input, splits it into sub-parts, and sends the sub-parts to different machines for processing. The Reduce job takes all the sub-parts and combines them into an answer. In practice, the input, such as a list of rows, is split across different machines for processing, and each machine emits a list of intermediate key/value pairs. The MapReduce library then groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function ingests an intermediate key and the values for that key and merges those values into a single result. Here is a simple example of how it works:
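A minimal word-count sketch in Python can illustrate the two phases (illustrative only; a real MapReduce framework such as Hadoop distributes these calls across a cluster):

```python
from collections import defaultdict

def map_phase(row):
    """Map: split one input row into intermediate (word, 1) pairs."""
    return [(word, 1) for word in row.split()]

def shuffle(pairs):
    """Group all intermediate values by key, as the MapReduce library does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: merge the values for one key into a single result."""
    return key, sum(values)

rows = ["big data is big", "data is everywhere"]
intermediate = [pair for row in rows for pair in map_phase(row)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Each row could be mapped on a different machine; only the grouped intermediate pairs need to travel to the reducers.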
As you can see, the MapReduce model can significantly improve processing against large volumes of data. Because duplicate records are collapsed during the Reduce phase, the resulting data sets are significantly smaller and query response times improve. MapReduce processing can break data down to its smallest forms for storage and become very complex; this was a deliberately simple example.
Big Data Storage
Platforms such as Teradata and Greenplum, which combine MPP architectures with MapReduce, have integrated analytics engines in which the processing happens within the database, making them another distinctive offering for Big Data. Because the analytics co-exist next to the data, there is no need to move massive datasets to a separate analytics engine, a step that could significantly degrade performance; instead, analytic processing performance improves significantly.
With technologies in place to handle the ingestion, storage, and processing of Big Data, the industry still needs to address several issues before realizing its full potential. As we capture more and more data, data policies must be applied.
These policy issues are increasingly important in the areas of security, privacy, intellectual property, and liability. The most common concern is the security of sensitive personal information that should be kept private. Data breaches can expose consumers' personal information, corporate confidential information, and even national security information to the wrong hands. Therefore, Big Data software and appliance vendors such as Oracle, with its Exadata platform, build mechanisms into their technologies to mitigate the risk of breaches. For example, Exadata provides encryption/masking, access control, auditing/tracking, and monitoring/blocking. This suite of tools isolates each data application behind a firewall-like shield to prevent a compromised admin account from being used to steal data, controls privileged database users' access to application data to prevent insider attacks, and monitors database activity for SQL injection. These protection mechanisms let organizations customize their own security parameters at any level of the application, keeping the risk of vulnerability low.
Other policy issues driven by governance concern data privacy, intellectual property, and liability. In the healthcare industry, access to people's health records can deliver significant human benefits by pinpointing the right medical treatment for an individual. Similarly, access to a person's financial records can help match consumers with the right financial instruments. Making this data available for research and use by industries such as healthcare and finance would be enormously valuable, but it comes with underlying issues. This leads to intellectual property and liability, where many legal issues could arise if the governance of this information is not handled properly. Questions will arise such as, "Who owns the dataset, and what rights come with the data?" and "Who is responsible if the use of the data leads to a negative consequence?" These questions will have to be answered before use to protect both the owner and the consumer of the data.
Big Data Analytics & Visualization
With Hadoop, Greenplum, Teradata, and Exadata among the players that have modeled data storage and processing into stable, consistent products, attention turns to analytical processing and visualization. As the Big Data industry continues to evolve, the need for analytics will grow exponentially. Some vendors, such as Teradata and Oracle, have addressed this need. Teradata's nCluster analytics are included in the data tier of the database. This provides a huge advantage because database features such as fault tolerance and workload management apply equally to data management and to the analytical applications. The product supports SQL as well as most standard programming languages, such as R, Java, and C++, for analytic processing, and SAS ships with the base product. Teradata's "next generation analytics" include trend analysis, fraud detection, network security analysis, consumer behavior, and portfolio analytics. Another flavor of the Big Data analytics toolset is Oracle Exalytics, an in-memory system designed for high-performance analysis, planning, and modeling to support Business Intelligence and Enterprise Performance Management applications. It includes Oracle Business Intelligence for visualization and ad hoc reporting, the Oracle TimesTen In-Memory Database for rapid query response, and Oracle Essbase for multi-dimensional analysis. Exalytics uses columnar compression to reduce the memory footprint and speed up query response times, and its analytic engine can run directly on compressed data, eliminating the need to decompress, compute analytics, and then recompress for storage. These options suit organizations that need to query large data volumes and get results in a timely manner, irrespective of the language (e.g., NoSQL, SQL, or Java).
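The idea of computing directly on compressed data can be sketched with a toy dictionary-encoding example in Python (a simplified illustration; the actual columnar formats used by products like Exalytics are proprietary and more sophisticated):

```python
def dictionary_encode(column):
    """Replace repeated string values with small integer codes,
    shrinking the in-memory footprint of a column."""
    dictionary = {}
    codes = []
    for value in column:
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes

# Hypothetical "region" column with heavy repetition, typical of analytics data.
region = ["EAST", "WEST", "EAST", "EAST", "WEST"]
dictionary, codes = dictionary_encode(region)

# Count rows for one region directly on the compressed codes --
# no decompression back to strings is needed.
east_code = dictionary["EAST"]
east_rows = sum(1 for c in codes if c == east_code)
print(east_rows)  # 3
```

Because the query compares small integer codes rather than full values, the engine scans less memory and skips the decompress/compute/recompress cycle entirely.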
Organizations looking to develop a Big Data strategy should define a roadmap that addresses the demand for large volumes of data arriving in different formats. Addressing and deciding on the following areas will help ensure the strategy's success:
- Data Policies (security, privacy, intellectual property, and liability)
- Technologies to be used for storing, processing, analyzing, and visualizing large amounts of data
- Organizational changes in how you look at Big Data, including the ability to staff people who understand Big Data and how to use it
- Assessment of the data that will be used to foster opportunities or drive new business analysis techniques
At Unissant, we understand the challenges and complexities involving technology, business, and the value of moving to a Big Data solution. We help our customers weigh the benefits and drawbacks of moving to a Big Data environment through continuous dialogue and shared experience. Big Data is not necessary for every organization, and with our experience implementing these environments, we are looked upon as a trusted advisor by clients considering Big Data initiatives. We have spent many years developing proprietary frameworks around Data Governance, Data Quality, Master Data Management, Business Intelligence, Information Security, Data Classification, and Metadata. Unissant uses the latest Big Data technologies and teams with major vendors such as Oracle, Greenplum, Teradata, and Karmasphere to provide our customers the "right" environment for their business needs.