What is Big Data?
According to Wikipedia,
“Big data” is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
Another definition, from 2016, states that “Big data represents the information assets characterized by such a high volume, velocity, and variety to require specific technology and analytical methods for its transformation into value”. Later, a fourth property, veracity, was added.
What qualifies as being “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.” [Introduction to Big Data](radar.oreilly.com/r2/release2-0-11.html).
For many organizations, big data is a cost-free byproduct of digital interaction (the digital footprint).
Volume is the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all. The volume that can be kept depends directly on the capacity of our hard drives, which has been growing at tremendous speed and lets the stored volume increase day by day.
“The density of hard drives increases by a factor of 1,000 every 10.5 years (doubling every 13 months)”. This observation, known as Kryder’s law, is named after Mark Kryder, a distinguished scientist in electrical engineering and physics who was Seagate Corp.’s senior vice president of research and chief technology officer.
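The two figures in the quote can be checked against each other with a little arithmetic. The sketch below assumes constant exponential growth over the whole period; the function name `doubling_time_months` is made up for this illustration.

```python
import math


def doubling_time_months(growth_factor: float, period_years: float) -> float:
    """Months per doubling, assuming constant exponential growth.

    growth_factor: total growth over the period (1000 means "1,000 times").
    period_years:  length of the period in years.
    """
    doublings = math.log2(growth_factor)  # how many doublings fit in the period
    return period_years * 12 / doublings


months = doubling_time_months(1000, 10.5)
print(f"{months:.1f} months per doubling")  # 12.6, which rounds to the quoted 13
```

A factor of 1,000 is just under ten doublings, so 126 months divided by roughly ten gives the quoted figure of about 13 months.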
Variety is the type and nature of the data. Knowing it helps the people who analyze the data to use the resulting insight effectively. Big data draws from text, images, audio, and video, and it fills in missing pieces through data fusion.
Velocity is the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity are relevant to big data: the frequency of generation, and the frequency of handling, recording, and publishing.
Veracity is the extension to the definition of big data; it refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of the analysis.
Types of data
Data can be structured, semi-structured, or unstructured. In most cases the data is unstructured, because it is collected from multiple sources that do not share the same structure. It can also be historical or real-time, depending on the company’s requirements, but all data becomes historical at some point.
Capital markets and exchanges produce a large amount of time-series market data, including quotes and trade data. Multiple exchanges write more than a terabyte per day.
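To make the three categories concrete, here is a small sketch that uses market data as the running example. All symbols, prices, and the news sentence are invented for illustration; only the Python standard library is used.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema (e.g. trade records).
structured = "symbol,price,quantity\nABC,101.5,200\nXYZ,47.25,1000\n"
trades = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing records whose fields may differ (e.g. JSON quotes).
semi_structured = (
    '[{"symbol": "ABC", "bid": 101.4, "ask": 101.6},'
    ' {"symbol": "XYZ", "bid": 47.2, "venue": {"name": "EX1"}}]'
)
quotes = json.loads(semi_structured)

# Unstructured: free text with no schema at all (e.g. a news headline).
unstructured = "ABC shares rose after the company reported strong quarterly results."

# Structured data can be queried by column name right away.
notional = sum(float(t["price"]) * int(t["quantity"]) for t in trades)

# Semi-structured data needs a check for fields that may be missing.
asks = [q.get("ask") for q in quotes]

# Unstructured data needs some extraction step before it is usable;
# here it is a naive word match against the known symbols.
mentions = [t["symbol"] for t in trades if t["symbol"] in unstructured.split()]

print(notional, asks, mentions)
```

The further down the list we go, the more work is needed before the data can be analyzed, which is one reason unstructured sources dominate the effort in many pipelines.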
From a technological perspective, Big Data involves several parts.
Streaming and Processing: This layer is a bridge between data senders and data processing/storage systems. Incoming messages are received, processed, and then passed on or stored. Well-known tools for building such pipelines are Kafka, Pulsar, Storm, Flink, Spark, and Samza, all of which are open source. Other popular solutions include AWS Kinesis and Google Cloud Pub/Sub.
Storage: Once we have the data, we need to store it across distributed systems. Because the data does not have a fixed structure, NoSQL databases such as CouchDB, HBase, Cassandra, and MongoDB are preferred.
Security: Apache Sentry, Apache Ranger, and Apache Knox are some of the projects that provide authentication and authorization for Hadoop-based systems. Cloud providers also have their own security and event management systems.
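The receive, process, and store flow of the streaming layer can be sketched in a few lines. This is only an in-process stand-in built on Python's standard `queue` and `threading` modules; it does not use the API of Kafka or any other tool named above, and the message fields (`symbol`, `price`, `qty`) are assumptions made for the example.

```python
import queue
import threading


def producer(out: queue.Queue, messages: list) -> None:
    """Data sender: pushes messages onto the queue, then a stop marker."""
    for message in messages:
        out.put(message)
    out.put(None)  # sentinel telling the consumer that the stream has ended


def consumer(inp: queue.Queue, sink: list) -> None:
    """Processing side: receives each message, transforms it, stores the result."""
    while True:
        message = inp.get()
        if message is None:
            break
        sink.append({
            "symbol": message["symbol"],
            "notional": message["price"] * message["qty"],
        })


# The bounded queue plays the role of the broker between sender and processor.
broker = queue.Queue(maxsize=100)
stored = []  # stands in for the storage system
ticks = [
    {"symbol": "ABC", "price": 10.0, "qty": 3},
    {"symbol": "XYZ", "price": 2.5, "qty": 4},
]

sender = threading.Thread(target=producer, args=(broker, ticks))
worker = threading.Thread(target=consumer, args=(broker, stored))
sender.start()
worker.start()
sender.join()
worker.join()

print(stored)
```

The bounded queue gives a crude form of back-pressure: a fast sender blocks once 100 messages are waiting. Real brokers add durability, partitioning, and replay on top of the same basic idea.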
Although Big Data has been around for a decade, it still lacks the technology and standards it needs to evolve. For example, there is no standard API or interface for changing the storage mechanism, and there is no standard API for querying the databases comparable to the SQL standard; each database has its own API. The same is true for streaming, metadata, deployment, security and governance, and integration with other applications such as machine learning and AI.
Every provider is trying to build its own unique solution, which results in competition rather than shared standards. All these drawbacks limit development in Big Data.
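In the absence of a standard, teams often write their own thin interface so that a change of storage backend does not ripple through the application. The sketch below shows the idea under heavy assumptions: `RecordStore`, `DictStore`, `AppendLogStore`, and `save_trade` are all hypothetical names, and both backends are toy in-memory classes rather than wrappers around any real database.

```python
from abc import ABC, abstractmethod
from typing import Optional


class RecordStore(ABC):
    """Hypothetical minimal storage interface the application codes against."""

    @abstractmethod
    def put(self, key: str, record: dict) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[dict]: ...


class DictStore(RecordStore):
    """Toy backend that keeps only the latest record per key."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, key: str, record: dict) -> None:
        self._rows[key] = dict(record)

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


class AppendLogStore(RecordStore):
    """Toy backend that keeps every write in an append-only log."""

    def __init__(self) -> None:
        self._log = []

    def put(self, key: str, record: dict) -> None:
        self._log.append((key, dict(record)))

    def get(self, key: str) -> Optional[dict]:
        # Scan from the newest entry so that the latest write wins.
        for logged_key, record in reversed(self._log):
            if logged_key == key:
                return record
        return None


def save_trade(store: RecordStore, trade_id: str, trade: dict) -> Optional[dict]:
    """Application code: unaware of which backend sits behind the interface."""
    store.put(trade_id, trade)
    return store.get(trade_id)


for backend in (DictStore(), AppendLogStore()):
    print(type(backend).__name__, save_trade(backend, "t1", {"symbol": "ABC", "qty": 5}))
```

The cost of this approach is that every team maintains its own adapters, which is exactly the duplication a shared standard would remove.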
Many earlier studies have also used Google Trends data. In 2013, a study titled “Quantifying Trading Behavior in Financial Markets Using Google Trends” was published in Nature’s Scientific Reports. The authors analyzed 98 different terms related to the stock market and tried to find a pattern between search queries and the overall direction of traders’ decisions.
Nick Bilton (26 April 2013). “Google Search Terms can predict Stock Market, Study Finds”. The New York Times. Retrieved 9 August 2013.
Another pattern was related to tax policy and ADRs (American Depositary Receipts). Non-American companies holding the stock had to pay tax twice on the dividend. To avoid this, many companies sold their holdings before the ex-dividend day and bought them back afterwards. More such trading patterns can be found by analyzing historical transaction data.
Real-time credit scoring is another useful application of Big Data. Ying Wang, Siming Li, and Zhangxi Lin examined the potential of non-banking parameters to affect credit scores. In the general process for any loan or credit request, an applicant submits their details to the bank. The bank gets more information from other sources, and an experienced specialist processes the application to decide the interest rate and other loan parameters. However, the central store holding the customer data is updated only on a monthly basis, and some non-financial factors could help to assess the risk of a default. The authors analyzed several factors such as login frequency, additional contact data, volume of transactions in the last month, number of successful transactions overall, the client’s business sector, and so on.
Y. Wang, S. Li, and Z. Lin, “Revealing Key Non-Financial Factors for Online Credit-Scoring in e-Financing”, Proc. 10th Int’l Conf. Service Systems and Service Management (ICSSSM), 2013, pp. 547–552
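To give a feel for how such non-financial factors could feed a score, here is a minimal sketch of a logistic scoring function. It is not the model from the cited paper: the feature names are loosely based on the factors listed above, and the weights and bias are invented purely for illustration.

```python
import math

# Hypothetical weights: negative values mean the factor lowers the estimated risk.
# These numbers are made up for the example and come from no real model.
WEIGHTS = {
    "logins_per_week": -0.10,
    "has_extra_contact": -0.50,
    "tx_volume_last_month": -0.02,
    "successful_tx_total": -0.01,
}
BIAS = 0.5  # also an assumption


def default_risk(features: dict) -> float:
    """Return a value between 0 and 1; higher means a riskier applicant."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


active = {
    "logins_per_week": 10,
    "has_extra_contact": 1,
    "tx_volume_last_month": 40,
    "successful_tx_total": 120,
}
dormant = {name: 0 for name in WEIGHTS}

print(f"active applicant:  {default_risk(active):.3f}")
print(f"dormant applicant: {default_risk(dormant):.3f}")
```

Because the inputs are behavioral signals that change daily, a score like this can be recomputed whenever new activity arrives instead of waiting for the monthly update of the central store. In practice the weights would be fitted to historical default data rather than chosen by hand.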