What is Big Data?
According to Wikipedia, “Big data” is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
Another 2016 definition states that “Big data represents the information assets characterized by such a high volume, velocity, and variety to require specific technology and analytical methods for its transformation into value”. Later, a fourth property, veracity, was added.
Per O'Reilly's Introduction to Big Data book, what qualifies as “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target: “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
For many organizations, big data is often a cost-free byproduct of digital interaction (Digital footprint).
Applications of Big Data
- Portfolio Risk Management
- Algorithmic Trading Strategies
- Pre-trade analytics
- Post-trade processing
- Compliance and Surveillance
- Regulatory Reporting
Characteristics of Big Data
Volume
The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all. Volume is bounded by storage capacity, and hard-drive density has grown at a tremendous pace, allowing data volumes to increase day by day.
“The density of hard drives increases by a factor of 1,000 every 10.5 years (doubling every 13 months).” This observation was coined by Mark Kryder, a distinguished scientist in electrical engineering and physics who was Seagate Corp.'s senior vice president of research and chief technology officer.
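The two rates in the quote can be cross-checked with quick arithmetic: 10.5 years is 126 months, and doubling every 13 months over that span gives roughly a 1,000x increase (the figures are order-of-magnitude, so the match is approximate):

```python
# Sanity check on Kryder's observation: doubling every 13 months
# over 10.5 years (126 months).
months = 10.5 * 12           # 126 months
doublings = months / 13      # roughly 9.7 doublings
growth = 2 ** doublings      # roughly 827x, on the order of the quoted 1,000x
print(round(growth))
```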
Variety
The type and nature of the data. This helps analysts use the resulting insight effectively. Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
Velocity
The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time; compared to small data, it is produced more continually. Two kinds of velocity relate to big data: the frequency of generation and the frequency of handling, recording, and publishing.
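The gap between generation frequency and handling frequency can be sketched as events that arrive one at a time but are recorded in batches (a hypothetical illustration, not tied to any specific tool):

```python
from itertools import islice

def batches(stream, size):
    """Group a continuously generated stream into fixed-size handling batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Ticks are generated one at a time but handled/recorded five at a time.
ticks = range(12)
handled = list(batches(ticks, 5))
print(handled)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```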
Veracity
The extension to the original definition, veracity refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of analysis.
Types of Data
Data can be structured, semi-structured, or unstructured. In most cases the data is unstructured, because it is collected from multiple different sources that do not share the same structure. It can also be historical or real-time, depending on company requirements, though all data eventually becomes historical. Some examples:
- Structured data
  - Market data
  - Transaction data
  - Securities reference data
- Unstructured data
  - Corporate filings
  - Corporate fundamental data
  - Macro- and micro-economic indicators
  - Social media
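The three shapes of data above can be illustrated by how each is typically parsed in code (a minimal sketch; the sample records are made up):

```python
import csv, io, json

# Structured: fixed schema, e.g. a market-data CSV row.
row = next(csv.DictReader(io.StringIO("symbol,price\nAAPL,189.50")))

# Semi-structured: self-describing but flexible, e.g. a JSON corporate filing.
filing = json.loads('{"ticker": "AAPL", "form": "10-K", "items": ["risk", "mdna"]}')

# Unstructured: free text, e.g. a social-media post; structure must be inferred.
post = "Bullish on $AAPL after earnings!"
tickers = [w.strip("$!.,") for w in post.split() if w.startswith("$")]

print(row["price"], filing["form"], tickers)  # 189.50 10-K ['AAPL']
```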
From the technological perspective, Big Data involves multiple parts.
Streaming and Processing: This layer is the bridge between data senders and data processing/storage systems: incoming messages are received, processed, and passed on or stored. Popular open-source tools for building such pipelines include Kafka, Pulsar, Storm, Flink, Spark, MapReduce, Hive, and Samza. Popular managed alternatives include AWS Kinesis and Google Cloud Pub/Sub.
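The receive/process/store pattern these tools implement can be sketched in plain Python, with an in-memory queue standing in for a broker topic (an illustration only; real pipelines are distributed and fault-tolerant):

```python
import queue

inbound = queue.Queue()   # stands in for a broker topic (e.g. a Kafka partition)
store = []                # stands in for the storage layer

def process(message):
    """Transform a raw message before it is stored."""
    return {"symbol": message["symbol"], "price": round(message["price"], 2)}

# Producer side: senders publish raw messages.
for price in (101.239, 101.241):
    inbound.put({"symbol": "AAPL", "price": price})

# Consumer side: receive, process, store.
while not inbound.empty():
    store.append(process(inbound.get()))

print(store)
```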
Storage: Once we have data, we need to store it across distributed systems. Because the data often lacks a fixed structure, NoSQL databases such as CouchDB, HBase, Cassandra, and MongoDB are preferred. InfluxDB and kdb+ are very popular for storing large time-series datasets, and cloud-based options such as Google BigQuery and Amazon Timestream are also becoming popular.
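The core operations a time-series database like InfluxDB or kdb+ optimizes — appending in time order and querying by time range — can be modeled with a sorted list (a toy sketch, not how those engines are actually implemented):

```python
import bisect

class TinyTimeSeries:
    """Append-only time-series store with binary-search range queries."""
    def __init__(self):
        self.times, self.values = [], []

    def append(self, t, v):
        # Assumes points arrive in time order, as market feeds usually do.
        self.times.append(t)
        self.values.append(v)

    def range(self, start, end):
        """Return all values with start <= time <= end."""
        lo = bisect.bisect_left(self.times, start)
        hi = bisect.bisect_right(self.times, end)
        return self.values[lo:hi]

ts = TinyTimeSeries()
for t, price in [(1, 100.0), (2, 100.5), (3, 99.8), (4, 100.2)]:
    ts.append(t, price)
print(ts.range(2, 3))  # [100.5, 99.8]
```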
Security: Apache Sentry, Apache Ranger, and Apache Knox are some of the projects that provide authentication and authorization for Hadoop-based systems, while cloud providers also offer their own security and event-management services.
Analytics: Multiple applications are being built using time-series analysis, natural language processing, statistical analysis, regression, simulation for backtesting, neural networks (ANN, LSTM, CNN, etc.), and predictive modeling.
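Of the techniques listed, ordinary least-squares regression is the simplest to show concretely; below is a minimal fit of a trend line to a synthetic price series (illustrative data, standard closed-form slope/intercept):

```python
def linear_fit(xs, ys):
    """Ordinary least squares: returns (slope, intercept) for y ~ slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic daily closes trending upward by 1.0 per day.
days = [0, 1, 2, 3, 4]
closes = [100.0, 101.0, 102.0, 103.0, 104.0]
slope, intercept = linear_fit(days, closes)
print(slope, intercept)  # 1.0 100.0
```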