Big Data Technology: What, Why, Where, and How
The digital world means tons of apps, websites, social media pages, and more. What value does it generate? Leisure, pleasure, ease, and automation. But the most elemental thing it generates is data.
What is data? It is anything in digital form, be it a fact, a figure, or a statistic. It can be a picture of your beloved dog, an entire movie, or your identification details. Each of these is data in one form or another. When data grows large enough that it becomes difficult to manage, it can be called big data. But how do we define it? And what do we do with it? Let’s dive in with 3 W’s and 1 H: What, Why, Where, and How.
What is Big Data?
Data can be considered big when it strains conventional tools along one or more of the 4 V’s. What are those 4 V’s?
- Volume: Simply the size of the data.
- Velocity: The speed with which data is generated.
- Variety: The different types of data, such as images, text files, videos, GIFs, and more.
- Veracity: How trustworthy and accurate is the data? Can it be relied on?
Why use Big Data?
You might have noticed how we instantly recognize a song, how some people sense that it is about to rain, or how we can almost tell when a goal is coming in football. Why is that? Because of patterns. Our brain can detect, learn, and recognize patterns, given sufficient data points. The same goes for data analysis and ML/AI technologies: they work on patterns.
The more data we have, the more patterns we can find, the more statistics and models we can build, and the better and more precise the predictions become. A basic example is YouTube recommendations. Even if you log in with a new account and watch some videos for a day, YouTube can detect patterns in what you watch, place you in a group of viewers with similar tastes, and recommend videos accordingly, because it already holds a large watch history from many people whose tastes resemble yours.
Data patterns can help predict what should be kept on a Walmart shelf in certain weather, what the temperature could be for the next 20 days, or what an Instagram user would like to see in their feed. Patterns can even indicate what disease a person living in a particular area or from a particular gene pool might develop in the future. All this is possible with Big Data.
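The idea of grouping a new viewer with similar viewers can be sketched in a few lines of Python. This is only an illustration, not how YouTube actually works: the `history` data, the user names, and the `recommend` helper are all made up for this example, and cosine similarity is just one common way to compare viewing habits.

```python
from math import sqrt

# Hypothetical watch history: user -> {video category: number of videos watched}.
history = {
    "alice": {"cooking": 5, "travel": 3, "gardening": 2},
    "bob": {"cooking": 4, "travel": 2, "baking": 6},
    "carol": {"gaming": 7, "esports": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(new_user_history, history):
    """Find the most similar known user and suggest what they watch that is new."""
    best_user = max(
        history, key=lambda u: cosine_similarity(new_user_history, history[u])
    )
    candidates = history[best_user]
    unseen = [c for c in candidates if c not in new_user_history]
    # Most-watched categories of the similar user come first.
    return best_user, sorted(unseen, key=lambda c: candidates[c], reverse=True)

# A brand-new account with only one day of viewing.
new_user = {"cooking": 2, "travel": 1}
nearest, suggestions = recommend(new_user, history)
print(nearest, suggestions)
```

With more users and more history, the nearest match becomes more reliable, which is the point of the paragraph above: more data gives the pattern matching more to work with.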
Where is Big Data Used?
It has many use cases. A few of them are listed below.
- Natural Language Processing (NLP), an ML/AI technology that lets computers read and understand human text and language.
- Image Detection and Classification, another ML/AI technology, can detect faces, objects, sizes, and borders in images and videos. It is used for face detection, robotic vision, fault detection in industry, etc.
- Weather Analysis, reading weather patterns to issue meteorological warnings to fishermen, farmers, ships, airplanes, and the general public.
- Traffic Optimization, finding the fastest routes and traffic bottlenecks to ensure a smooth driving experience.
- Crime Prediction and Prevention, where real-time intelligence models based on CCTV feeds across a city raise an alert wherever and whenever required.
- Ad Targeting, making ads reach the right audience with the help of data. The recent use of deepfakes in the Indian ad industry by companies like Zomato and Cadbury to show localized versions of ads is an interesting use case.
How is Big Data Implemented?
There are several tools and technologies already available in the market. I will list a few.
Data Storage:
- Data Lakes, built on cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage buckets, or on Apache Hadoop HDFS.
- Data Warehouses, like Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse Analytics, and IBM Db2.
- Databases, like Amazon DynamoDB, Azure Cosmos DB, Oracle, MongoDB, and MySQL.
Data Transformation:
Tools like Apache Spark, Hadoop, Talend, Informatica PowerCenter, Dataiku (which can work on top of Snowflake), etc.
Data Streaming:
Apache Kafka, Google Cloud Dataflow, Amazon Kinesis, Stream SQL, Spark Streaming, etc.
Data Visualization:
Tableau, QlikView, Microsoft Power BI, Oracle Visual Analyzer, Databricks Visualization, Pandas with Matplotlib, etc.
Bonus Topic: The Data Lake
What is a Data Lake?
A Data Lake is a large storage space designed to store and secure a great variety and an immense amount of data in its native format. This data can be of any size and scale, and can be structured, unstructured, or semi-structured.
How is a Data Lake different from a Data Warehouse?
A Data Warehouse can hold data at a large scale, but the data needs to be in a structured format, in rows and columns. The data in a warehouse is expected to be analyzed later, so it is kept in a format suited to quick analysis.
A Data Lake, on the other hand, acts as a data dump: it holds all the data, but in its native format. When some ad-hoc analysis is required, a transformation can be run and the data made ready on demand.
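The difference can be sketched with a toy example. Here the "lake" is just a list of raw JSON strings and the "warehouse" is a list of fixed-shape tuples. The records, the column names, the `to_warehouse_row` helper, and the default currency are all assumptions made for this illustration, not part of any real product.

```python
import json

# Hypothetical raw events as they might land in a data lake: one JSON string per
# record, kept in native form, with fields that vary from record to record.
lake = [
    '{"user": "u1", "amount": "19.99", "currency": "USD"}',
    '{"user": "u2", "amount": 5, "note": "gift card"}',
    '{"user": "u3", "clicked": "banner-7"}',
]

# A warehouse-style table has a fixed set of columns decided up front.
WAREHOUSE_COLUMNS = ("user", "amount", "currency")

def to_warehouse_row(raw):
    """Transform one raw record into a fixed-schema row, or None if it does not fit."""
    record = json.loads(raw)
    if "amount" not in record:
        return None  # not a purchase event; it stays in the lake only
    return (
        record["user"],
        float(record["amount"]),         # normalize strings and ints to float
        record.get("currency", "USD"),   # assumed default for this sketch
    )

# The transformation runs on demand, when an analysis actually needs the data.
warehouse_rows = [row for row in map(to_warehouse_row, lake) if row is not None]

print(WAREHOUSE_COLUMNS)
print(warehouse_rows)
```

The lake keeps all three records untouched, including the click event that has no place in the purchase table. Only the records that fit the agreed columns are cleaned and loaded into the structured side.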
How is a Database different from a Data Warehouse?
A database is used for operational purposes, like bank transactions, customer purchases, etc. It captures and maintains the current data and is updated with every transaction. A database is usually OLTP-type (online transaction processing).
A Data Warehouse is more likely to be used as a historical data store that accumulates multiple years of history and is used mainly for analysis. The data is organized for that purpose, e.g. the tables are modeled for analysis rather than mirroring the transactional tables. It is typically updated in batch mode by a scheduled process. A data warehouse is usually OLAP-type (online analytical processing).
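A small sketch using Python's built-in `sqlite3` module can make the two update styles concrete. SQLite stands in for both systems here purely for illustration. The `purchases` and `daily_sales` tables and the two helper functions are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational (OLTP-style) table: one row per transaction, written as it happens.
conn.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, customer TEXT, day TEXT, amount REAL)"
)

def record_purchase(customer, day, amount):
    """Insert a single purchase and commit immediately, one transaction at a time."""
    with conn:
        conn.execute(
            "INSERT INTO purchases (customer, day, amount) VALUES (?, ?, ?)",
            (customer, day, amount),
        )

record_purchase("u1", "2024-01-01", 20.0)
record_purchase("u2", "2024-01-01", 5.0)
record_purchase("u1", "2024-01-02", 12.5)

# Analytical (OLAP-style) table: shaped for analysis, refreshed by a batch job.
conn.execute(
    "CREATE TABLE daily_sales (day TEXT PRIMARY KEY, orders INTEGER, revenue REAL)"
)

def refresh_daily_sales():
    """Batch load: rebuild the summary table from the transactional data."""
    with conn:
        conn.execute("DELETE FROM daily_sales")
        conn.execute(
            "INSERT INTO daily_sales "
            "SELECT day, COUNT(*), SUM(amount) FROM purchases GROUP BY day"
        )

refresh_daily_sales()  # in practice this would run on a schedule
daily = conn.execute(
    "SELECT day, orders, revenue FROM daily_sales ORDER BY day"
).fetchall()
print(daily)
```

The `purchases` table changes with every sale, while `daily_sales` only changes when the batch refresh runs, and its shape (one row per day) is chosen for reporting rather than for recording individual transactions.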
That was all about Big Data and the types of data stores. I hope it made the concepts clearer.