Each minute of every day on the Internet
➡Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute.
➡300 hours of video are uploaded to YouTube every minute! Almost 5 billion videos are watched on YouTube every single day.
➡300,000 photos and videos are shared on Instagram per minute.
➡Over 3.5 billion Google searches are conducted worldwide each day — roughly 2.4 million every minute.
➡Every minute on Facebook: 510,000 comments are posted, 350,000 statuses are updated, and 250,000 photos are uploaded.
With a bit of calculation, we can see how much data is created on the Internet each day. It's huge, and this data is what we call Big Data — but that's not all.
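To make "a bit of calculation" concrete, here is a quick back-of-the-envelope sketch that scales the approximate per-minute figures quoted above up to a full day:

```python
# Rough arithmetic on the approximate per-minute figures quoted above.
MINUTES_PER_DAY = 24 * 60  # 1,440

per_minute = {
    "tweets": 350_000,
    "hours of YouTube video uploaded": 300,
    "Instagram photos/videos shared": 300_000,
    "Facebook comments posted": 510_000,
}

for item, rate in per_minute.items():
    print(f"{item}: ~{rate * MINUTES_PER_DAY:,} per day")
```

Tweets alone, for example, work out to roughly half a billion per day at that rate.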
WHAT IS BIG DATA?
Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered.
It is simpler to understand through four basic characteristics — the 4 V's: Volume, Velocity, Variety, and Veracity.
Big Data as a Challenge
Big Data brings many benefits, but they come with an umbrella of problems.
Big data challenges include storing and analyzing large, rapidly growing, diverse data stores, then deciding precisely how to best handle that data.
- Data arrives at an exponential rate, and companies have to accept this incoming flow while processing it fast enough that it does not create bottlenecks.
- Data comes in many different forms, structured and unstructured. Before computation or research, it must be classified into various categories.
- Insights generated from this data must be both fast and precise.
- In order for organizations to capitalize on the opportunities offered by big data, they are going to have to do some things differently. And that sort of change can be tremendously difficult for large organizations.
- Security is also a big concern for organizations with big data stores. After all, some big data stores can be attractive targets for hackers or advanced persistent threats (APTs).
To deal with these challenges, companies turned to the concepts of distributed computing and parallel computing.
Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance.
Distributed Systems have the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve the data from multiple nodes. This improves the performance of sharing and storing files significantly.
It enables data users to receive more storage space when needed, and lets storage system operators scale the system up and down by adding or removing storage units from the cluster.
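One common way a distributed store decides which cluster node holds a piece of data is by hashing its key. Here is a minimal sketch of that idea (the node names are hypothetical, and real systems use consistent hashing to limit data movement when the cluster changes size):

```python
# Minimal sketch: map each key to a cluster node by hashing it.
import hashlib

def pick_node(key, nodes):
    """Deterministically map a key to one of the cluster nodes."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
print(pick_node("user:42", nodes))

# Scaling out: adding a node changes where many keys land, which is
# why production systems prefer consistent hashing, so that only a
# small fraction of keys must move when the cluster grows or shrinks.
nodes.append("node-4")
print(pick_node("user:42", nodes))
```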
To implement this, many new technologies were invented:

Hadoop :
Based on Google's MapReduce, Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel across many machines. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
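The MapReduce idea itself is simple enough to sketch in a single process: a map phase turns each record into (key, value) pairs, a shuffle groups the pairs by key, and a reduce phase combines each group. The classic word-count example, as a toy illustration (Hadoop would run the map and reduce functions in parallel across the cluster):

```python
# Single-process sketch of MapReduce word count.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the record.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Combine all counts emitted for the same word.
    return key, sum(values)

lines = ["big data is big", "data grows fast"]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'grows': 1, 'fast': 1}
```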
NoSQL Databases :
NoSQL systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel, high-performance data processing across a large number of commodity servers.
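A defining trait of many NoSQL stores is the flexible, schema-less record: documents in the same collection need not share the same fields. A toy illustration, with plain Python dicts standing in for a real document database:

```python
# Toy illustration of the schema-less document model:
# records in one collection can have different fields.
users = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ben", "followers": 1200, "tags": ["ml", "iot"]},
]

# "Query": find users whose optional "followers" field exceeds 1000.
popular = [u for u in users if u.get("followers", 0) > 1000]
print([u["name"] for u in popular])  # ['Ben']
```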
Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools.
The key to Spark's speed is that it is able to "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
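Part of what makes Spark fast is lazy evaluation: transformations only record a plan, and nothing executes until an action is called, which lets the engine optimize the whole pipeline and keep data in memory. A rough plain-Python imitation of that pattern (the `ToyRDD` class is an invented stand-in, not Spark's API):

```python
# Plain-Python imitation of Spark's lazy transformation / action split.
class ToyRDD:
    def __init__(self, data):
        self._data = data
        self._ops = []          # recorded transformations, not yet run

    def map(self, fn):
        self._ops.append(("map", fn))
        return self

    def filter(self, fn):
        self._ops.append(("filter", fn))
        return self

    def collect(self):          # the "action" that triggers execution
        result = iter(self._data)
        for kind, fn in self._ops:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return list(result)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark, the equivalent chain would also be distributed across the cluster's executors rather than run on one machine.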
Many other Apache technologies have also emerged, such as:
>Apache Hive is a data warehouse infrastructure that facilitates querying and analysis of large datasets residing in Apache Hadoop.
>HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
>Apache Kafka is a distributed data-streaming platform often used to feed Hadoop big data lakes. Kafka brokers support massive message streams for low-latency follow-up analysis in Hadoop or Spark.
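The core Kafka pattern can be sketched in-process: producers append records to a topic's log, and each consumer tracks its own read offset, so many consumers can independently replay the same stream. A toy analogue (the `produce`/`consume` helpers are invented for illustration, not Kafka's client API):

```python
# Toy in-process analogue of the Kafka producer/consumer pattern.
from collections import defaultdict

topics = defaultdict(list)   # topic name -> append-only log of records

def produce(topic, record):
    topics[topic].append(record)

def consume(topic, offset):
    """Return all records from `offset` onward plus the new offset."""
    log = topics[topic]
    return log[offset:], len(log)

produce("clicks", {"user": 1, "page": "/home"})
produce("clicks", {"user": 2, "page": "/docs"})

records, offset = consume("clicks", 0)
print(len(records), offset)  # 2 2
```

Because the log is append-only and offsets belong to the consumer, a second consumer calling `consume("clicks", 0)` would see the same two records independently.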
Thanks for Reading!
Hope it helps you gain more insight into Big Data!