Razi Shaikh

Introduction to Big Data

Big data is data that contains greater variety, arrives in increasing volumes, and moves with more velocity - also known as the three Vs.
Put simply, big data means larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can't manage them. But these massive volumes of data can be used to address business problems you wouldn't have been able to tackle before.
In short, big data describes data sets that exceed the size traditional tools can manage.
5 V's of Big Data:
1. Velocity - the speed at which data accumulates and is turned into insight
2. Volume - the scale of the data / the vast amount of data generated
3. Variety - the diversity of the data / data comes from different sources: machines, people, and processes, both outside and inside organizations
4. Veracity - the certainty of the data / the quality and origin of the data
5. Value - profit or any other kind of benefit derived from the data

5 V's with real-life examples:
1. Velocity - videos uploaded to YouTube every minute
2. Volume - data generated by mobile phones, laptops, and other devices
3. Variety - images, music, videos, text, etc.
4. Veracity - 1 in 3 business leaders don't trust their data
5. Value - profit and other benefits gained from the data

Who uses big data:
Recommendation engines and virtual personal assistants
Netflix used it to produce "House of Cards" - the right series at the right time
IoT devices - refrigerators, Apple Watches, coffee machines, etc.

Conversions:
1 bit = 1 or 0
8 bits = 1 byte
1,024 bytes = 1 kilobyte
1,024 kilobytes = 1 megabyte
1,024 megabytes = 1 gigabyte
1,024 gigabytes = 1 terabyte
1,024 terabytes = 1 petabyte
1,024 petabytes = 1 exabyte
1,024 exabytes = 1 zettabyte
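
As a quick illustration of these conversions, here is a minimal Python sketch (the function name and sample value are just for illustration) that walks a byte count up through the units listed above:

```python
# Minimal sketch: convert a raw byte count into a human-readable unit,
# using the 1,024-based steps listed above.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024  # move up to the next unit

# Example: 5 terabytes expressed in bytes
print(human_readable(5 * 1024**4))  # -> "5.00 TB"
```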

Three major sources of big data:
1. People-generated data
2. Machine-generated data
3. Business-generated data

Forms of data based on structure:
1. Structured - highly organized, well-defined data, typically represented in tabular format with rows and columns, e.g. Excel sheets, relational databases
2. Unstructured - data with no predefined format, estimated at around 80% of the data in the world, e.g. data from a multi-radio telescope
3. Semi-structured - a mix of structured and unstructured, e.g. XML, JSON (see the sketch below)
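
To make the distinction concrete, here is a small Python sketch (the record contents are made up) that reads semi-structured JSON; the labelled fields carry some structure, but nothing forces every record to have the same fields:

```python
import json

# A made-up semi-structured data set: each record has labelled fields,
# but there is no fixed schema - the second record lacks "phone" and adds "tags".
raw = """
[
  {"name": "Asha", "phone": "555-0101"},
  {"name": "Ravi", "tags": ["customer", "premium"]}
]
"""

records = json.loads(raw)
for record in records:
    # .get() tolerates missing fields, which is typical for semi-structured data
    print(record["name"], record.get("phone", "no phone on file"))
```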

Key aspects of a big data platform:
1. Integration -> bringing the parts together into a whole; one platform to manage the data, e.g. Hadoop
2. Analysis -> analyzing the data, e.g. looking at consumer behaviour
3. Visualization -> displaying data in a simpler way (graphs, temperature shown on a map by region, etc.; see the sketch after this list)
4. Security & governance -> access control mechanisms, authentication, encryption, and auditing functionality to protect sensitive data; automated integration + visual context + agile governance
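
As a small illustration of the visualization point, here is a minimal matplotlib sketch (the region names and temperatures are made up) that plots temperature by region:

```python
import matplotlib.pyplot as plt

# Made-up sample data: average temperature (°C) by region
regions = ["North", "South", "East", "West"]
temperatures = [12.5, 28.3, 21.0, 17.8]

# A simple bar chart often makes a table of numbers readable at a glance
plt.bar(regions, temperatures)
plt.ylabel("Average temperature (°C)")
plt.title("Temperature by region")
plt.show()
```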

Data science is the process of cleaning, mining, and analyzing data to derive insights of value from it. The size of the data is less important than the insight.


Data science process:
Determine the problem -> Collect data -> Explore data -> Analyze data -> Storytelling: visualization + communication -> Take action
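
As a rough sketch of the "explore" and "analyze" steps (the file name sales.csv and the columns region and revenue are hypothetical), a typical flow with pandas might look like this:

```python
import pandas as pd

# Collect: load a hypothetical CSV of sales records
df = pd.read_csv("sales.csv")

# Explore: look at the shape and summary statistics
print(df.shape)
print(df.describe())

# Analyze: e.g. total revenue per region (assumes 'region' and 'revenue' columns exist)
revenue_by_region = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(revenue_by_region)

# Storytelling: a quick chart to communicate the result
revenue_by_region.plot(kind="bar", title="Revenue by region")
```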

Enhanced 360° view of the customer - a holistic approach that considers all available and meaningful data about the customer to derive better insights

Security/intelligence extension - analyzing data to increase security: improved threat prediction, increased customer confidence, and deterrence of potential threats

Operational analysis - analyzing machine data in real time, e.g. airplane systems, the personalized recommendations used by Walmart

Techniques for analyzing data - A/B testing, machine learning, natural language processing (NLP); see the sketch below for a simple A/B test.
Big data technologies - business intelligence, cloud computing, databases (structured + unstructured), e.g. Hadoop
Cloudera Data Science Workbench - hosted by IBM
Visualization - charts, graphs, and other displays
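
As a small illustration of A/B testing, here is a minimal sketch (the visitor and conversion counts are made up) of a two-proportion z-test comparing conversion rates for two page variants:

```python
import math

# Made-up numbers: conversions / visitors for each variant
conversions_a, visitors_a = 120, 2400   # variant A: 5.0% conversion
conversions_b, visitors_b = 156, 2400   # variant B: 6.5% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled proportion and standard error for the difference in proportions
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
print(f"A: {p_a:.3f}, B: {p_b:.3f}, z = {z:.2f}")

# |z| > 1.96 corresponds to significance at roughly the 5% level (two-sided)
if abs(z) > 1.96:
    print("Difference is statistically significant")
else:
    print("Not enough evidence of a real difference")
```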

Hadoop - an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules and written in Java. Its creator, Doug Cutting (then at Yahoo), named Hadoop after his son's toy elephant. The modules cover:
1. Storage
2. Resource management
3. Distributed processing
4. Common utilities

Rack - a collection of 30 or 40 nodes (computers) that are physically stored close together and connected to the same network switch.
Hadoop cluster - a collection of racks.

Apache Hadoop is a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel.
Hadoop provides a cost-effective storage solution for large data volumes with no format requirements.
MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop.
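
To give a feel for the MapReduce paradigm, here is a minimal sketch (plain Python, not Hadoop's actual API) of the classic word-count job: the map step emits (word, 1) pairs, the shuffle groups them by word, and the reduce step sums the counts:

```python
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In Hadoop itself, the map and reduce steps run in parallel across the nodes of the cluster, which is where the massive scalability comes from.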

Hadoop strategy:
1. Choosing recommended distributions
2. Maturing the environment with modernized hybrid architectures
3. Adopting a data lake strategy based on Hadoop technology
"Data lakes are a method of storing data that keeps vast amounts of raw data in its native format and scales more horizontally to support the analysis of originally disparate sources of data."

Data warehouses in big data - deliver deep insight with advanced in-database analytics and operational analytics, and provide online analytical processing (OLAP).

Data warehouse Modernization:

(diagram omitted)
