DEV Community

Muhammad Rameez


Data engineering for beginners (Part 01)

What is Data Engineering?

Data engineering is the practice of extracting, cleaning, and transforming data and delivering it to a data scientist, who can then use it to make predictions.

A data scientist's work starts where a data engineer's work ends.

Hahaha. Don't worry, in this blog I will explain each and every concept a beginner needs in data engineering. Buckle up, sit back, and relax.

What is a Database?

A database is an organized collection of data. The data can be structured, as rows and columns in tables, or kept in an unstructured format. We use a database to store data so that we can retrieve it later.
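To make "store and retrieve" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table and its rows are made up for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structured data: a table with named columns (hypothetical schema).
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Store some rows.
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ali", "Lahore"), ("Sara", "Karachi")],
)
conn.commit()

# Retrieve them later.
rows = conn.execute("SELECT name, city FROM users ORDER BY id").fetchall()
print(rows)
```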

So we use a database to store and retrieve data, but this approach struggles when we need a quick response. Because the database holds all the data, fetching what we need from it takes time, and that makes for a very bad user experience.

For example:
With a live dashboard, the user opens the dashboard to see their stats and has to wait ☹️

This is where the data warehouse comes in.

What is a Data Warehouse?

A data warehouse is similar to a database: it is also used to store data. The major difference is that a data warehouse is organized around a specific subject (niche). For example, if we want sales records, one data warehouse contains only sales records and nothing from other niches. It is also difficult to delete data from a data warehouse, because data is meant to stay there once loaded, whereas data in a database is easily deleted.

If the question "So is a data warehouse not good?" is going through your mind, read this blog: [Database vs Data Engineering]

Because a data warehouse only contains data related to a specific category, retrieving data from it is fast. That is why we use a data warehouse.

What is a Data Lake?

A data lake is a pond of data where the data is scattered and can be structured, semi-structured, or unstructured. The data engineer collects data from the data lake, transforms it into the required format, and loads it into the main stream or sends it to the data scientist, who can use it to perform calculations, build prediction models, and so on.
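As a rough illustration of the three kinds of data that can sit side by side in a lake, the sketch below writes one file of each kind into a temporary folder. The folder standing in for the lake, the file names, and the contents are all assumptions made for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

# A temporary folder stands in for the data lake (an assumption for this sketch).
lake = Path(tempfile.mkdtemp())

# Structured: a CSV file with a fixed set of columns.
with open(lake / "sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["order_id", "amount"], [1, 250], [2, 100]])

# Semi-structured: JSON whose fields can vary from record to record.
(lake / "events.json").write_text(json.dumps([{"user": "ali", "action": "login"}]))

# Unstructured: free text with no schema at all.
(lake / "review.txt").write_text("The delivery was quick and the product is great.")

# The lake keeps everything as-is; here we only list what landed there.
formats = sorted(p.suffix for p in lake.iterdir())
print(formats)
```

Nothing is cleaned or reshaped at this stage; that work happens later, when the data engineer pulls data out of the lake.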

What is ETL?

Extract, Transform, Load (ETL) is the technique a data engineer uses to gather data from different sources and in different formats, such as CSV, HTML, Excel, images, video, and plain text.
The data engineer extracts this data from the different sources, then transforms it according to the use case, and finally loads it into a machine learning model or sends it to the data scientist.

This concept is the backbone of data engineering.
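The three ETL steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the CSV text, the cleaning rules, and the `sales` table are invented for the example, and an in-memory SQLite database stands in for the real destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV export; in practice this would come from a file or an API.
RAW_CSV = """name,amount
 ali ,250
SARA,
omar,100
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, tidy names, convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        cleaned.append((row["name"].strip().title(), int(row["amount"])))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows to the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, amount FROM sales ORDER BY rowid").fetchall()
print(result)
```

Keeping each step in its own function makes it easy to swap the source or the destination without touching the cleaning logic.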

After the data has been transformed, we use tools to present it.

For dashboards, for example, we use BI tools such as Power BI to show the data in a live dashboard.

For further categorization and sorting of data, we use a data mart.

What is a Data Mart?

A data mart is a slice of a data warehouse where the categorized data for one specific need is stored, resulting in quick responses and fast analysis. In other words, it is a subset of a data warehouse that is focused on a single subject. A data mart is relatively small, typically on the order of gigabytes, while a data warehouse is more likely to hold hundreds of gigabytes or more. Data marts are often used to give users in an organization easy access to particular data.
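One way to picture "a subset focused on a single subject" is a small table carved out of a larger one. In the sketch below, a tiny SQLite table plays the role of the warehouse; the table names and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny stand-in for the warehouse: records from several departments.
conn.execute("CREATE TABLE warehouse (department TEXT, item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?)",
    [
        ("sales", "laptop", 1200),
        ("hr", "training", 300),
        ("sales", "mouse", 25),
        ("finance", "audit", 900),
    ],
)

# The data mart: only the sales subject, copied into its own small table.
conn.execute(
    "CREATE TABLE sales_mart AS "
    "SELECT item, amount FROM warehouse WHERE department = 'sales'"
)

mart_rows = conn.execute(
    "SELECT item, amount FROM sales_mart ORDER BY amount"
).fetchall()
print(mart_rows)
```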

Let's summarize with the help of an example:

Suppose we want to perform sentiment analysis on Twitter data. The data is in an unstructured format. We extract it from the different Twitter sources and load it into the data lake. After that, the data is cleaned and sent to the data warehouse, and then to the data mart, where we get data for one very specific category, which we then load into analysis software or BI tools. This model has two parts:
Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).
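The Twitter flow can be mimicked end to end with plain Python lists. Everything here is a toy: the tweets are made up, and the keyword rule is only a placeholder for a real sentiment model.

```python
# Made-up tweets standing in for data pulled from Twitter.
raw_tweets = [
    "I LOVE this phone!!",
    "worst update ever :(",
    "",
    "Love the new camera",
]

# 1. Data lake: keep everything exactly as it arrived.
data_lake = list(raw_tweets)

def toy_sentiment(text):
    """A toy keyword rule, for illustration only; real analysis uses a model."""
    if "love" in text:
        return "positive"
    if "worst" in text:
        return "negative"
    return "neutral"

# 2. Data warehouse: cleaned records with a consistent structure.
warehouse = [
    {"text": t.lower().strip(), "sentiment": toy_sentiment(t.lower())}
    for t in data_lake
    if t.strip()  # drop empty tweets during cleaning
]

# 3. Data mart: one very specific slice, here only the positive tweets.
positive_mart = [row for row in warehouse if row["sentiment"] == "positive"]

# 4. What a BI tool would chart: a simple count per sentiment.
counts = {}
for row in warehouse:
    counts[row["sentiment"]] = counts.get(row["sentiment"], 0) + 1
print(counts)
```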

Have you noticed how many pipeline stages the data passes through, from one phase to the next? Don't worry, let me introduce you to the lakehouse.

What is a Lakehouse?

A lakehouse is a combination of a data warehouse and a data lake. It can hold every type of data, including unstructured data, and it offers the features of a data warehouse, so analysis can be run on it directly. A data lakehouse offers the benefits of both a data warehouse and a data lake, including data agility, scalability, cost efficiency, and performance. It also provides the ability to process and analyze data in real time and at scale, enabling organizations to make better, faster decisions.

It is easier to manage, with less administration, less redundancy, and a smoother process.

OLTP vs OLAP:

OLTP (Online Transaction Processing) is a type of system used to manage and process data from transactions. It is typically used to record data from transactions such as sales, purchases, and payments. OLTP systems are optimized for fast, real-time data access and updates. (Transaction part)

OLAP (Online Analytical Processing) is a type of system used to analyze data from multiple perspectives. It is typically used to analyze data from multiple sources and generate reports to help make informed decisions. OLAP systems are optimized for complex queries and analysis. (Analysis Part)
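The difference in workload can be shown with one small table: OLTP-style code records individual transactions, while OLAP-style code reads many rows and summarizes them. This sketch uses a single in-memory SQLite database for both, purely for illustration; in practice the two workloads usually run on separate systems. The `payments` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
)

# OLTP-style work: small writes, each recording one transaction.
with conn:  # commits on success, rolls back if an error is raised
    conn.execute("INSERT INTO payments (customer, amount) VALUES (?, ?)", ("ali", 50))
    conn.execute("INSERT INTO payments (customer, amount) VALUES (?, ?)", ("sara", 70))
    conn.execute("INSERT INTO payments (customer, amount) VALUES (?, ?)", ("ali", 30))

# OLAP-style work: one read that scans and aggregates many rows.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```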

Because there is always a threat to data security and cost effectiveness, this issue is addressed with the help of data mesh, a decentralized approach to storing data.

Data Mesh:

Data mesh is an architecture that focuses on decentralizing data and empowering teams to make data-driven decisions. A data mesh is designed to make data accessible to multiple teams, departments, and applications, with the goal of providing a single source of truth. Decentralization means that data is not stored in a single place, which can improve data safety, security, and scalability, and can also reduce the risk of data loss. The data mesh architecture allows data to be stored, managed, and processed in distributed systems.
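A very rough way to picture this is a set of datasets, each owned by one team, plus a shared catalog that only records who owns what. The `DataProduct` class, the team names, and the records below are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A dataset owned and served by one domain team (hypothetical structure)."""
    domain: str
    owner: str
    records: list = field(default_factory=list)

# Each team stores and manages its own data; there is no central store.
sales = DataProduct("sales", "sales-team", [{"order": 1, "amount": 250}])
support = DataProduct("support", "support-team", [{"ticket": 7, "status": "open"}])

# A shared catalog only records which products exist and who owns them.
catalog = {p.domain: p for p in (sales, support)}

# Another team discovers the sales data through the catalog.
sales_total = sum(r["amount"] for r in catalog["sales"].records)
print(sorted(catalog), sales_total)
```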

These are the basic data engineering concepts that you must know as a beginner.

Moving towards part 02 (Intermediate Concepts)
