Bob Otieno Okech

A Comprehensive Guide to Setting Up a Data Engineering Project Environment.

In today’s fast-paced business world, data is generated every second. This data holds the potential to provide valuable insights and drive decisions, yet much of it goes unused. Managing data from multiple sources manually can be time-consuming and error-prone. That’s why organizations need an efficient system to gather data from various sources, transform it to align with business needs, and store it in a central location for analysis. This is the role of a data platform.

A data platform serves as a unified hub for collecting, processing, and storing an organization’s data. It ensures that business users can access accurate, consistent, and actionable data to inform their strategies and decisions.

In this guide, we'll walk step by step through building your first data engineering environment using a selection of powerful tools and services.

Tools and services needed

To get started, you’ll need the following:

  1. AWS S3 Bucket – For storing raw and processed data.
  2. PostgreSQL Database – For structured data storage.
  3. DBeaver – A database management tool for querying and managing databases.
  4. Python 3 – For automation, data transformation, and integration tasks.

Setting Up Your First S3 Bucket on AWS

To create an S3 bucket, you'll need an authenticated AWS account. If you don't have one, start by signing up here.

Step 1: Access the S3 Service

Once logged in, use the AWS search bar to search for S3. Select the service from the search results.

Search for s3

Step 2: Create a Bucket

You’ll be taken to the S3 dashboard. Click the Create Bucket button.

Create a bucket

Step 3: Name Your Bucket

Provide a unique name for your bucket, keeping AWS naming conventions in mind. Once you've entered the required details, click Create Bucket.

That’s it! Your S3 bucket is now ready to store files.

Step 4: Upload a file

Navigate to the upload section and drag and drop a CSV file. Once the file has been uploaded, the upload status will read Succeeded, as shown below.

s3 bucket uploads

Connect Python with S3

To connect Python to your S3 bucket, you'll need the Boto3 library.

Step 1: Install boto3 on your PC

If boto3 is not already installed, run the following command in your terminal:

pip install boto3

Step 2: Create an IAM user to access the S3 bucket

Next, go to IAM in your AWS console and create a new user to access the bucket. Search for IAM in the search bar, navigate to the IAM management console, and click Users in the navigation pane.

IAM mgt console

When setting up the user's permissions, make sure you assign the AmazonS3FullAccess policy, as shown below.

Assign permission to the console

Click Create user, fill in the required information, and then generate an access key and secret access key for the user.

Step 3: Connect to S3 using Python

Within your Python environment, run the following code:

# Import the boto3 library
import boto3

# Connection parameters for the IAM user created above
connection_params = {
    "region_name": "<your-region>",
    "aws_access_key_id": "<your-access-key-id>",
    "aws_secret_access_key": "<your-secret-access-key>",
}

# Create an S3 resource using the connection parameters
s3 = boto3.resource(
    service_name="s3",
    region_name=connection_params["region_name"],
    aws_access_key_id=connection_params["aws_access_key_id"],
    aws_secret_access_key=connection_params["aws_secret_access_key"],
)

# Print the name of every bucket available to this S3 resource
for bucket in s3.buckets.all():
    print(bucket.name)

Since the service we are accessing is S3, the service_name parameter should be "s3". Fill in your console's region_name, and use the access key and secret key you generated to authenticate.
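Rather than hardcoding the keys in your script, you can build `connection_params` from environment variables. A small sketch (the variable names `AWS_DEFAULT_REGION`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` follow boto3's usual conventions, but treat them here as assumptions):

```python
import os

# Read the S3 connection parameters from the environment so that
# secrets never end up in version control
connection_params = {
    "region_name": os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
}

print(sorted(connection_params))
```

Export the three variables in your shell before running the script, and the code above slots straight into the boto3 connection snippet.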

After executing the code above, you'll see the list of all the buckets in your S3 account, as shown below.
List of buckets using AWS UI

List of buckets using Python 3

Getting a Postgres Database Engine

You can set up your Postgres instance locally, but we'll use a managed cloud provider like Aiven instead.

Step 1: Create an account on Aiven and the database

Log in to aiven.com and create an account.
Navigate to create a new project and enter the details.
Open the new project and create a service. A list of services will be shown as below. Select Postgres, ensure you have selected the free plan, and then create the service.

List of services available in Aiven
Once the service is built, you'll be able to view it in the project as shown below. Mine is mypostgres-001, and its status is Running.

DB status in aiven
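Aiven also exposes the same credentials as a single service URI of the form `postgres://user:password@host:port/defaultdb?sslmode=require`. If you'd rather copy that one string, the individual fields can be recovered in Python. A quick sketch, using a made-up URI:

```python
from urllib.parse import urlparse

# A made-up service URI in the shape Aiven provides
service_uri = (
    "postgres://avnadmin:secret@mypostgres-001.aivencloud.com:12345"
    "/defaultdb?sslmode=require"
)

# urlparse splits out the user, host, port, and database name
parts = urlparse(service_uri)

print("host:", parts.hostname)
print("port:", parts.port)
print("user:", parts.username)
print("database:", parts.path.lstrip("/"))
```

These are exactly the values you'll type into DBeaver's connection settings in the next section.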

Setting up DBeaver on your PC

DBeaver is a database management tool for querying and managing databases.
We'll use DBeaver to manage our newly created Postgres database.

Step 1: Download and Install DBeaver

Navigate to the DBeaver website to download and install it.

Once you have DBeaver installed, you can connect it to your database.

Step 2: Connect DBeaver to your Postgres database

Connection Icon in DBeaver
Navigate to the connection icon and add a connection to our Postgres database.

Under the Connection settings, ensure the details match the database credentials that Aiven has provided.
Dbeaver connection parameters
Aiven connection parameters

Test your connection and confirm that it succeeds.

Connection confirmation

From this point, you can create databases and tables to store your data in the new database.

Step 3: Run the following code to confirm you can manipulate objects in the database.

-- Create a database called luxdev_test
CREATE DATABASE luxdev_test;

-- Connect to the newly created database
-- (\c is a psql command; in DBeaver, switch the SQL editor's
--  active database to luxdev_test instead)
\c luxdev_test;

-- Create a schema
CREATE SCHEMA raw;

-- Create a table to store the data
CREATE TABLE raw.students (
    id SERIAL PRIMARY KEY,
    name VARCHAR(25),
    position INT
);

-- Insert a few records into the table
INSERT INTO raw.students (
    name, position
) 
VALUES
    ('Peter', 12),
    ('Mercy', 11),
    ('Bob', 13);

The snapshot below shows the output you should get when executing the code, confirming that the data has been successfully added to the students table.

Output generated by Dbeaver

Wrapping it Up

Congratulations! You’ve successfully walked through the foundational steps of setting up a data engineering project environment. By combining the power of AWS S3 for data storage, Python for automation, PostgreSQL for structured data management, and DBeaver for database management, you’ve built a scalable foundation for handling and analyzing data efficiently.
