Extract, Transform, Load (ETL) is a fundamental process in data engineering used to collect data from various sources, process it, and store it in a structured format for analysis. In this tutorial, we will build a simple ETL pipeline using Python and the CoinGecko API to extract cryptocurrency market data, transform it into a structured format, and load it into a SQLite file for further use.
Prerequisites
To follow along, you need to have:
- Python installed (>=3.7)
- requests and pandas libraries installed
- A CoinGecko API key (optional but recommended for higher request limits)
You can install the required libraries using:
pip install requests
Step 1: Extract Data from CoinGecko API
The extraction phase involves fetching cryptocurrency market data from the CoinGecko API.
import requests
import pandas as pd
def extract_data_from_api():
url = "https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd"
headers = {
"accept": "application/json",
"x-cg-demo-api-key": "YOUR_API_KEY_HERE", # Replace with your API key
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
data = response.json()
return pd.json_normalize(data) # Convert JSON response to DataFrame
else:
raise Exception("Error fetching data from API")
Step 2: Transform the Data
Transformation is necessary to clean and structure the data before loading it. We'll select relevant columns and rename them for clarity.
def transform_data(df):
df_transformed = df[["id", "symbol", "name", "current_price", "market_cap", "total_volume", "price_change_percentage_24h", "last_updated"]].copy()
df_transformed.columns = ["id", "symbol", "name", "price_usd", "market_cap_usd", "volume_24h_usd", "price_change_24h_percent", "date"]
df_transformed["date"] = pd.to_datetime(df_transformed["date"]).dt.date
df_transformed = df_transformed.fillna(0)
return df_transformed
Step 3: Load Data into SQLite Database
The final step is to store the processed data into an SQLite database for further analysis.
import sqlite3
def load_data_to_sqlite(df, db_file, table_name):
# Connect to the database
conn = sqlite3.connect(db_file)
cursor = conn.cursor()
# Create the table if it does not exist
cursor.execute(f"""
CREATE TABLE IF NOT EXISTS {table_name} (
id TEXT PRIMARY KEY,
symbol TEXT,
name TEXT,
price_usd REAL,
market_cap_usd REAL,
volume_24h_usd REAL,
price_change_24h_percent REAL
)
""")
# Load the data into the table
df.to_sql(table_name, conn, if_exists='replace', index=False)
# Commit and close the connection
conn.commit()
conn.close()
print(f"Data successfully loaded into table '{table_name}' in database {db_file}")
Step 4: Putting It All Together
Now, we can orchestrate the entire ETL process using a main function.
def etl_pipeline():
# Extraction
df_crypto = extract_data_from_api()
print("Extraction successful")
# Transformation
df_transformed = transform_data(df_crypto)
print("Transformation successful")
# Loading
db_file = "database.db"
table_name = "crypto_data"
load_data_to_sqlite(df_transformed, db_file, table_name)
print("Loading successful")
if __name__ == "__main__":
etl_pipeline()
This tutorial demonstrated how to build a simple ETL pipeline in Python using the CoinGecko API. We covered extracting data from the API, transforming it into a structured format, and loading it into an SQLite database. This pipeline can be extended to store data in a cloud database, automate execution using cron jobs, or integrate with data visualization tools.
Happy coding!
Top comments (0)