James

Debugging Python Data Pipelines

Introduction:

In this article, we'll walk through debugging a Python data pipeline that fetches GitHub repositories related to data engineering and stars them. The pipeline uses the GitHub API to fetch repository information, processes the results, and stars each repository.

Step 1: Setting Up Logging and Debugging Messages

To begin, let's set up Python's logging module so we can see what the pipeline is doing as it runs. We'll create a data_pipeline.py file and include the necessary imports and basic configuration for logging.

# data_pipeline.py
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Your GitHub API credentials (placeholder value; avoid committing a real token to version control)
GITHUB_API_TOKEN = 'YOUR_GITHUB_API_TOKEN'
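
While debugging, it often helps to raise the verbosity without editing the code. The following is a minimal sketch of one way to do that, assuming the level is read from an environment variable; the variable name PIPELINE_LOG_LEVEL and both helper functions are hypothetical and not part of the original pipeline.

```python
import logging
import os

# Level names this sketch accepts; anything else falls back to the default.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(name, default=logging.INFO):
    """Map a level name such as 'debug' to its numeric logging level."""
    return _LEVELS.get(str(name).upper(), default)


def configure_logging():
    """Configure logging, letting PIPELINE_LOG_LEVEL (hypothetical) override INFO."""
    level = resolve_log_level(os.environ.get("PIPELINE_LOG_LEVEL", "INFO"))
    # basicConfig has no effect if the root logger already has handlers.
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(levelname)s - %(message)s",
    )
    return level
```

With this in place, running the script with PIPELINE_LOG_LEVEL=DEBUG would surface logger.debug messages that stay hidden at the default INFO level.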


Step 2: Fetching Data from GitHub API

Next, we'll implement the function to fetch GitHub repositories related to data engineering. We'll use the popular requests library to make API calls.

import requests

def fetch_data_from_github():
    url = 'https://api.github.com/search/repositories'
    params = {'q': 'dataengineering', 'sort': 'stars', 'order': 'desc'}
    headers = {'Authorization': f'token {GITHUB_API_TOKEN}'}

    try:
        # A timeout keeps the pipeline from hanging indefinitely on a stalled connection
        response = requests.get(url, params=params, headers=headers, timeout=10)
        # Raise an HTTPError for 4xx/5xx responses so they are logged below
        response.raise_for_status()
        data = response.json()
        return data['items']
    except requests.exceptions.RequestException as e:
        logger.error(f"Failed to fetch data from GitHub API: {e}")
        return []
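
When a fetch fails or returns fewer results than expected, GitHub's rate-limit response headers are a useful first thing to log. The sketch below formats them into a single line; the helper names are hypothetical, and it takes a plain mapping of headers so it can be tried without a network call.

```python
import logging

logger = logging.getLogger(__name__)


def summarize_rate_limit(headers):
    """Describe GitHub's X-RateLimit-* headers in one line.

    `headers` is any mapping; requests' response.headers is case-insensitive,
    but a plain dict must use the exact header casing shown here.
    """
    remaining = headers.get("X-RateLimit-Remaining", "?")
    limit = headers.get("X-RateLimit-Limit", "?")
    reset = headers.get("X-RateLimit-Reset", "?")
    return f"rate limit {remaining}/{limit} remaining, resets at epoch {reset}"


def log_response_details(status_code, headers):
    """Log the status and rate-limit summary; warn when the quota is used up."""
    summary = summarize_rate_limit(headers)
    if str(headers.get("X-RateLimit-Remaining")) == "0":
        logger.warning("HTTP %s, %s", status_code, summary)
    else:
        logger.debug("HTTP %s, %s", status_code, summary)
    return summary
```

In the pipeline, this could be called as log_response_details(response.status_code, response.headers) right after the request returns.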

Step 3: Unit Testing for GitHub API

To ensure the GitHub API function behaves correctly, let's write some unit tests using pytest.

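# test_data_pipeline.py
import data_pipeline


class MockApiResponse:
    """Minimal stand-in for requests.Response, covering only what the pipeline calls."""

    def __init__(self):
        self.status_code = 200

    def raise_for_status(self):
        # A 200 response has nothing to raise; the pipeline calls this before json()
        pass

    def json(self):
        return {
            'items': [
                {'name': 'around-dataengineering', 'html_url': 'https://github.com/around-dataengineering'},
                {'name': 'dataengineering', 'html_url': 'https://github.com/dataengineering'}
            ]
        }


def test_fetch_data_from_github(monkeypatch):
    # Patch the token and requests.get for this test only; pytest's monkeypatch
    # fixture restores the originals afterwards
    monkeypatch.setattr(data_pipeline, 'GITHUB_API_TOKEN', 'TEST_TOKEN')
    monkeypatch.setattr(data_pipeline.requests, 'get', lambda *args, **kwargs: MockApiResponse())

    repositories = data_pipeline.fetch_data_from_github()
    assert len(repositories) == 2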
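
The test above covers only the success path. The error branch, where the function logs the failure and returns an empty list, is worth testing too. The following sketch shows the idea with unittest.mock; to stay self-contained it uses a simplified stand-in, fetch_items, with an injectable getter, and FakeRequestError in place of requests.exceptions.RequestException. Both names are hypothetical.

```python
import logging
from unittest import mock

logger = logging.getLogger(__name__)


class FakeRequestError(Exception):
    """Stand-in for requests.exceptions.RequestException in this sketch."""


def fetch_items(http_get):
    """Simplified stand-in for fetch_data_from_github with an injectable getter."""
    try:
        response = http_get("https://api.github.com/search/repositories")
        response.raise_for_status()
        return response.json()["items"]
    except FakeRequestError as e:
        logger.error("Failed to fetch data: %s", e)
        return []


def test_returns_empty_list_when_request_fails():
    # The getter itself raises, as it would on a connection error.
    failing_get = mock.Mock(side_effect=FakeRequestError("connection refused"))
    assert fetch_items(failing_get) == []
    failing_get.assert_called_once()


def test_returns_empty_list_on_http_error():
    # The request succeeds, but raise_for_status() reports a 4xx/5xx status.
    response = mock.Mock()
    response.raise_for_status.side_effect = FakeRequestError("403 Forbidden")
    assert fetch_items(mock.Mock(return_value=response)) == []
    response.json.assert_not_called()
```

The same pattern should carry over to the real function by patching requests.get and raising a real requests exception instead of the stand-in.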

Step 4: Star the GitHub Repositories

Now, we'll implement the function to star the fetched GitHub repositories. We'll use the PyGithub library, which simplifies working with the GitHub API. Note that get_repo expects the full owner/name form, which the search results provide in the full_name field.

from github import Github

def star_repositories(repositories):
    try:
        github_client = Github(GITHUB_API_TOKEN)
        user = github_client.get_user()

        for repo in repositories:
            # get_repo needs 'owner/name', so use full_name rather than name
            repo_obj = github_client.get_repo(repo['full_name'])
            user.add_to_starred(repo_obj)
            logger.info(f"Starred repository: {repo['full_name']}")
    except Exception as e:
        logger.error(f"Failed to star repositories: {e}")
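
Because the try block above wraps the whole loop, the first failure stops every remaining repository from being starred, and the log shows only that one error. A possible alternative, sketched here under the assumption that the starring call is passed in as a function, is to handle errors per repository and report which ones failed. The names star_each and star_one are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def star_each(repositories, star_one):
    """Call star_one(name) for every repository, collecting failures.

    `star_one` is any callable that stars a single repository and raises on
    failure; in the pipeline it would wrap the PyGithub calls.
    """
    starred, failed = [], []
    for repo in repositories:
        name = repo.get("full_name") or repo.get("name", "<unknown>")
        try:
            star_one(name)
        except Exception as e:
            # Log and keep going so one bad repository does not hide the rest.
            logger.error("Failed to star %s: %s", name, e)
            failed.append(name)
        else:
            logger.info("Starred repository: %s", name)
            starred.append(name)
    return starred, failed
```

Returning both lists makes it straightforward to log a summary at the end of the run or to retry only the failures.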

Step 5: Debugging with Interactive Debugger (pdb)

Now that we have our main functions implemented, let's use the interactive debugger pdb to trace and inspect the pipeline's execution. We'll add a breakpoint in the star_repositories function, just before each repository is starred, and run the pipeline.

import pdb

def star_repositories(repositories):
    try:
        github_client = Github(GITHUB_API_TOKEN)
        user = github_client.get_user()

        for repo in repositories:
            # get_repo needs 'owner/name', so use full_name rather than name
            repo_obj = github_client.get_repo(repo['full_name'])
            pdb.set_trace()  # Pause here on every iteration; remove once debugging is done
            user.add_to_starred(repo_obj)
            logger.info(f"Starred repository: {repo['full_name']}")
    except Exception as e:
        logger.error(f"Failed to star repositories: {e}")
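
A breakpoint inside a loop pauses on every iteration, which gets tedious with many repositories. One option is to pause only when a record looks wrong. The sketch below uses the built-in breakpoint() (Python 3.7+), which drops into pdb by default and can be disabled by setting the PYTHONBREAKPOINT environment variable to 0. The function name and the choice of required keys are assumptions made for illustration.

```python
def find_incomplete_repos(repositories, required_keys=("name", "html_url")):
    """Return (name, missing_keys) for each record lacking a required key.

    Execution pauses in the debugger only for the malformed records, so
    well-formed ones pass through without interruption.
    """
    incomplete = []
    for repo in repositories:
        missing = [key for key in required_keys if key not in repo]
        if missing:
            breakpoint()  # Inspect `repo` and `missing` here
            incomplete.append((repo.get("name"), missing))
    return incomplete
```

Running the script with PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op, so the same code can run unattended without edits.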

Step 6: Running the Pipeline and Debugging

Finally, let's run the pipeline and debug it using the pdb interactive debugger. We'll execute the fetch_data_from_github and star_repositories functions in sequence.

if __name__ == '__main__':
    repositories = fetch_data_from_github()
    star_repositories(repositories)

When the pdb breakpoint is hit, you can inspect variable values, step through the code, and identify any issues. Use next (n) to run the current line, step (s) to step into a function call, continue (c) to resume until the next breakpoint, and p followed by an expression to print its value.
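
Since this pipeline has a real side effect (it stars repositories on your account), it can be convenient to debug it without touching the API. One way to do that, shown here as a sketch, is a --dry-run flag that only logs what would be starred. The flag, parse_args, and run are hypothetical additions, and the starring call is passed in as a function so the example stays self-contained.

```python
import argparse
import logging

logger = logging.getLogger(__name__)


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Fetch and star repositories.")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log what would be starred without calling the API",
    )
    return parser.parse_args(argv)


def run(repositories, star, dry_run=False):
    """Star each repository, or only log the intent when dry_run is set."""
    acted_on = []
    for repo in repositories:
        name = repo["name"]
        if dry_run:
            logger.info("Dry run: would star %s", name)
            continue
        star(name)
        acted_on.append(name)
    return acted_on
```

With this, the script could be run once with --dry-run to check the fetched list and the log output before starring anything for real.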

Conclusion:
Debugging is essential to keeping Python data pipelines reliable. Logging shows what the pipeline did, unit tests pin down the behavior of individual functions, and an interactive debugger such as pdb lets you inspect state at the point where something goes wrong.