Lesson 3 Pagination & Authentication & dlt Configuration
Introduction to Pagination
- Pagination is a technique used to retrieve data in pages, especially when an endpoint limits the amount of data that can be fetched at once.
- The GitHub API returns data in pages, and pagination allows us to retrieve all the data.
GitHub API Pagination
- The GitHub API provides the `per_page` and `page` query parameters to control pagination.
- The `Link` header in the response contains URLs for fetching additional pages of data.
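To make the role of the `Link` header concrete, here is a minimal sketch that extracts the next-page URL from a header value using only the standard library. The function names and the `OWNER/REPO` path are placeholders chosen for illustration, not part of the GitHub API or dlt.

```python
import re
from typing import Dict, Optional

# Matches one `<url>; rel="name"` entry of a Link header.
_LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header: str) -> Dict[str, str]:
    """Map each rel value (for example next, prev, first, last) to its URL."""
    return {rel: url for url, rel in _LINK_ENTRY.findall(header)}


def next_page_url(header: Optional[str]) -> Optional[str]:
    """Return the URL of the next page, or None when there is no next page."""
    if not header:
        return None
    return parse_link_header(header).get("next")


# Illustrative header value; OWNER/REPO is a placeholder repository path.
example = (
    '<https://api.github.com/repos/OWNER/REPO/stargazers?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repos/OWNER/REPO/stargazers?per_page=100&page=5>; rel="last"'
)
print(next_page_url(example))
```

A response without a `next` entry in its `Link` header is the last page, which is why `next_page_url` returns `None` in that case.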
Implementing Pagination with dlt's RESTClient
- dlt's RESTClient can handle pagination seamlessly when working with REST APIs like GitHub.
- The `RESTClient` is one of dlt's helpers; it makes it easier to interact with REST APIs by managing repetitive tasks such as pagination.
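The repetitive task that a paginating client takes over is essentially a "follow the next link until there is none" loop. The sketch below shows that loop in plain Python with an injected fetch function so it runs without network access. It is not dlt's implementation, and `paginate`, `Page`, and `FAKE_PAGES` are hypothetical names used only here. With dlt installed, `RESTClient` exposes a comparable page-by-page iterator, so this loop would not need to be written by hand.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A fetch function takes a URL and returns (items, next_url_or_None).
Page = Tuple[List[dict], Optional[str]]


def paginate(first_url: str, fetch: Callable[[str], Page]) -> Iterator[List[dict]]:
    """Yield one page of items at a time until no next URL is reported."""
    url: Optional[str] = first_url
    while url is not None:
        items, url = fetch(url)
        yield items


# Fake three-page API, used only to exercise the loop offline.
FAKE_PAGES: Dict[str, Page] = {
    "page1": ([{"id": 1}, {"id": 2}], "page2"),
    "page2": ([{"id": 3}], "page3"),
    "page3": ([{"id": 4}], None),
}

# Flatten all pages into a single list of items.
all_items = [item for page in paginate("page1", FAKE_PAGES.__getitem__) for item in page]
print(len(all_items))
```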
Authentication with GitHub API
- Unauthenticated requests to the GitHub API are subject to a much lower rate limit, so authenticate to avoid rate limit errors when fetching data.
- To authenticate, create an environment variable for your access token or use dlt's secrets configuration.
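As a rough sketch of the environment-variable approach, the helper below reads a token from `ACCESS_TOKEN` and turns it into an `Authorization` header. The function name `github_headers` is hypothetical, and the snippet assumes the token is a GitHub personal access token sent as a bearer token.

```python
import os
from typing import Dict, Mapping


def github_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build request headers, adding a bearer token when ACCESS_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = env.get("ACCESS_TOKEN")
    if token:
        # The token is read at call time and never hard-coded in the source.
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Passing a mapping explicitly makes the helper easy to try without touching
# the real environment; "example-token" is a dummy value.
print(github_headers({"ACCESS_TOKEN": "example-token"}))
```

Keeping the token out of the source file means the same code can run locally and in CI with different credentials.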
dlt Configuration and Secrets
- Configurations are non-sensitive settings that define the behavior of a data pipeline.
- Secrets are sensitive data like passwords, API keys, and private keys, which should be kept secure.
- dlt automatically extracts configuration settings and secrets based on flexible naming conventions.
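One way to picture the naming conventions is as a search from the most specific name to the least specific one. The sketch below is a simplified illustration for environment variables only, assuming upper-case names with sections joined by double underscores. The helper names are hypothetical, and dlt's actual resolution also consults other providers such as its TOML files.

```python
from typing import List, Mapping, Optional, Sequence


def candidate_env_keys(sections: Sequence[str], key: str) -> List[str]:
    """List environment variable names from most to least specific."""
    names = []
    for depth in range(len(sections), -1, -1):
        parts = list(sections[:depth]) + [key]
        names.append("__".join(part.upper() for part in parts))
    return names


def resolve(sections: Sequence[str], key: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the first value found among the candidate names, else None."""
    for name in candidate_env_keys(sections, key):
        if name in env:
            return env[name]
    return None


# Candidate names for an access token used by a source in a "github" section.
print(candidate_env_keys(["sources", "github"], "access_token"))
```

Under this scheme a plain `ACCESS_TOKEN` variable acts as a fallback, while a longer, sectioned name can override it for one specific source.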
Exercise 1: Pagination with RESTClient
- Use dlt's RESTClient to fetch paginated data from the GitHub API.
- The full list of available paginators can be found in the official dlt documentation.
Exercise 2: Run pipeline with dlt.secrets.value
- Use the `sql_client` to query the `stargazers` table and find the user with id `17202864`.
- Use environment variables to set the `ACCESS_TOKEN` variable.
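The query in this exercise is an ordinary lookup by id. The sketch below shows its shape against an in-memory SQLite table that stands in for the loaded `stargazers` table; the `login` column and both rows are placeholders, not real exercise output. In a dlt pipeline the same SQL would be run through the pipeline's `sql_client` instead of `sqlite3`.

```python
import sqlite3

# In-memory stand-in for the destination; the rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stargazers (id INTEGER PRIMARY KEY, login TEXT)")
conn.executemany(
    "INSERT INTO stargazers (id, login) VALUES (?, ?)",
    [(17202864, "placeholder-login"), (1, "another-placeholder")],
)

# Parameterised lookup of a single user by id.
row = conn.execute(
    "SELECT id, login FROM stargazers WHERE id = ?", (17202864,)
).fetchone()
print(row)
conn.close()
```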
Key Takeaways
- Pagination is essential when working with APIs that return data in pages.
- dlt's RESTClient can handle pagination seamlessly and manage repetitive tasks.
- Authentication is required to avoid rate limit errors when fetching data from the GitHub API.
- dlt configuration and secrets are essential for setting up data pipelines securely.
Further Reading
- GitHub API documentation: Pagination
- dlt documentation: RESTClient, Configuration and Secrets
Lesson 4 Using Pre-built Sources and Destinations
Pre-built Sources
Overview
Pre-built sources are the simplest way to get started with building your stack. They are fully customizable and come with a set of pre-defined configurations.
Types of Pre-built Sources
- Existing Verified Sources: Use an existing verified source by running the `dlt init` command.
- SQL Databases: Load data from SQL databases (PostgreSQL, MySQL, SQLite, Oracle, IBM DB2, etc.) into a destination.
- Filesystem: Load data from the filesystem, including CSV, Parquet, and JSONL files.
- REST API: Load data from a REST API using a declarative configuration.
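"Declarative configuration" means the REST API source is described as data rather than as request code. The sketch below shows what such a description might look like as a Python dictionary, assuming a `client` section with a base URL and a list of `resources`. `OWNER/REPO` is a placeholder, `resource_names` is a hypothetical helper, and the exact keys accepted should be checked against the dlt documentation for your version.

```python
from typing import Any, Dict, List

# Declarative description of a REST API source, written as plain data.
# OWNER/REPO is a placeholder repository path.
github_config: Dict[str, Any] = {
    "client": {
        "base_url": "https://api.github.com/repos/OWNER/REPO/",
    },
    "resources": [
        {
            "name": "stargazers",
            "endpoint": {
                "path": "stargazers",
                "params": {"per_page": 100},
            },
        },
    ],
}


def resource_names(config: Dict[str, Any]) -> List[str]:
    """Return the resource names declared in the config."""
    return [r if isinstance(r, str) else r["name"] for r in config["resources"]]


print(resource_names(github_config))
```

With dlt installed, a dictionary of this shape would typically be handed to the `rest_api` source, which turns each declared resource into a table at the destination.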
Steps to Use Pre-built Sources
- Install dlt: Install dlt with `pip install dlt`.
- List all verified sources: Use the `dlt init` command to list all available verified sources and their short descriptions.
- Initialize the source: Initialize the source with the `dlt init` command, passing the source and destination names.
- Add credentials: Add credentials using environment variables or dlt's secrets configuration.
- Run the pipeline: Run the pipeline to load data into the destination.
Pre-built Destinations
Overview
Pre-built destinations are used to load data into a specific location. They are customizable and come with a set of pre-defined configurations.
Types of Pre-built Destinations
- Filesystem destination: Load data into files stored locally or in cloud storage solutions.
- Delta tables: Write Delta tables using the `deltalake` library.
- Iceberg tables: Write Iceberg tables using the `pyiceberg` library.
Steps to Use Pre-built Destinations
- Choose a destination: Choose a destination based on your needs.
- Modify the destination parameter: Modify the `destination` parameter in your pipeline configuration.
- Run the pipeline: Run the pipeline to load data into the destination.
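Switching destinations usually comes down to changing a single argument. The sketch below keeps the pipeline settings in a small helper so that only `destination` varies. The pipeline and dataset names are hypothetical, and `make_pipeline` imports dlt lazily so the rest of the snippet runs even where dlt is not installed.

```python
from typing import Dict


def pipeline_kwargs(destination: str) -> Dict[str, str]:
    """Keyword arguments for dlt.pipeline; only the destination varies."""
    return {
        "pipeline_name": "github_pipeline",  # hypothetical name
        "destination": destination,
        "dataset_name": "github_data",  # hypothetical name
    }


def make_pipeline(destination: str):
    """Create a dlt pipeline for the given destination (requires dlt)."""
    import dlt  # imported lazily so the sketch loads without dlt installed

    return dlt.pipeline(**pipeline_kwargs(destination))


# The same settings pointed at two different destinations.
duckdb_kwargs = pipeline_kwargs("duckdb")
filesystem_kwargs = pipeline_kwargs("filesystem")
print(duckdb_kwargs["destination"], filesystem_kwargs["destination"])
```

Because the source code and the other settings stay the same, the pipeline can be developed against a local destination and later pointed at a different one.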
Example Use Cases
- Loading data from a SQL database: Use the `sql_database` source to load data from a SQL database into a destination.
- Loading data from a REST API: Use the `rest_api` source to load data from a REST API into a destination.
- Loading data from the filesystem: Use the `filesystem` source to load data from the filesystem into a destination.
Exercise
- Run the rest_api source: Run the `rest_api` source to load data from a REST API into a destination.
- Run the sql_database source: Run the `sql_database` source to load data from a SQL database into a destination.
- Run the filesystem source: Run the `filesystem` source to load data from the filesystem into a destination.
Next Steps
- Proceed to the next lesson: Proceed to the next lesson to learn more about custom sources and destinations.
- Explore the dlt documentation: Explore the dlt documentation to learn more about pre-built sources and destinations.