Link to original article written by Mage DevRel, Cole Freeman: https://www.mage.ai/blog/streamline-data-transfer-the-ultimate-data-integration-guide-sftp-to-bigquery
TL;DR
Securely move data from SFTP to Google BigQuery using Mage Pro for automated analytics. This guide walks you through setting up a pipeline, connecting to SFTP, choosing data streams, configuring replication, transforming data, connecting to BigQuery, triggering syncs, and verifying data. Automate your data flow, eliminate manual errors, and get valuable business insights!
Table of Contents
- Introduction
- Business use case
- Step by step implementation guide
- Step 1: Creating a new pipeline
- Step 2: Configuring the SFTP connection
- Step 3: Selecting data streams
- Step 4: Configuring data replication methods
- Step 5: Transforming data
- Step 6: Setting up Google BigQuery as destination
- Step 7: Triggering the data sync
- Step 8: Verifying data synced to Google BigQuery
- Conclusion
Are you struggling with secure data transfers from SFTP to your data warehouse for analytics? This guide demonstrates how to build automated, scalable data integration pipelines using Mage to transform complex data workflows into straightforward processes that power your analytics infrastructure. In the following sections, we'll walk through setting up Mage Pro, creating your first pipeline, and establishing reliable data synchronization to Google BigQuery.
Business use case
Many companies face a common challenge when managing transaction data from multiple systems, even multiple vendors. These organizations regularly receive sensitive financial information via SFTP as CSV files while their analytics teams require this data in a centralized warehouse like Google BigQuery for analyzing important business operations. Without automation, the process of transferring data from SFTP servers to analytics platforms creates multiple pain points. Manual transfers:
- Consume valuable time;
- Introduce security vulnerabilities when handling sensitive information;
- Lead to data inconsistencies that degrade analysis quality.
Implementing an automated SFTP to BigQuery pipeline addresses these challenges by streamlining the entire data flow. The business benefits are substantial. An automated pipeline can:
- Reduce processing time;
- Eliminate human error in transfers;
- Apply data transformations consistently;
- Improve monitoring capabilities;
- Provide better data availability for analytics teams.
This integration is particularly valuable for any organization that regularly receives data via SFTP but needs that information in a modern data warehouse to power analytics, reporting, and business intelligence. By creating a reliable pipeline between these systems, companies can focus less on data logistics and more on extracting valuable insights that drive business decisions. Let’s get started building an integration pipeline in Mage Pro.
Step by step implementation guide
To begin, log into your Mage Pro account; once logged in, you'll land on the homepage. Hover over the navigation menu and you'll see several options. Find the button labeled "Pipelines" and click it to reach the pipelines page, where all your data integration workflows are managed.
Step 1: Creating a new pipeline
Once you are on the pipelines page, it’s time to create a new pipeline. Click on the green "New pipeline" button. You will be prompted to choose how you want to start your pipeline configuration. For this tutorial, select the option to "Start from Scratch."
Next, choose "Data Integration" as the type of pipeline you want to create. Name your pipeline something descriptive to easily identify it later. After naming it, hit the "Create new pipeline" button to proceed.
Step 2: Configuring the SFTP connection
After creating your pipeline, you will enter the pipeline editor. Here, you'll need to set up the SFTP connection. Click the "Select source" dropdown, scroll down the list until you see SFTP, and click on it.
This action will populate a YAML configuration file in the editor. You'll need to fill in the necessary details for the SFTP connection, including the host, port, username, and password. Additionally, specify the table name for the data you want to pull, and set the folder prefix to "export", including any subfolder path if necessary.
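The generated YAML will look roughly like the sketch below. Every value here is a placeholder (hypothetical host, credentials, table name, and file pattern), and the field names follow a tap-sftp-style configuration; Mage's actual template may differ slightly, so fill in the file it generates rather than pasting this verbatim.

```yaml
# Hypothetical SFTP source configuration — all values are placeholders
host: sftp.example.com            # your SFTP server hostname
port: 22                          # default SFTP port
username: mage_user
password: "{{ env_var('SFTP_PASSWORD') }}"  # avoid hard-coding secrets
tables:
- table_name: golf_player_data    # name of the stream to expose
  search_prefix: export           # folder on the server to scan
  search_pattern: ".*\\.csv"      # regex matching the files to pull
```

Storing the password in an environment variable, as sketched above, keeps credentials out of version control.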
Once you've entered the connection details, it's crucial to test the SFTP connection to ensure everything is set up correctly. Look for the "Test Connection" button and click it. If everything is configured properly, you should receive a positive confirmation of the connection.
Step 3: Selecting data streams
With a successful connection established, it's time to select the data streams you want to synchronize. Mage will display available streams from your SFTP source. Look for the stream corresponding to "fetch golf player data" and check the box next to it to confirm your selection.
After confirming the stream, allow a moment for Mage to load all column data associated with the selected stream. This data will include metadata and actual CSV content, which is essential for setting up your data integration.
Step 4: Configuring data replication methods
Now that your data stream is selected, you can configure how you want to replicate this data into Google BigQuery. You have two primary options: loading a full table or incrementally loading new data. For this example, we will choose the "Full Table" option, as we are working with a single file.
Additionally, you can set unique conflict methods that dictate how Mage should handle records with duplicate values. You may choose to update existing records or ignore them. It’s essential to determine this based on your data requirements.
Moreover, you can provide bookmarks to track sync processing, which is particularly useful for incremental loading. Finally, you may define a key property to create a primary key for your destination table if required.
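Conceptually, the per-stream settings described above (chosen through the Mage UI) amount to something like the following; the field names and values here are illustrative, not Mage's exact schema:

```yaml
# Illustrative per-stream replication settings — names and values are examples
replication_method: FULL_TABLE    # or INCREMENTAL to load only new data
unique_conflict_method: UPDATE    # update duplicate records (or IGNORE them)
bookmark_properties: []           # e.g. [updated_at] for incremental syncs
key_properties: [player_id]       # primary key for the destination table
```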
Step 5: Transforming data
Transforming data may be necessary to meet specific analytics requirements. In Mage, you can access the transformation settings through the "Transformer" button. This feature allows you to apply various transformations to your data before it’s sent to Google BigQuery. Define any transformations needed, such as data calculations, aggregations, or filtering.
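As an illustrative sketch of what such a transformation might do, the function below standardizes column names and drops incomplete rows with pandas. The column names and cleaning rules are hypothetical, and in Mage the logic would live inside a transformer block rather than a standalone function:

```python
import pandas as pd


def clean_player_data(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleanup for a golf-player CSV; column names are made up."""
    # Normalize headers: lowercase, spaces to underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Drop rows that are missing a player name
    df = df.dropna(subset=["player_name"])
    # Coerce the rank to numeric; unparseable values become NaN and are dropped
    df["rank"] = pd.to_numeric(df["rank"], errors="coerce")
    return df.dropna(subset=["rank"])
```

Filtering out malformed rows before the load keeps bad records from ever reaching BigQuery, which is usually cheaper than cleaning them up downstream.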
Step 6: Setting up Google BigQuery as destination
Next, we need to configure Google BigQuery as the destination for your data. In the pipeline editor, select BigQuery from the destination options. This action will generate another YAML configuration file.
Within this configuration, you will need to provide the path to your credentials file. This file is crucial as it authenticates your access to Google BigQuery. You can download the credentials in JSON format from your Google Cloud Platform account.
Additionally, specify your project ID and dataset name. For example, if your dataset is called "golf_rankings," enter that in the configuration (BigQuery dataset names may only contain letters, numbers, and underscores). Ensure that your location setting is correct, typically "US" for the United States.
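A minimal sketch of the destination configuration follows; the paths and identifiers are placeholders, and the exact key names may differ in the template Mage generates:

```yaml
# Hypothetical BigQuery destination configuration — values are placeholders
path_to_credentials_json_file: /path/to/service_account_key.json
project_id: my-gcp-project
dataset: golf_rankings
location: US
```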
With your destination configured, it’s time to test the data sync connection. Click the "Test Connection" button. A successful test indicates that your pipeline can communicate with Google BigQuery effectively.
Step 7: Triggering the data sync
Once your data integration pipeline is fully configured and tested, it’s time to trigger the data sync. Navigate to the "Triggers" section in the left-hand menu. Here, you can create a new trigger or run an existing one.
For this demonstration, simply select the "Run One" trigger. This will initiate the data transfer process from your SFTP server to Google BigQuery. During this process, you can monitor the run's progress, and Mage provides detailed logging to help diagnose any potential issues.
Step 8: Verifying data synced to Google BigQuery
After the sync process completes, it's important to verify that the data has been correctly loaded into Google BigQuery. Refresh your BigQuery console and navigate to the designated dataset, such as "golf_rankings." You should see a new table in this dataset, against which you can run a query. Simply run a COUNT(*) query on the new table to confirm the expected data synced to BigQuery. This query returns the number of records imported, confirming that your data transfer was successful.
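For example, assuming the table landed as `players` in a `golf_rankings` dataset (both names are hypothetical), the row count check looks like:

```sql
SELECT COUNT(*) AS row_count
FROM `my-gcp-project.golf_rankings.players`;
```

Comparing the returned count against the number of rows in the source CSV is a quick sanity check that nothing was dropped in transit.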
Conclusion
In this guide, we have covered the entire process of automating data integration from an SFTP server to Google BigQuery using Mage. From setting up your SFTP connection to verifying the data in BigQuery, each step is crucial for creating a reliable data pipeline.
By following these steps, you've created a scalable and efficient workflow that moves data from SFTP to BigQuery reliably, saving you both time and resources. With your data now integrated, you can focus on analyzing it to gain insights that drive your business forward.
Looking to try out Mage Pro? Book your demo today!