Abhinav Singh

Data Synchronization from Google BigQuery to ClickHouse in an AWS Air-Gapped Environment

Understanding the Key Components

Air-Gapped Environment
An air-gapped environment enforces strict outbound network policies, preventing external communication. This improves security but makes cross-cloud data synchronization challenging.

Proxy Server
A proxy server is a lightweight, high-performance intermediary facilitating outbound requests from workloads in restricted environments. It acts as a bridge, enabling controlled external communication.

ClickHouse
ClickHouse is an open-source, column-oriented OLAP (Online Analytical Processing) database known for its high-performance analytics capabilities.

This article explores how to seamlessly sync data from BigQuery, Google Cloud’s managed analytics database, to ClickHouse running in an AWS-hosted airgapped Kubernetes cluster using proxy-based networking.

Use Case
Deploying ClickHouse in airgapped environments presents challenges in syncing data across isolated cloud infrastructures such as GCP, Azure, or AWS.

In our setup, ClickHouse is deployed via Helm charts in an AWS Kubernetes cluster, with strict outbound restrictions. The goal is to sync data from a BigQuery table (GCP) to ClickHouse (AWS K8S), adhering to airgap constraints.
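For context, a non-production ClickHouse deployment of this kind can be stood up with a community Helm chart. The chart and namespace below are illustrative assumptions, not necessarily what this setup uses.

```bash
# Illustrative only: install ClickHouse from a community Helm chart
# (the Bitnami chart and the "analytics" namespace are assumptions).
helm install clickhouse oci://registry-1.docker.io/bitnamicharts/clickhouse \
  --namespace analytics --create-namespace
```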

Challenges

  • Restricted Outbound Network: The ClickHouse cluster cannot directly access Google Cloud services due to airgap policies.

  • Data Transfer Between Isolated Clouds: There is no straightforward mechanism for syncing data from GCP to ClickHouse in AWS without external connectivity.

Solution
The solution leverages a corporate proxy server to facilitate communication. By injecting a custom proxy configuration into ClickHouse, we enable HTTP/HTTPS traffic routing through the proxy, allowing controlled outbound access.


Architecture Overview

1. BigQuery to GCS Export: Data is first exported from BigQuery to a GCS bucket (a sketch of this export follows the list).
2. ClickHouse GCS Integration: ClickHouse fetches the exported data from GCS using its gcs table function.
3. Proxy Routing: ClickHouse’s outbound requests are routed through the corporate proxy server.
4. Data Ingestion in ClickHouse: The retrieved data is processed and stored within ClickHouse for analytics.
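As a sketch of the first step, the BigQuery-to-GCS export can be done with an EXPORT DATA statement. The project, dataset, table, and bucket names below are placeholders, not the actual ones used in this setup.

```sql
-- Sketch: export a BigQuery table to a GCS bucket as Parquet files.
-- Project, dataset, table, and bucket names are placeholders.
EXPORT DATA OPTIONS (
  uri = 'gs://my-export-bucket/events/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT *
FROM `my-project.my_dataset.events`;
```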

Implementation Steps

1. Proxy Configuration

  • Created a proxy.xml file defining proxy settings for outbound HTTP/HTTPS requests (see the sketch after this list).

  • Used a Kubernetes ConfigMap (clickhouse-proxy-config) to store this configuration.

  • Mounted the ConfigMap dynamically into the ClickHouse pod.
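A minimal sketch of such a proxy.xml is shown below. The proxy host and port are placeholders, and the exact configuration keys should be verified against the proxy documentation for your ClickHouse version.

```xml
<!-- proxy.xml (sketch): route ClickHouse's outbound HTTP/HTTPS requests
     (e.g. S3/GCS and URL table functions) through the corporate proxy.
     Host and port are placeholders. -->
<clickhouse>
    <proxy>
        <http>
            <uri>http://corporate-proxy.internal:3128</uri>
        </http>
        <https>
            <uri>http://corporate-proxy.internal:3128</uri>
        </https>
    </proxy>
</clickhouse>
```

The file can then be stored as the ConfigMap referenced above, for example:

```bash
# Store proxy.xml as the ConfigMap that will be mounted into the ClickHouse pod
# (namespace is a placeholder).
kubectl create configmap clickhouse-proxy-config \
  --from-file=proxy.xml \
  --namespace analytics
```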

2. Kubernetes Deployment

  • Mounted proxy.xml in the ClickHouse pod at /etc/clickhouse-server/config.d/proxy.xml.

  • Adjusted security contexts, allowing privilege escalation (for testing) and running the pod as root to simplify permissions (a pod spec sketch follows below).
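A hedged sketch of the relevant pod spec fragment is below. The names and the permissive security context mirror the testing-only choices above; with a Helm-based deployment this would typically be expressed through the chart’s values (extra volumes and volume mounts) rather than edited directly.

```yaml
# Sketch: mount the ConfigMap at the path ClickHouse reads extra config from.
# Names and the relaxed securityContext are illustrative and for testing only.
spec:
  securityContext:
    runAsUser: 0                        # run as root to simplify permissions (testing only)
  containers:
    - name: clickhouse
      securityContext:
        allowPrivilegeEscalation: true  # relaxed for testing; tighten for production
      volumeMounts:
        - name: proxy-config
          mountPath: /etc/clickhouse-server/config.d/proxy.xml
          subPath: proxy.xml
  volumes:
    - name: proxy-config
      configMap:
        name: clickhouse-proxy-config
```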


3. Testing and Validation

  • Deployed a non-stateful ClickHouse instance to iterate quickly.

  • Verified that ClickHouse requests were routed through the proxy.

  • Observed proxy logs confirming that outbound requests were successfully relayed to GCP (a sample validation query follows below).
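For validation, a query along these lines can be run from ClickHouse; the bucket path, HMAC credentials, and target table are placeholders. With proxy.xml in place, the corresponding request should show up in the proxy logs.

```sql
-- Sketch: read the exported Parquet files from GCS via ClickHouse's gcs table function.
-- Bucket path and HMAC credentials are placeholders.
SELECT count(*)
FROM gcs(
    'https://storage.googleapis.com/my-export-bucket/events/*.parquet',
    'GCS_HMAC_ACCESS_KEY',
    'GCS_HMAC_SECRET',
    'Parquet'
);

-- Ingest into a local table for analytics (events_local is a placeholder table
-- that must already exist with a matching schema).
INSERT INTO events_local
SELECT *
FROM gcs(
    'https://storage.googleapis.com/my-export-bucket/events/*.parquet',
    'GCS_HMAC_ACCESS_KEY',
    'GCS_HMAC_SECRET',
    'Parquet'
);
```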

[Screenshot] Left window: the query against the BigQuery data; right window: proxy logs showing the request being forwarded through the proxy server.

Outcome

This approach successfully enabled secure communication between ClickHouse (AWS) and BigQuery (GCP) in an airgapped environment. The use of a ConfigMap-based proxy configuration made the setup:

  • Scalable: Easily adaptable to different cloud vendors (GCP, Azure, AWS).
  • Flexible: Decouples networking configuration from application logic.
  • Secure: Ensures outbound traffic is strictly controlled via the proxy.

By leveraging ClickHouse’s extensible configuration system and Kubernetes, we overcame strict network isolation to enable cross-cloud data workflows in constrained environments. This architecture can be extended to other cloud-native workloads requiring external data synchronization in airgapped environments.
