Snowflake vs. Databricks: Building Your Data Stack with Kestra🚀🌩️🧑‍💻

When it comes to selecting a cloud data platform, two big names often come up: Snowflake and Databricks. Both platforms have their strengths, offering high scalability, advanced analytics and a robust set of tools, but they’re built for different needs. Let’s dive into what makes each unique and how you might use them in your data stack.

In this article, we will compare Databricks and Snowflake and demonstrate how Kestra’s orchestration capabilities can help you manage both platforms. You’ll see how to build seamless data workflows that boost productivity and performance for modern data engineering teams.

Snowflake: More Than Just Another Data Warehouse

Snowflake isn't your typical data warehouse. It's built for the cloud from the ground up, which gives it some great capabilities.

A key feature of Snowflake is its ability to handle compute and storage separately. This architectural choice has significant implications for data processing: it lets you scale up compute resources for large queries without incurring extra storage costs, then scale back down when the task is complete. It’s like having a flexible, pay-as-you-go supercomputer for your data.


Snowflake's architecture is built on a unique three-layer design that separates storage, compute and cloud services. The storage layer houses all data, which is automatically compressed and organized for quick retrieval. The compute layer consists of "virtual warehouses" - essentially clusters of compute resources that process queries independently of storage. This separation allows for incredible flexibility, as you can scale up compute power for demanding queries without affecting your storage costs. The cloud services layer ties everything together, handling tasks from query optimization to security management, ensuring smooth operation across the platform.
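
For example, scaling compute is a single SQL statement against a virtual warehouse. Here is a minimal sketch (the warehouse name and sizes are illustrative):

-- create a small warehouse that suspends itself when idle
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- scale up for a demanding query, then scale back down afterwards
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';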

One particularly useful feature is "Time Travel", which enables access to historical data up to 90 days old. It's a powerful tool for data governance and recovery, allowing restoration of accidentally dropped tables or reversal of faulty updates without traditional backups. For auditing, it provides point-in-time analysis, crucial for compliance and investigations. Snowflake implements this by maintaining a hidden history table for each table, efficiently tracking all changes.
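
To give a sense of the syntax, here's a rough sketch of Time Travel in SQL (the orders table and query ID are hypothetical, and the available retention window depends on your edition and table settings):

-- read the table as it looked one hour ago
SELECT * FROM orders AT (OFFSET => -60*60);

-- read the state just before a specific statement ran
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');

-- recover a table that was dropped by mistake
UNDROP TABLE orders;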

Databricks: Where Data Science Meets Big Data

Databricks, on the other hand, takes a different approach. It grew out of the Apache Spark project, and it shows in its focus on big data processing and machine learning.

Databricks' core innovation is the 'Lakehouse' architecture, built on Delta Lake. This open-source storage layer brings ACID transactions to object storage, addressing the reliability issues of traditional data lakes. Delta Lake achieves this through a transaction log that tracks all changes, enabling features like time travel and rollbacks. This approach combines the flexibility of data lakes with the reliability and performance of data warehouses, allowing organizations to store and process both structured and unstructured data in a single system.
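
As a rough illustration in Databricks SQL (the table name and rows below are made up), the transaction log is what enables history inspection, time travel and rollback:

-- Delta is the default table format on recent Databricks runtimes
CREATE TABLE events (id BIGINT, payload STRING) USING DELTA;
INSERT INTO events VALUES (1, 'created'), (2, 'updated');

DESCRIBE HISTORY events;                  -- inspect the transaction log
SELECT * FROM events VERSION AS OF 0;     -- time travel to an earlier version
RESTORE TABLE events TO VERSION AS OF 0;  -- roll the table back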

One interesting feature of Databricks is how it brings together different data roles. Data engineers, analysts and data scientists can all work in the same platform, using tools they're comfortable with, whether that's SQL, Python, R or Scala.

Choosing Between Snowflake and Databricks

The choice between Snowflake and Databricks often depends on the primary use case:

Snowflake and Databricks perform differently across various workloads. Snowflake handles complex SQL queries on structured data well, especially in concurrent query scenarios. This makes it strong for BI and analytics tasks. Databricks, built on Spark, processes large-scale datasets effectively, particularly for data science and machine learning jobs. It performs better with unstructured or semi-structured data processing.

Both platforms work with many data formats, but have different strengths. Snowflake natively handles structured and semi-structured data (JSON, Avro, Parquet) through its VARIANT data type. Databricks, using its Lakehouse architecture, manages both structured and unstructured data, including images and text files. For connecting with other tools, Snowflake links well with BI tools and data pipelines, while Databricks integrates deeply with big data and ML ecosystems.
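
For instance, a Snowflake VARIANT column can hold raw JSON and still be queried with path notation (the table and fields below are hypothetical):

-- land raw JSON in a VARIANT column
CREATE TABLE raw_events (payload VARIANT);
INSERT INTO raw_events
  SELECT PARSE_JSON('{"customer": {"id": 42, "country": "DE"}, "action": "click"}');

-- query nested fields and cast them to SQL types
SELECT payload:customer.id::NUMBER       AS customer_id,
       payload:customer.country::STRING  AS country,
       payload:action::STRING            AS action
FROM raw_events;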

Data governance and lineage tracking differ between the platforms. Snowflake focuses on access control and data sharing, with features for data lineage and impact analysis. Databricks uses its Unity Catalog for access control and detailed audit logs. Both platforms manage metadata, but Databricks ties this more closely to its ML workflows.

For machine learning, Databricks has an advantage with its integrated MLflow for experiment tracking and model management. It supports distributed training by default and simplifies model deployment through MLflow's model serving. Snowflake, newer to ML, is improving its support through Snowpark and integrations with Python libraries. However, it currently needs more setup for distributed training and model deployment than Databricks.


That said, these aren't hard and fast rules. Plenty of organizations use both: Snowflake for their data warehouse and BI needs, and Databricks for data science and ML workflows.

Orchestrating Snowflake and Databricks Workflows with Kestra

Modern data ecosystems often involve multiple specialized tools for various processing needs. This approach requires effective workflow coordination, which is where Kestra comes in. Kestra helps data engineering teams create dynamic, efficient workflows across different platforms.

Kestra is an open-source orchestration platform that makes building and managing complex data pipelines easier. It offers plugins for many different platforms and tools. The Kestra Snowflake plugin can act as both a direct target and a data repository in complex workflows, serving downstream application needs, while the Kestra Databricks plugin adds capabilities for data processing and transformation.

Using the Kestra Snowflake Plugin:

Kestra offers a JDBC plugin for Snowflake, which enables users to leverage Snowflake's data warehouse effectively. This integration supports tasks such as querying Snowflake tables, simplifying data movement and orchestrating Git workflows for dbt projects targeting Snowflake. You can even build dynamic data workflows with the Kestra Snowflake plugin.

Here's a commonly used snippet that executes a query and fetches the table data into Kestra's internal storage:

id: select
type: io.kestra.plugin.jdbc.snowflake.Query
url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com
username: snowflake
password: snowflake_passwd # in production, prefer "{{ secret('SNOWFLAKE_PASSWORD') }}"
sql: select * from source
fetch: true # store the query results in Kestra's internal storage

Simplifying Data Movement

  • Downloading from Snowflake: Transfer data to Kestra with features like data compression and role-based access control.

    id: "download"
    type: "io.kestra.plugin.jdbc.snowflake.Download"
    stageName: MYSTAGE
    fileName: prefix/destFile.csv
    
  • Uploading to Snowflake: Simplify uploads with options for data compression and transformation.

    id: "upload"
    type: "io.kestra.plugin.jdbc.snowflake.Upload"
    stageName: MYSTAGE
    prefix: testUploadStream
    fileName: destFile.csv
    

Implementing Git Workflow for dbt with Snowflake

Kestra also supports Git workflows for dbt projects targeting Snowflake, automating transformations and helping in collaborative project management.

Sample dbt workflow task:

id: dbt_snowflake
namespace: company.team

tasks:
  - id: git
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: clone_repository
        type: io.kestra.plugin.git.Clone
        url: https://github.com/kestra-io/dbt-example
        branch: main

      - id: dbt
        type: io.kestra.plugin.dbt.cli.DbtCLI
        docker:
          image: ghcr.io/kestra-io/dbt-snowflake:latest
        profiles: |
          my_dbt_project:
            outputs:
              dev:
                type: snowflake
                account: "{{ secret('SNOWFLAKE_ACCOUNT') }}"
                user: "{{ secret('SNOWFLAKE_USER') }}"
                password: "{{ secret('SNOWFLAKE_PASSWORD') }}"
                role: "{{ secret('SNOWFLAKE_ROLE') }}"
                database: "{{ secret('SNOWFLAKE_DATABASE') }}"
                warehouse: COMPUTE_WH
                schema: public
                threads: 4
                query_tag: dbt
        commands:
          - dbt deps
          - dbt build

Kestra's Snowflake plugin simplifies data warehousing with YAML, making it accessible even to non-developers. It supports both ETL and ELT pipelines, which allows you to transform or directly load raw data from various sources. Kestra efficiently manages both workflows, ensuring data is always ready for analysis and insights.

Using the Kestra Databricks Plugin:

Kestra also has a plugin for Databricks, which enables users to leverage the strengths of both platforms and offers a comprehensive solution for data orchestration and analytics. This integration supports use cases like dynamic compute cluster management, efficient data movement and streamlined data queries.

Dynamic Compute Cluster Management

By using Kestra’s CreateCluster and DeleteCluster tasks, you can dynamically adjust computational resources based on workflow needs. These tasks let users specify cluster parameters like name, node type, worker count and Spark version. They also provide detailed control over the Databricks compute cluster lifecycle.

id: databricks-create-cluster
namespace: company.team

tasks:
  - id: create-cluster
    type: io.kestra.plugin.databricks.cluster.CreateCluster
    authentication:
      token: <your-token>
      host: <your-host>
    clusterName: kestra-demo
    nodeTypeId: Standard_DS3_v2
    numWorkers: 1
    sparkVersion: 13.0.x-scala2.12


  - id: delete-cluster
    type: io.kestra.plugin.databricks.cluster.DeleteCluster
    authentication:
      token: <your-token>
      host: <your-host>
    clusterId: <cluster-id> # e.g. the id of the cluster created by the create-cluster task

Efficient Data Movement

Once a cluster is running, files can be uploaded to DBFS using Kestra's Upload task. Here's an example that downloads a CSV from a URL to Kestra's internal storage and then uploads it to DBFS:

id: databricks-upload
namespace: company.team
tasks:
  - id: http_download
    type: io.kestra.plugin.core.http.Download
    uri: "<https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv>"
  - id: uploadFile
    type: io.kestra.plugin.databricks.dbfs.Upload
    authentication:
      token: <your-token>
      host: <your-host>
    from: "{{ outputs.http_download.uri }}"
    to: /Shared/kestra/data/orders.csv

Streamlined Data Queries

SQL queries on Databricks are executed using Kestra's Query task. This task requires the httpPath, found on the JDBC/ODBC tab of the compute cluster page. Here's how to create a table and load a CSV file from DBFS:

id: databricks-load-table
namespace: company.team
tasks:
  - id: load-table-from-dbfs
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: <your-token>
    host: <your-host>
    httpPath: <your-httpPath>
    sql: "CREATE TABLE orders USING CSV LOCATION '/Shared/kestra/data/orders.csv' OPTIONS (header 'true', inferSchema 'true');"

Next, we will query the newly created table to get the top 10 orders:

id: databricks-query
namespace: company.team
tasks:
  - id: query-top-10-orders
    type: io.kestra.plugin.databricks.sql.Query
    accessToken: <your-token>
    host: <your-host>
    httpPath: <your-httpPath>
    sql: "SELECT * FROM orders ORDER BY total DESC LIMIT 10;"

Query results are stored in Kestra's internal storage and can be found in the Outputs tab. Kestra's integration of SQL execution in the orchestration workflow enhances Databricks' data processing capabilities. This enables more advanced data analysis and real-time, data-driven decision-making.

The Kestra Databricks plugin makes orchestration and data management straightforward, supporting both ETL and ELT workflows. Its declarative structure is easy to use, allowing even non-developers to manage Databricks pipelines effectively. This ensures that data is readily available for analysis and insights.

Conclusion

Both Snowflake and Databricks are powerful platforms with their own strengths. By understanding these strengths and using a tool like Kestra to orchestrate workflows between them, you can build a data stack that helps you scale. It's not about choosing one or the other; it's about using the right tool for each job and making them work together effectively.


If you found this article useful, share it with your peers and community.


If You ❤️ My Content! Connect Me on Twitter

Check SaaS Tools I Use 👉🏼Access here!

I am open to collaborating on Blog Articles and Guest Posts🫱🏼‍🫲🏼 📅Contact Here
