Nicoda-27

Posted on Feb 15 • Edited on Feb 23

How to be Test Driven with Spark: Chapter 0 and 1 - Modern Python Setup

#python #tdd #uv #ruff

Chapter 0: Why this tutorial

The goal of this tutorial is to provide a way to easily be test driven with Spark on your local setup, without using cloud resources.

Before diving into Spark itself, we must first align on our setup environment to ease reproducibility; this is the focus of this article.

The official documentation describes how to create tests with pyspark.

It requires a Spark server with Spark Connect support, as described in the documentation.

As a reminder, this is how Spark Connect works: a dedicated server needs to be running so that your tests can connect to it and process the data as intended.

Why is it not enough?

Launching the server has extra prerequisites on your machine, namely a Java virtual machine.
Launching the server requires a specific script called start-connect-server.sh which is to be found

Some data engineers might argue that they can simply test against a Spark server that is already deployed, but this approach has several drawbacks:

You are charged for launching simple tests or running experiments, which keeps cloud providers very happy.
You slow down the developer feedback loop, which is the time needed to implement a feature and validate that no regression has been introduced. A developer is more confident that nothing has regressed when all tests are executed.
You create external dependencies that you have no control over. You might run into testing issues when the cloud provider is down, when you have no internet access, or when someone changes the server configuration by accident.

The goal is to have a test environment that is self-descriptive, quick to set up, quick to start and reliable.

Chapter 1: Setup

In this chapter, multiple tools will be introduced and set up. The intent is to have a clean Python environment to reproduce the code. This is a very opinionated section, but it might be useful to challenge your existing tools against it.

Python version management

Mise will be leveraged to handle Python versions. It claims to be "the front-end to your dev env", and it will be used to install specific versions of languages and tools.

It can be used for much more, and it is strongly advised to look at the documentation to understand the true power of this tool, which is not limited to Python development.

Mise first needs to be installed; see the documentation for further instructions. You can launch the following:

curl https://mise.run | sh

Once installed, you will have to customize your .bashrc or your .zshrc (or the equivalent for your shell) to activate mise in your terminal.

echo 'eval "$(~/.local/bin/mise activate bash)"' >> ~/.bashrc

Mise can now be used to install python at a specific version with the following command:

mise install python@3.12

It will download a pre-compiled version of python and make it available globally.

Let's now use it. Position yourself at the root of the project and launch:

mise use python@3.12

It will create a mise.toml file with the following section

[tools]
python = "3.12"

It will also create a .python-version file containing:

3.12

With the help of these files, mise will be able to activate when you are located at the root of your project. They are also a great way to document the project's requirements for other contributors without relying on a README that easily becomes outdated.
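To illustrate how such a pin file documents the expected interpreter, here is a minimal sketch that compares a pinned version string, like the one in .python-version, with the running interpreter. The helper name matches_pinned_version is hypothetical, and the sketch assumes the pin is a plain dotted number such as 3.12; other pin formats are not handled.

```python
import sys


def matches_pinned_version(
    pinned: str,
    version_info: tuple[int, ...] = tuple(sys.version_info[:3]),
) -> bool:
    """Return True when version_info starts with the pinned dotted prefix.

    Hypothetical helper: assumes the pin looks like "3.12" or "3.12.8".
    """
    # "3.12\n" -> (3, 12)
    parts = tuple(int(part) for part in pinned.strip().split("."))
    # Compare only as many components as the pin specifies.
    return tuple(version_info[: len(parts)]) == parts
```

Calling matches_pinned_version("3.12") with no second argument checks the interpreter that is currently running, which could serve as a quick sanity check at the start of a script.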

Python dependency management

A tool to help us add, remove and download dependencies is necessary. Uv will be used, as it's very fast and easy to use.

To install it, you can follow the official documentation; in this tutorial, mise will be leveraged instead:

mise use uv

This will both install and set up uv for the project. See how mise.toml (https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml) has been modified with the addition of:

[tools]
python = "3.12"
uv = "latest"

Now it can be used to initialize the project, namely:

uv init

This will create a folder structure for you and a hello.py. In this project, we have customized it a bit to add a tests folder and a pyspark_tdd package as part of src, so it looks like:

.
├── src
│   └── hello.py
├── tests
├── .python-version
├── mise.toml
└── pyproject.toml

Ignoring files

Every repository needs a list of files that must be kept out of commits. This is done via a .gitignore file, and you can leverage existing templates for your language of choice.

If you start a project from scratch, you will first need to set up git:

git init

GitHub maintains .gitignore templates for each language. You can leverage one with:

curl -L -o .gitignore https://raw.githubusercontent.com/github/gitignore/refs/heads/main/Python.gitignore

In this project, the Python template is used.

Adding formatting and linting

Python

Linters and formatters are powerful tools to enforce code writing rules among developers. They take away the pain of having to care about how the code is written at the syntax level.

Ruff will be leveraged to format our Python code, as it's very powerful and can run on every file save without noticeable latency.

Ruff will be added as a project dev dependency. A dev dependency is one that the project does not need in order to run; it can be related to tests, experimentation, formatting, etc. Everything that is not meant to be shipped to production must be kept as a dev dependency, so that your Python package stays as self-contained as possible.

We can add ruff like so:

uv add ruff --dev

This will add a dev dependency in the pyproject.toml with

[dependency-groups]
dev = [
    "ruff>=0.8.4",
]

Running uv add will also create a .venv in the current working directory. You might notice that the .venv is ignored by git, which is intended. Indeed, you don't want to commit your .venv directory, as it's a copy of the dependencies of your project and can be quite large.

It will also create a uv.lock that records the versions of your direct dependencies and of your indirect dependencies (the dependencies of your dependencies). This mechanism segregates the dependencies of your project from the rest of your system.

Your project should now look like:

.
├── .venv
├── src
│   └── hello.py
├── tests
├── .python-version
├── .gitignore
├── mise.toml
├── pyproject.toml
└── uv.lock

Other languages

A project is not just Python files; it also contains configuration, pipelines, documentation, etc. These files need to be formatted too.

How these files are formatted is documented using editorconfig.
We will use the example from the editorconfig website.

Your Integrated Development Environment (IDE)

Whichever IDE you use, it's very important to set up formatting on file save, to save time and remove the pain of handling it by hand.

If you are using VSCode, you can install the ruff extension and add the following to your settings.json:

"editor.formatOnSave": true,
"[python]": {
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
        "source.fixAll": "explicit",
        "source.organizeImports": "explicit"
    },
    "editor.defaultFormatter": "charliermarsh.ruff"
},

The first test

To see if everything works as expected, you will write a very simple unit test. In a test driven approach, the test is written before the source code.

A test framework is required to automate the tests; pytest will be used. You need to add it as a dev dependency:

uv add pytest --dev

You can create a tests/test_dummy.py with the following code:

from your_python_package.multiply import multiply


def test_my_dummy_function():
    assert multiply(1, 2) == 2

This requires a multiply function, which can be defined in src/your_python_package/multiply.py:

def multiply(a: int, b: int) -> int:
    """Return the product of a and b."""
    return a * b
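As a hedged follow-up in the same test driven spirit, the sketch below checks multiply against a small table of cases. The function is inlined here only so the sketch runs on its own; in the project it would be imported from your_python_package.multiply. The test name and the case values are illustrative assumptions.

```python
def multiply(a: int, b: int) -> int:
    # Inlined copy so the sketch is self-contained; in the project,
    # import it from your_python_package.multiply instead.
    return a * b


# (a, b, expected) triples; the values are illustrative.
CASES = [
    (1, 2, 2),
    (0, 5, 0),
    (-3, 4, -12),
    (7, 1, 7),
]


def test_multiply_cases():
    # pytest collects any function whose name starts with "test_".
    for a, b, expected in CASES:
        assert multiply(a, b) == expected, f"multiply({a}, {b}) != {expected}"
```

Adding a new expectation then means adding one line to the table before touching the implementation.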

You can now run the tests. First, make sure you're using the right Python from the .venv:

which python

This should display something like $HOME/somepath/your_project/.venv/bin/python. If not, open a new terminal; mise should then be able to resolve it.
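The same check can be sketched from inside Python. This is a minimal illustration relying on the fact that, in a virtual environment, sys.prefix differs from sys.base_prefix; the function names are hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs from a virtual environment."""
    # In a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    kind = "virtual environment" if in_virtualenv() else "base interpreter"
    return f"{sys.executable} ({kind})"
```

Printing describe_interpreter() shows which executable is running and whether it comes from a virtual environment.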

Then run

pytest

Then it will display an error:

tests/test_dummy.py:1: in <module>
    from your_python_package.multiply import multiply
E   ModuleNotFoundError: No module named 'your_python_package'

You need to add an extra entry for pytest to detect the src layout. In pyproject.toml, you can add:

[tool.pytest.ini_options]
pythonpath = ["src"]
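To build some intuition for what this setting achieves, here is a rough sketch of the underlying idea: once the src directory is on sys.path, packages located inside it become importable. This only illustrates the mechanism and is not how pytest is implemented; the package name demo_src_package and the function name are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_src_layout() -> int:
    """Create a throwaway src layout, put src on sys.path, import from it."""
    with tempfile.TemporaryDirectory() as project_root:
        # Build <project_root>/src/demo_src_package/multiply.py
        package_dir = Path(project_root) / "src" / "demo_src_package"
        package_dir.mkdir(parents=True)
        (package_dir / "__init__.py").write_text("")
        (package_dir / "multiply.py").write_text(
            "def multiply(a: int, b: int) -> int:\n    return a * b\n"
        )

        src_path = str(Path(project_root) / "src")
        sys.path.insert(0, src_path)  # the step that makes the import work
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_src_package.multiply")
            return module.multiply(1, 2)
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(src_path)
            sys.modules.pop("demo_src_package.multiply", None)
            sys.modules.pop("demo_src_package", None)
```

Without the sys.path.insert line, the import would fail with the same ModuleNotFoundError as above.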

Now

pytest

should display

============================================================= test session starts ===============================================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/somepath/src/your_project
configfile: pyproject.toml
collected 1 item                                                                                                                                 

tests/test_dummy.py .                                                                                                                      [100%]

=============================================================== 1 passed in 0.01s ================================================================

You can do some housekeeping and remove the unnecessary src/your_python_package/hello.py.

You now have a proper setup to start working.

What's next

Now that one test is implemented, continuous integration (CI) must be set up. In a collaborative way of working, the CI is the only source of truth to tell whether something is broken or not.

Notice that we still have not touched any Spark components; it's very important to have a clean, reproducible codebase before diving in.

That will be the topic of the next chapter.

You can find the original materials in spark_tdd. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter.

[23/02/25 UPDATE]: Chapter 2 has been released
