Kristof Bruyninckx

Posted on Feb 24

Data oriented python

#python #coding

There is a plethora of options in Python to bundle data together. You could use built-in types such as lists, tuples and dicts. You could also create a class, a named tuple or a dataclass, or go even further and use external libraries such as attrs or pydantic. But when should you use which, and what are the benefits and downsides of each approach? That is the question I hope to tackle in this blog post.

Tuples and Lists

The most basic options, the built-in tuples and lists, are very easy to use but also easy to misuse. Imagine you want to represent a 2-dimensional point of natural numbers. You could do this with a tuple.

>>> p1 = (1,2)
>>> print(p1[0]) # print the x-coordinate
1

While this works, it has a number of downsides. The knowledge that this is a point with x and y coordinates is not explicit. You may be able to decipher this from the context and variable naming, but this makes the code less readable, especially when you access the individual x and y coordinates, as this is done by indexing.

The lack of readability and intent becomes more apparent if we consider functions that take points as their input.

import math

def distance(p1: tuple[int, int], p2: tuple[int, int]) -> float:
    return math.sqrt((p2[0]-p1[0])**2 + (p2[1]-p1[1])**2)

Of course, we could create a type alias so that it becomes slightly more readable, but that only helps partially.

type Point2D = tuple[int, int]  # Note that this syntax is Python 3.12+

def distance(p1: Point2D, p2: Point2D) -> float:
    return math.sqrt((p2[0]-p1[0])**2 + (p2[1]-p1[1])**2)
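To make the limits of the alias a bit more concrete, here is a minimal sketch. It uses the older assignment-style alias so that it also runs on Python versions before 3.12, and the `width_height` variable is a made-up example of a tuple that was never meant to be a point.

```python
import math

# Assignment-style alias; works on Python 3.9+ (the `type` statement needs 3.12+).
Point2D = tuple[int, int]


def distance(p1: Point2D, p2: Point2D) -> float:
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)


print(distance((0, 0), (3, 4)))  # 5.0

# The alias is only another name for tuple[int, int]: any pair of ints is
# accepted, including tuples that were never meant to be points.
width_height = (3, 4)  # hypothetical: an image size, not a point
print(distance((0, 0), width_height))  # 5.0, no complaint
```

So the alias documents intent for the reader, but it does not create a distinct type.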

Also note that, as tuples are immutable, you can't change the values:

>>> p1[0] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

One last thing to note is that tuples, like other sequence types such as lists, offer equality by content rather than by reference out of the box. This is rather desirable for our example.

>>> p1 = (1, 2)
>>> p2 = (1, 2)
>>> p1 == p2 
True

We could represent the point as a list instead, which would give us mutability, but this also gives us access to all kinds of unwanted methods that can be misused to introduce bugs.

>>> p1 = [1, 2]
>>> p1[0] = 3 # we can update the value now
>>> p1.append(4) # this really does not make sense in the context of a 2d point
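As a rough illustration of how such misuse can slip through unnoticed, consider the sketch below. The `distance` function is the same calculation as before without type hints, and the variable names are only chosen for the example.

```python
import math


def distance(p1, p2):
    # Only looks at index 0 and 1, so extra elements go unnoticed.
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)


origin = [0, 0]
p1 = [3, 4]
p1.append(12)  # now a "3d point", but nothing complains

print(len(p1))               # 3
print(distance(origin, p1))  # 5.0, the third value is silently ignored

# Lists are also shared by reference, so a change made through one name
# is visible through every other name bound to the same list.
p2 = p1
p2[0] = 0
print(p1)  # [0, 4, 12]
```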

Named tuples

A better option would be to use named tuples.

from typing import NamedTuple

class Point2D(NamedTuple):
    x: int
    y: int

>>> p1 = Point2D(x=1, y=2)

If we now rewrite the previous distance function, it looks quite good.

def distance(p1: Point2D, p2: Point2D) -> float:
    return math.sqrt((p2.x-p1.x)**2 + (p2.y-p1.y)**2)
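Put together as one self-contained script, the named tuple version could look like the sketch below. The points `a` and `b` are just example values.

```python
import math
from typing import NamedTuple


class Point2D(NamedTuple):
    x: int
    y: int


def distance(p1: Point2D, p2: Point2D) -> float:
    return math.sqrt((p2.x - p1.x) ** 2 + (p2.y - p1.y) ** 2)


a = Point2D(x=0, y=0)
b = Point2D(x=3, y=4)

print(b)               # Point2D(x=3, y=4)
print(distance(a, b))  # 5.0
print(b.x, b[0])       # 3 3 (fields are reachable by name and by index)
```

Note that you also get a readable representation for free.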

While this is a good option, named tuples are always immutable. Being tuples, they also support every operation that regular tuples support, which is not always what you want. For example:

>>> p1 = Point2D(x=1, y=2)
# Some tuple operations
>>> x, y = p1  # unpacking
>>> p1 == (1, 2)  # compares equal to any tuple with the same content
True
>>> Point2D(x=1, y=3) + Point2D(x=1, y=3)  # concatenation, not exactly desirable in our context
(1, 3, 1, 3)
>>> for el in p1:  # iteration
...     print(el)
...
1
2

Given that you can compare with any tuple, this allows comparing apples and oranges without complaint: you might accidentally compare a point with a tuple containing completely unrelated data.
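Immutability does not mean you can never derive a changed point. As a small sketch, the `_replace` and `_asdict` methods that every named tuple provides can be used like this:

```python
from typing import NamedTuple


class Point2D(NamedTuple):
    x: int
    y: int


p1 = Point2D(x=1, y=2)

try:
    p1.x = 3  # named tuple fields are read-only
except AttributeError as exc:
    print(f"AttributeError: {exc}")

# _replace returns a new instance and leaves the original untouched.
p2 = p1._replace(x=3)
print(p1)  # Point2D(x=1, y=2)
print(p2)  # Point2D(x=3, y=2)

# _asdict gives the fields as a regular dict.
print(p2._asdict())  # {'x': 3, 'y': 2}
```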

Regular Classes

So what else can we do? We could make use of a regular class.

class Point2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y

But this is not enough. Aside from this being slightly more code than the named tuple, we'll have to implement some custom dunder methods, or we get this:

>>> Point2D(1, 2) == Point2D(1, 2)  # by default, object equality is by reference
False
>>> print(Point2D(1, 2))
<__main__.Point2D object at 0x78d45a6dd890>

This makes sense: regular classes are used in many different contexts, so the global default should not narrow down their use cases. For data-oriented classes, however, you end up having to write a lot of boilerplate (commonly at least __init__, __eq__ and __repr__).
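To give an idea of what that boilerplate looks like, here is one possible hand-written version. It is a sketch of a common pattern, not the only way to write these methods.

```python
class Point2D:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other: object) -> bool:
        # Only compare with other Point2D instances; for anything else,
        # let Python try the reflected comparison and fall back to identity.
        if not isinstance(other, Point2D):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self) -> str:
        return f"Point2D(x={self.x!r}, y={self.y!r})"


print(Point2D(1, 2) == Point2D(1, 2))  # True
print(Point2D(1, 2) == (1, 2))         # False
print(Point2D(1, 2))                   # Point2D(x=1, y=2)
```

One more thing to be aware of: a class that defines __eq__ without also defining __hash__ becomes unhashable, so these points cannot be put in a set or used as dict keys without writing even more code.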

Dataclasses

In come dataclasses. By using a simple decorator, you automatically get implementations of the dunder methods mentioned above.

from dataclasses import dataclass

@dataclass
class Point2D:
    x: int
    y: int

Now, we automatically get the following:

>>> Point2D(1,2) == Point2D(1,2)
True
>>> print(Point2D(1,2))
Point2D(x=1, y=2)

You can customize things to your specific needs by providing some arguments to the decorator as below:

@dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False, match_args=True, kw_only=False, slots=False, weakref_slot=False)

So you can really do a lot with a single line of code. Notably, you can use frozen=True to make your dataclass immutable. Also note that if you set frozen=True and keep eq=True (the default), you automatically get a sensible hash function, so that you can use your objects in contexts that require one, such as in sets and as dict keys.
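As a minimal sketch of what frozen=True gives you in practice (the `visited` set is just an example name):

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point2D:
    x: int
    y: int


p1 = Point2D(1, 2)

try:
    p1.x = 3  # assignment is blocked on a frozen dataclass
except FrozenInstanceError as exc:
    print(f"FrozenInstanceError: {exc}")

# replace() builds a modified copy instead.
p2 = replace(p1, x=3)
print(p2)  # Point2D(x=3, y=2)

# frozen=True together with the default eq=True makes instances hashable.
visited = {Point2D(1, 2), Point2D(1, 2), Point2D(3, 2)}
print(len(visited))  # 2

# Unlike a named tuple, a dataclass does not compare equal to a plain tuple.
print(p1 == (1, 2))  # False
```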

Conclusion

By default, I suggest using the option that is the most restrictive, in the sense that it should allow exactly what you need and nothing more. This is the safest in terms of preventing misuse. Often this is a choice between dataclasses and pydantic, which I will briefly discuss below. I only use NamedTuple if I have a very specific reason to stay compatible with regular tuples (for example, if I am refactoring existing code and want to retain backwards compatibility).

Pydantic is great, but don't just use it willy-nilly. If you need to do runtime validation on inputs, whether that is loading data from files or any other place where you deal with user input, it tends to be a great fit. As an example, FastAPI is a wonderful framework that uses pydantic for input validation in the context of building web APIs.
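To see why runtime validation needs something extra, note that the type annotations on a dataclass are not checked when an instance is created. The sketch below shows this, followed by a hand-written check in `__post_init__`. `ValidatedPoint2D` is a hypothetical name, and the manual check only illustrates the kind of work a validation library takes off your hands; it is not how pydantic itself works.

```python
from dataclasses import dataclass


@dataclass
class Point2D:
    x: int
    y: int


# Annotations are hints only: this is accepted at runtime.
p = Point2D(x="not a number", y=2)
print(p)  # Point2D(x='not a number', y=2)


@dataclass
class ValidatedPoint2D:  # hypothetical name, for illustration only
    x: int
    y: int

    def __post_init__(self) -> None:
        # Runs right after the generated __init__.
        for name in ("x", "y"):
            value = getattr(self, name)
            if not isinstance(value, int):
                raise TypeError(f"{name} must be an int, got {type(value).__name__}")


try:
    ValidatedPoint2D(x="not a number", y=2)
except TypeError as exc:
    print(f"TypeError: {exc}")  # TypeError: x must be an int, got str
```

Writing such checks by hand for every field quickly becomes boilerplate of its own, which is where a dedicated validation library starts to pay off.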

Finally, though I didn't explicitly cover it because I have never used it, there is also the attrs package. It predates the standard library dataclasses and seems to be a bit more powerful, but also more complex. I suggest you have a look at it (as I probably will as well) if you encounter a use case that you can't handle with dataclasses.

DEV Community
