There is a plethora of options in Python to bundle data together. You could use built-in types such as lists, tuples and dicts. You could also create a class, a named tuple or a dataclass, or go even further and use external libraries such as attrs or pydantic. But when should you use which, and what are the benefits and downsides of each approach? That is the question I hope to tackle in this blog post.
Tuples and Lists
The most basic options, the built-in tuples and lists, are very easy to use, but also easy to misuse. Imagine you want to represent a 2-dimensional point of natural numbers. You could do this with a tuple.
>>> p1 = (1,2)
>>> print(p1[0]) # print the x-coordinate
1
While this works, it has a number of downsides. The knowledge that this is a point with x and y coordinates is not explicit. You may be able to decipher it from the context and variable naming, but this makes the code less readable, especially when you access the individual x and y coordinates, as this is done by indexing.
The lack of readability and intent becomes more apparent if we consider functions that take points as their input.
import math
def distance(p1: tuple[int, int], p2: tuple[int, int]) -> float:
    return math.sqrt((p2[0]-p1[0])**2 + (p2[1]-p1[1])**2)
Of course, we could create a type alias so that it becomes slightly more readable, but that only helps partially.
type Point2D = tuple[int, int]  # Note that this syntax is Python 3.12+
def distance(p1: Point2D, p2: Point2D) -> float:
    return math.sqrt((p2[0]-p1[0])**2 + (p2[1]-p1[1])**2)
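To illustrate why the alias only helps partially, here is a minimal sketch. It assumes the plain-assignment form of the alias so that it also runs on Python versions before 3.12, and `screen_size` is a hypothetical value I made up for the example.

```python
import math

# Plain-assignment alias; works on Python 3.9+ (the `type` statement needs 3.12+).
Point2D = tuple[int, int]


def distance(p1: Point2D, p2: Point2D) -> float:
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)


origin: Point2D = (0, 0)
screen_size = (3, 4)  # hypothetical (width, height) pair, not a point

# The alias is only another name for tuple[int, int], so any pair of ints is
# accepted here, even one that does not represent a point at all.
result = distance(origin, screen_size)
print(result)  # 5.0
```

The alias documents intent for the reader, but it does not create a distinct type, so nothing distinguishes a point from any other pair of integers.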
Also note that, because tuples are immutable, you can't change the values:
>>> p1[0] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
One last thing to note is that tuples, like other sequence types such as list, offer equality by content rather than reference out of the box. This is one thing that is rather desirable for the example.
>>> p1 = (1, 2)
>>> p2 = (1, 2)
>>> p1 == p2
True
We could instead represent the point as a list, which gives us mutability, but also gives us access to all kinds of unwanted methods that can be misused to introduce bugs.
>>> p1 = [1, 2]
>>> p1[0] = 3 # we can update the value now
>>> p1.append(4) # this really does not make sense in the context of a 2d point
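As a rough sketch of how such a bug can stay hidden, consider the distance function from before with list-based points. The accidental `append` call is a made-up scenario for illustration.

```python
import math


def distance(p1: list[int], p2: list[int]) -> float:
    """Euclidean distance using only the first two elements of each list."""
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)


p1 = [0, 0]
p2 = [3, 4]
p2.append(12)  # accidental third "coordinate"; nothing stops this

print(len(p2))           # 3
print(distance(p1, p2))  # 5.0, the extra element is silently ignored
print(p2 == [3, 4])      # False, although the x and y values are unchanged
```

The distance calculation keeps working, so the corrupted point only shows up later, for example when an equality check unexpectedly fails.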
Named Tuples
A better option would be to use named tuples.
from typing import NamedTuple

class Point2d(NamedTuple):
    x: int
    y: int
>>> p1 = Point2d(x=1, y=2)
If we now rewrite the previous distance function, it looks quite good.
def distance(p1: Point2d, p2: Point2d) -> float:
    return math.sqrt((p2.x-p1.x)**2 + (p2.y-p1.y)**2)
While this is a good option, named tuples are always immutable. They also, being tuples, support all operations that regular tuples support. For example:
>>> p1 = Point2d(x=1, y=2)
>>> # Some tuple operations
>>> x, y = p1
>>> p1 == (1, 2)
True
>>> # Any tuple with equal contents compares equal, so you can compare apples
>>> # and oranges without complaint: you might accidentally compare a point
>>> # with a tuple containing completely unrelated data.
>>> Point2d(x=1, y=3) + Point2d(x=1, y=3)
(1, 3, 1, 3)
>>> # Not exactly desirable in our context.
>>> for el in p1:
...     pass  # iterating over the coordinates works as well
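The apples-and-oranges problem can be sketched with two unrelated named tuples. The `Size` class is a hypothetical type I introduce purely for illustration.

```python
from typing import NamedTuple


class Point2d(NamedTuple):
    x: int
    y: int


class Size(NamedTuple):  # hypothetical, unrelated type (assumption for this sketch)
    width: int
    height: int


point = Point2d(x=1, y=2)
size = Size(width=1, height=2)

# Equality and hashing come from tuple, so the field names and the class are ignored.
print(point == size)  # True

labels = {point: "a point"}
print(labels[size])  # a point

# Named tuples are immutable: assigning to a field raises AttributeError.
try:
    point.x = 5
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False
```

So a `Size` can silently stand in for a `Point2d`, both in comparisons and as a dict key.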
Regular Classes
So what else can we do? We could make use of a regular class:
class Point2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y
But this is not enough. Aside from this being slightly more code than the named tuple, we'll have to implement some custom dunder methods, or we get this:
>>> Point2D(1,2) == Point2D(1,2)
False # By default, object equality is by reference
>>> print(Point2D(1,2))
<__main__.Point2D object at 0x78d45a6dd890>
This makes sense: regular classes are used in many different contexts, so the global default should not narrow down their use cases. However, for data-oriented classes you would end up having to write a lot of boilerplate (commonly at least __init__, __eq__ and __repr__).
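To make the boilerplate concrete, here is a minimal sketch of what a hand-written version could look like. The exact choices, such as returning `NotImplemented` for other types, are my own and not the only reasonable ones.

```python
class Point2D:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other: object) -> bool:
        # Only compare with other Point2D instances; let Python handle the rest.
        if not isinstance(other, Point2D):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self) -> str:
        return f"Point2D(x={self.x!r}, y={self.y!r})"


print(Point2D(1, 2) == Point2D(1, 2))  # True
print(Point2D(1, 2))                   # Point2D(x=1, y=2)
print(Point2D(1, 2) == (1, 2))         # False
```

Note that defining __eq__ without also defining __hash__ makes the instances unhashable, so a complete version would need even more code.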
Dataclasses
In come dataclasses. By using a simple decorator you automatically get implementations for those dunder methods mentioned above.
from dataclasses import dataclass

@dataclass
class Point2D:
    x: int
    y: int
Now, we automatically get the following:
>>> Point2D(1,2) == Point2D(1,2)
True
>>> print(Point2D(1,2))
Point2D(x=1, y=2)
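It is also worth seeing how the dataclass behaves in the situations where the named tuple was too permissive. The sketch below repeats the class definition so that it runs on its own.

```python
import math
from dataclasses import dataclass


@dataclass
class Point2D:
    x: int
    y: int


def distance(p1: Point2D, p2: Point2D) -> float:
    """Euclidean distance between two points, using named attributes."""
    return math.sqrt((p2.x - p1.x) ** 2 + (p2.y - p1.y) ** 2)


print(distance(Point2D(0, 0), Point2D(3, 4)))  # 5.0

# The generated __eq__ only compares instances of the same class.
print(Point2D(1, 2) == (1, 2))  # False

# A dataclass is not a sequence, so concatenation is rejected.
try:
    Point2D(1, 3) + Point2D(1, 3)
    concatenated = True
except TypeError:
    concatenated = False
print(concatenated)  # False
```

So we keep the readable attribute access, while comparing with a plain tuple or concatenating two points no longer works by accident.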
You can customize things to your specific needs by passing arguments to the decorator. Its full signature is:
@dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False, match_args=True, kw_only=False, slots=False, weakref_slot=False)
So you can really do a lot with a single line of code. Notably, you can use frozen=True to make your dataclass immutable. Also note that if you provide both frozen=True and eq=True, you automatically get a sensible hash function, so that you can use your object in contexts that require it, such as in sets and as dict keys.
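A minimal sketch of a frozen dataclass, relying on the default eq=True, could look as follows. The `visited` set and `labels` dict are made-up names for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)  # eq=True is the default, so a hash function is generated
class Point2D:
    x: int
    y: int


p1 = Point2D(1, 2)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    p1.x = 3
    mutated = True
except FrozenInstanceError:
    mutated = False
print(mutated)  # False

# Equal points hash the same, so duplicates collapse in a set.
visited = {Point2D(1, 2), Point2D(1, 2), Point2D(3, 4)}
print(len(visited))  # 2

# Points can also be used as dict keys.
labels = {Point2D(0, 0): "origin"}
print(labels[Point2D(0, 0)])  # origin
```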
Conclusion
By default, I suggest using the most restrictive option, in the sense that it should allow exactly what you need and nothing more. This is the safest choice in terms of preventing misuse. Often this is a choice between dataclasses and pydantic, which I will briefly discuss below. I only use NamedTuple if I have a very specific reason to stay compatible with regular tuples (for example, if I am refactoring existing code and want to retain backwards compatibility).
Pydantic is great, but don't just use it willy-nilly. If you need runtime validation of inputs, whether you are loading data from files or dealing with user input anywhere else, it tends to be a great fit. As an example, FastAPI is a wonderful framework that uses pydantic for input validation in the context of building web APIs.
Finally, though I didn't cover it explicitly because I have never used it, there is also the attrs package. It predates the standard library dataclasses and seems to be a bit more powerful, but also more complex. I suggest you have a look at it (as I probably will as well) if you encounter a use case that you can't handle with dataclasses.