TLDR:
Use data structures that do not change once created (via pyrsistent) to make code easier to understand and maintain.
Purpose
When a data structure (e.g. a dict) does not change once created, we can read code with more confidence.
For example, say you have a customer variable in your code and you are tracking its value by reading the code. What would you reason the value of customer is below?
customer = dict(name="Rajiv", age=40)
some_function(customer)
print(customer)
In the above code, we can't say. In Python, most built-in data structures such as dict are mutable, so it is possible that some_function changed the value of customer. We have to dig in and read the code of some_function to be fully sure. Suppose the code of some_function was as below:
def some_function(a):
    a1 = a  # a1 is just another name for the same dict, not a copy
    a1['name'] = 'NewRajiv'  # Changing the values. Blasphemy!
    # do something with a1
then print(customer) would display {'name': 'NewRajiv', 'age': 40}.
If you are lucky, some_function does not pass customer on to other functions. Otherwise, you would have to dig in and read those functions too :). Now that would suck. Sometimes the intention really is that customer should be mutated, but in most cases one does not expect that (in some other languages, naming conventions are used to indicate when it is the case). A knowledgeable programmer may make a copy (via copy.deepcopy()) and work on the copy to prevent her code from affecting the caller's data, but I have not been that knowledgeable programmer :)
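The defensive copy mentioned above could look roughly like the sketch below. It reuses the article's some_function and customer names; returning the modified copy is an assumption made here only so the result can be inspected.

```python
import copy


def some_function(a):
    # Work on a private deep copy so the caller's dict is left untouched.
    a1 = copy.deepcopy(a)
    a1['name'] = 'NewRajiv'
    return a1


customer = dict(name="Rajiv", age=40)
changed = some_function(customer)

print(customer)  # {'name': 'Rajiv', 'age': 40}
print(changed)   # {'name': 'NewRajiv', 'age': 40}
```

This works, but it relies on every function remembering to copy, and a deep copy duplicates the whole structure each time.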
What if we could use a data structure that, once created, cannot be changed, i.e. it is immutable? Let's check out a library called pyrsistent that gives us such data structures.
from pyrsistent import m # m is like a dictionary
customer1 = m(name='Rajiv', age=40)
customer2 = customer1.set('name', 'NewRajiv')  # set takes a key and a value
print(customer1) # pmap({'age': 40, 'name': 'Rajiv'})
print(customer2) # pmap({'age': 40, 'name': 'NewRajiv'})
When we set a different value ('NewRajiv'), a copy is created with that new value and assigned to customer2. customer1 still retains the value it was first assigned. Now, let's go back to our previous code example and modify it a bit for pyrsistent:
from pyrsistent import m # m is like a dictionary
def some_function(a):
    a1 = a.set('name', 'NewRajiv')
    # do something with a1
customer = m(name="Rajiv", age=40)
some_function(customer)
print(customer)
print(customer) would display pmap({'name': 'Rajiv', 'age': 40}) (the key order in the output may differ), the value set in our code. So, we can safely reason about our code and what it's doing without worrying about customer changing inside some_function. We don't even have to look into some_function in this case. Trust me, when you can't run that snippet of code to see what the actual values are, this feature makes life so easy :).
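To see what happens if some code does try to change a pmap in place, here is a small sketch. It assumes pyrsistent is installed and that item assignment on a pmap raises TypeError, which is standard Python behaviour for objects that do not support item assignment. The mutated flag is a name introduced only for this illustration.

```python
from pyrsistent import m

customer = m(name="Rajiv", age=40)

try:
    customer['name'] = 'NewRajiv'  # in-place assignment, as with a dict
    mutated = True
except TypeError:
    # Assumption: a pmap does not support item assignment.
    mutated = False

print(mutated)
print(customer['name'])  # Rajiv
```

Either way, the original customer keeps its value. The only way to get a changed map is to ask for a new one with set.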
pyrsistent also has support for other common data structures (e.g. lists, sets) and much more. Most of these pyrsistent data structures are drop-in replacements for their Python counterparts when it comes to accessing the data.
From the pyrsistent docs:
from pyrsistent import v # like a list
a = v(1, 2, 3)
b = a.append(4)
print(b[1]) # 2
print(b[1:3]) # pvector([2, 3])
print([2 * x for x in b]) # [2, 4, 6, 8]
On Speed and Memory
I simplified (ok, I lied) when I said that pyrsistent makes a copy of the data structure. Copying every huge data structure in full would be a waste of memory and time. pyrsistent mitigates that to a great extent by not blindly copying data structures before making the modifications. It tries to be intelligent by sharing the common parts between the original data structure and the new modified copy to save on memory and time.
Let's take an example (credit: Wikipedia, Persistent Data Structures). Ah, you are now exposed to what this is really called. This concept is called persistent data structures or functional data structures.
NOTE: The example below is just to explain the concepts. Such a binary search tree is not part of pyrsistent.
Let's say you had a binary search tree (xs) which was a persistent data structure.
Now suppose you added a node e to that data structure, e.g. ys = insertNode(xs, e).
A naive implementation would copy the whole data structure and then insert e at the appropriate location. A persistent data structure approach avoids the full copy.
Since e falls into the right side of the tree (i.e. the subtree with g as the root), the subtree with root b is not affected and hence can be reused. In the Wikipedia diagram, this is shown by an arrow from d' (the new copy of d) to b.
This reuse saves memory, and it saves the time otherwise spent copying, since pointers simply refer to the unchanged data.
Note: just because a data structure is being reused does not mean modifications to one can affect another. They are immutable and hence cannot be modified. If you do "modify" one, a new structure is created as described above.
Note on Note: pyrsistent tries to be as fast as possible and performs comparably to the standard structures in most cases. The complexity of most operations is well described in its docs.
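The sharing described above can be sketched in plain Python. This is only an illustration of the idea, not how pyrsistent is implemented. Node and insert_node are hypothetical names introduced here, and the tree shape (root d, with b and g below it) is assumed from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert_node(root: Optional[Node], key: str) -> Node:
    """Return a new tree containing key; the input tree is never modified."""
    if root is None:
        return Node(key)
    if key < root.key:
        # Copy this node, rebuild only the left branch, share the right one.
        return Node(root.key, insert_node(root.left, key), root.right)
    if key > root.key:
        # Copy this node, share the left branch, rebuild only the right one.
        return Node(root.key, root.left, insert_node(root.right, key))
    return root  # key already present: reuse the whole subtree


xs = None
for k in ["d", "b", "g", "a", "c", "f", "h"]:
    xs = insert_node(xs, k)

ys = insert_node(xs, "e")

print(ys.left is xs.left)      # True: the subtree rooted at b is shared
print(ys.right is xs.right)    # False: g was copied on the path to e
print(xs.right.left.left)      # None: xs itself is unchanged
print(ys.right.left.left.key)  # e
```

Only the nodes on the path from the root to the new node are copied. Everything else is shared between xs and ys, which is safe because no node can ever be modified.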
Nested Transformations
What if we have to update a nested value in a data structure while maintaining immutability? pyrsistent has a method, transform, for that.
How I would normally do it:
import copy
m4 = dict(a=1, b=6, c=[1, 2])
# I want to update c[1] to 17
m4_new = copy.deepcopy(m4)
m4_new['c'][1] = 17
From their docs:
from pyrsistent import m # m is like a dictionary
from pyrsistent import v # v is like a list
m4 = m(a=5, b=6, c=v(1, 2))
m4_new = m4.transform(('c', 1), 17)
print(m4_new) # pmap({'a': 5, 'c': pvector([1, 17]), 'b': 6})
Updating dictionaries
One thing I do very often is merging dictionaries. For example, I may have to construct my configuration from the following sources, with the earliest in the list having the highest priority.
- Environment variables
- File configuration
- Default configuration
How I would normally do it:
default_conf = dict(database_url='dev_url', user='postgres', port=5432)
# Imagine file_conf below was extracted from a file
file_conf = dict(user='test_user', port=5433)
# Imagine env_conf below was constructed from environment variables
environment_conf = dict(database_url='test_url')
final_conf = {**default_conf, **file_conf, **environment_conf}
print(final_conf) # {'database_url': 'test_url', 'user': 'test_user', 'port': 5433}
That's great for 99% of the cases, I would think :). But for the sake of discussion, if you had HUGE dictionaries (e.g. merging all the data you scraped illegally from some website ;) ), that would mean some duplication of data in memory.
In pyrsistent:
from pyrsistent import m
default_conf = m(database_url='dev_url', user='postgres', port=5432)
# Imagine file_conf below was extracted from a file
file_conf = m(user='test_user', port=5433)
# Imagine env_conf below was constructed from environment variables
environment_conf = m(database_url='test_url')
final_conf = default_conf + file_conf + environment_conf
print(final_conf) # pmap({'database_url': 'test_url', 'user': 'test_user', 'port': 5433})
I hope this is enough to get you started on a better coding experience :). There are many other wonderful features in pyrsistent, like having the above behaviour for records (PRecord) and classes (PClass), and many more advanced features. I'll leave that for another post.
So head over to pyrsistent and check it out. And if you like it, don't forget to star it! It's a wonderful piece of engineering whose authors we should applaud and support.