Removing one or more columns from a pandas DataFrame is a pretty common task, but it turns out there are a number of possible ways to do it. This StackOverflow question, along with the solutions and discussion in it, raised a number of interesting topics, so it is worth digging into the details a little.
First, what’s the “correct” way to remove a column from a DataFrame? The standard way to do this is to think in SQL and use drop.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  columns=list("abcde"))
display(df)

try:
    df.drop('b')
except KeyError as ke:
    print(ke)
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
"['b'] not found in axis"
Wait, what? Why an error? That’s because the default axis that drop works with is the rows. As with many pandas methods, there’s more than one way to invoke the method (which some people find frustrating).
You can drop rows using axis=0 or axis='rows', or using the labels argument.
df.drop(0) # drop a row, on axis 0 or 'rows'
df.drop(0, axis=0) # same
df.drop(0, axis='rows') # same
df.drop(labels=0) # same
df.drop(labels=[0]) # same
a b c d e
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
Again, how do we drop a column?
We want to drop a column, so what does that look like? You can specify the axis or use the columns parameter.
df.drop('b', axis=1) # drop a column
df.drop('b', axis='columns') # same
df.drop(columns='b') # same
df.drop(columns=['b']) # same
a c d e
0 0 2 3 4
1 5 7 8 9
2 10 12 13 14
3 15 17 18 19
4 20 22 23 24
There you go, that’s how you drop a column. To make the change permanent, you either assign the result to a new variable, assign it back to your old variable, or pass in inplace=True.
df2 = df.drop('b', axis=1)
print(df2.columns)
print(df.columns)
Index(['a', 'c', 'd', 'e'], dtype='object')
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
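As a quick sketch of those three options side by side (the variable names df_new, df_same, and df_inplace are only for illustration), each of the following ends up with a frame that has no b column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

# Option 1: assign the result to a new variable; df itself is untouched
df_new = df.drop(columns="b")

# Option 2: assign the result back to the same name
df_same = df.copy()
df_same = df_same.drop(columns="b")

# Option 3: modify the existing object; drop returns None in this case
df_inplace = df.copy()
result = df_inplace.drop(columns="b", inplace=True)

print(list(df.columns))
print(list(df_new.columns))
print(list(df_same.columns))
print(list(df_inplace.columns))
print(result)
```

Note that with inplace=True there is nothing useful to assign, so writing df = df.drop(..., inplace=True) would replace your frame with None.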
It’s also worth noting that you can drop rows and columns at the same time by passing both the index and columns arguments to drop, and each of them can take multiple values.
df.drop(index=[0,2], columns=['b','c'])
a d e
1 5 8 9
3 15 18 19
4 20 23 24
If you didn’t have the drop method, you could obtain the same results through indexing. There are many ways to accomplish this, but one equivalent solution is to use the .loc indexer with isin and invert the selection.
df.loc[~df.index.isin([0,2]), ~df.columns.isin(['b', 'c'])]
a d e
1 5 8 9
3 15 18 19
4 20 23 24
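If you want to convince yourself that the two approaches agree, a small check along these lines should do it. It also shows one difference I am fairly sure of: drop complains about labels that don’t exist by default, while the isin version quietly ignores them. The label 'z' is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

dropped = df.drop(index=[0, 2], columns=["b", "c"])
selected = df.loc[~df.index.isin([0, 2]), ~df.columns.isin(["b", "c"])]

# Same rows, same columns, same values
same = dropped.equals(selected)
print(same)

# A label that is not in the frame: isin just selects everything...
lenient = df.loc[:, ~df.columns.isin(["z"])]

# ...while drop raises a KeyError unless told otherwise
try:
    df.drop(columns=["z"])
    drop_raised = False
except KeyError:
    drop_raised = True
print(drop_raised)
```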
If none of that makes sense to you, I would suggest reading through my series on selecting and indexing in pandas, starting here.
Back to the question
Looking back at the original question though, we see there is another available technique for removing a column.
del df['a']
df
b c d e
0 1 2 3 4
1 6 7 8 9
2 11 12 13 14
3 16 17 18 19
4 21 22 23 24
Poof! It’s gone. This is like doing a drop with inplace=True.
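Here is a small sketch of that comparison, using two copies of the same frame (df_del and df_drop are names chosen just for this example). As far as I know, del works on one column label at a time, while drop also accepts a list.

```python
import numpy as np
import pandas as pd

df_del = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))
df_drop = df_del.copy()

del df_del["a"]                          # handled by DataFrame.__delitem__
df_drop.drop(columns="a", inplace=True)  # same end result

same = df_del.equals(df_drop)
print(same)

# drop can remove several columns in one call
df_drop.drop(columns=["b", "c"], inplace=True)
print(list(df_drop.columns))
```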
What about attribute access?
We also know that we can use attribute access to select columns of a DataFrame.
df.b
0 1
1 6
2 11
3 16
4 21
Name: b, dtype: int64
Can we delete the column this way?
del df.b
--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-0dca358a6ef9> in <module>
---------> 1 del df.b
AttributeError: b
We cannot. This is not an option for removing columns with the current pandas design. Is this technically impossible? How come del df['b'] works but del df.b doesn’t? Let’s dig into those details and see whether it would be possible to make the second work as well.
The first version works because in pandas, the DataFrame implements the __delitem__ method, which gets invoked when you execute del df['b']. But what about del df.b, is there a way to handle that?
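To see that dispatch without pandas in the way, here is a tiny sketch with a made-up Tracer class that only records which keys reach __delitem__. Item deletion is routed to our method, while attribute deletion falls through to Python’s default behavior and fails because there is no attribute named b.

```python
class Tracer:
    """Hypothetical class that records the keys passed to __delitem__."""

    def __init__(self):
        self.calls = []

    def __delitem__(self, key):
        # del obj[key] ends up here
        self.calls.append(key)


t = Tracer()
del t["b"]          # routed to Tracer.__delitem__
print(t.calls)

try:
    del t.b         # not routed to __delitem__; default attribute deletion
    attr_delete_failed = False
except AttributeError:
    attr_delete_failed = True
print(attr_delete_failed)
```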
First, let’s make a simple class that shows how this works under the hood. Instead of being a real DataFrame, we’ll just use a dict as a container for our columns (which could really contain anything, we’re not doing any indexing here).
class StupidFrame:
    def __init__(self, columns):
        self.columns = columns

    def __delitem__(self, item):
        del self.columns[item]

    def __getitem__(self, item):
        return self.columns[item]

    def __setitem__(self, item, val):
        self.columns[item] = val

f = StupidFrame({'a': 1, 'b': 2, 'c': 3})
print("StupidFrame value for a:", f['a'])
print("StupidFrame columns: ", f.columns)
del f['b']
f.d = 4
print("StupidFrame columns: ", f.columns)
StupidFrame value for a: 1
StupidFrame columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrame columns: {'a': 1, 'c': 3}
A couple of things to note here. First, we see that we can access the data in our StupidFrame with the index operator ([]), and use that for setting, getting, and deleting items. Second, when we assigned d to our frame, it wasn’t added to our columns because it’s just a normal instance attribute. If we want to be able to handle the columns as attributes, we have to do a little bit more work.
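To see where that assignment ended up, one option is to look at the instance dictionary with vars(). This sketch repeats a trimmed-down StupidFrame so that it runs on its own:

```python
class StupidFrame:
    """Trimmed-down copy of the class above, for illustration only."""

    def __init__(self, columns):
        self.columns = columns

    def __getitem__(self, item):
        return self.columns[item]


f = StupidFrame({'a': 1, 'b': 2, 'c': 3})
f.d = 4

# d sits next to columns in the instance dictionary, not inside columns
print(vars(f))
print('d' in f.columns)
```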
So following the example from pandas (which supports attribute access of columns), we add the __getattr__ method, but we will also handle setting with the __setattr__ method and pretend that any attribute assignment is a ‘column’. We have to update our instance dictionary (__dict__) directly to avoid an infinite recursion.
class StupidFrameAttr:
    def __init__(self, columns):
        self.__dict__['columns'] = columns

    def __delitem__(self, item):
        del self.__dict__['columns'][item]

    def __getitem__(self, item):
        return self.__dict__['columns'][item]

    def __setitem__(self, item, val):
        self.__dict__['columns'][item] = val

    def __getattr__(self, item):
        if item in self.__dict__['columns']:
            return self.__dict__['columns'][item]
        elif item == 'columns':
            return self.__dict__[item]
        else:
            raise AttributeError

    def __setattr__(self, item, val):
        if item != 'columns':
            self.__dict__['columns'][item] = val
        else:
            raise ValueError("Overwriting columns prohibited")

f = StupidFrameAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameAttr value for a", f['a'])
print("StupidFrameAttr columns: ", f.columns)
del f['b']
print("StupidFrameAttr columns: ", f.columns)
print("StupidFrameAttr value for a", f.a)
f.d = 4
print("StupidFrameAttr columns: ", f.columns)
del f['d']
print("StupidFrameAttr columns: ", f.columns)
f.d = 5
print("StupidFrameAttr columns: ", f.columns)
del f.d
StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr value for a 1
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameAttr columns: {'a': 1, 'c': 3}
StupidFrameAttr columns: {'a': 1, 'c': 3, 'd': 5}
--------------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-fd29f59ea01e> in <module>
39 f.d = 5
40 print("StupidFrameAttr columns: ", f.columns)
--------> 41 del f.d
AttributeError: d
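As an aside, the reason the class writes through __dict__ can be shown in isolation. The two classes below are made up for this purpose: Recursive assigns an attribute inside its own __setattr__, which calls __setattr__ again without end, while Direct writes into the instance dictionary and so never re-enters the method.

```python
class Recursive:
    """Hypothetical class whose __setattr__ re-enters itself."""

    def __setattr__(self, name, value):
        # This assignment is itself an attribute assignment,
        # so it calls __setattr__ again, and again, and again
        self.store = {name: value}


class Direct:
    """Hypothetical class that bypasses __setattr__ via __dict__."""

    def __init__(self):
        self.__dict__['store'] = {}

    def __setattr__(self, name, value):
        # Writing into the instance dictionary does not trigger __setattr__
        self.__dict__['store'][name] = value


try:
    Recursive().x = 1
    recursed = False
except RecursionError:
    recursed = True
print(recursed)

d = Direct()
d.x = 1
print(d.__dict__)
```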
How could we handle deletion?
Everything works except deletion using attribute access. We handle setting and getting columns using both the array index operator ([]) and attribute access. But what about detecting deletion? Is that possible?
One way to do this is with the __delattr__ method, which is described in the data model documentation. If you define this method in your class, it will be invoked instead of updating the instance’s attribute dictionary directly. This gives us a chance to redirect the deletion to our columns container.
class StupidFrameDelAttr(StupidFrameAttr):
    def __delattr__(self, item):
        # trivial implementation using the data model methods
        del self.__dict__['columns'][item]

f = StupidFrameDelAttr({'a': 1, 'b': 2, 'c': 3})
print("StupidFrameDelAttr value for a", f['a'])
print("StupidFrameDelAttr columns: ", f.columns)
del f['b']
print("StupidFrameDelAttr columns: ", f.columns)
print("StupidFrameDelAttr value for a", f.a)
f.d = 4
print("StupidFrameDelAttr columns: ", f.columns)
del f.d
print("StupidFrameDelAttr columns: ", f.columns)
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'b': 2, 'c': 3}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
StupidFrameDelAttr value for a 1
StupidFrameDelAttr columns: {'a': 1, 'c': 3, 'd': 4}
StupidFrameDelAttr columns: {'a': 1, 'c': 3}
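One rough edge of the trivial version is that deleting a name that isn’t a column surfaces as a KeyError from the underlying dict, where callers of del obj.name would normally expect an AttributeError. A slightly more careful sketch might translate the error. ColumnFrame here is a hypothetical, self-contained stand-in for the classes above, not anything from pandas.

```python
class ColumnFrame:
    """Hypothetical stand-in that treats attributes as 'columns'."""

    def __init__(self, columns):
        self.__dict__['columns'] = columns

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails
        try:
            return self.__dict__['columns'][item]
        except KeyError:
            raise AttributeError(item) from None

    def __setattr__(self, item, val):
        self.__dict__['columns'][item] = val

    def __delattr__(self, item):
        # Redirect deletion to the columns dict, but report a missing
        # name the way attribute deletion normally does
        try:
            del self.__dict__['columns'][item]
        except KeyError:
            raise AttributeError(item) from None


f = ColumnFrame({'a': 1, 'c': 3})
f.d = 4
del f.d
print(f.columns)

try:
    del f.zzz
    missing_raised = False
except AttributeError:
    missing_raised = True
print(missing_raised)
```

A side effect of this choice is that del f.columns also raises AttributeError, so the container itself can’t be removed by accident.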
Now I’m not suggesting that attribute deletion for columns would be easy to add to pandas, but at least this shows how it could be possible. In the case of current pandas, deleting columns is best done using drop.
Also, it’s worth mentioning here that when you create a new column in pandas, you don’t assign it as an attribute. To better understand how to properly create a column, you can check out this article.
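A minimal sketch of that pitfall, assuming a reasonably recent pandas: assigning a list to a new attribute name sets a plain Python attribute (pandas emits a UserWarning about it, which is silenced here), while item assignment creates the column. The column name f is just an example.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)), columns=list("abcde"))

with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df.f = [1, 2, 3, 4, 5]      # plain attribute, not a column

created_by_attribute = "f" in df.columns
print(created_by_attribute)

df["f"] = [1, 2, 3, 4, 5]       # item assignment creates the column
created_by_item = "f" in df.columns
print(created_by_item)
```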
If you already knew how to drop a column in pandas, hopefully you understand a little bit more about how this works.
The post How to remove a column from a DataFrame, with some extra detail appeared first on wrighters.io.