Recently, I was given a dataset that contains sensitive information about customers and that must not, under any circumstances, be made public. The dataset resides on one of our servers, which I deem a reasonably secure location. Still, I wanted to copy the data to my local drive in order to work with it more comfortably, without having to fear that it would be any less safe. So I wrote a little script that changes the data while still preserving some key information. In this post, I will detail all the steps I took and highlight some handy tricks along the way.
The Task
The task is to prepare a dataset such that it can later be used for machine learning purposes (e.g. classification, regression, clustering) without containing any sensitive information. The final dataset should not differ too much from the original and should reflect the original dataset's distributions.
Let's go
I will be using a Jupyter notebook as my environment. First, let's import all the necessary libraries.
import pandas as pd
import numpy as np
import scipy.stats
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder
# get rid of warnings
import warnings
warnings.filterwarnings("ignore")
# get more than one output per Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# for functions we implement later
from utils import best_fit_distribution
from utils import plot_result
I am going to assume that you are already familiar with most of the libraries used here, so I just want to highlight three things. sklearn_pandas is a convenient library that tries to bridge the gap between the two packages. It provides a DataFrameMapper class that makes working with pandas DataFrames easier, as it allows the encoding of variables to be changed in fewer lines of code. With from IPython.core.interactiveshell ... I change a Jupyter Notebook default so that more than one output is displayed per cell. A nice blog post about other handy Jupyter tricks is here. Finally, we are going to put some code into a file called utils.py, which we will place next to the notebook.
df = pd.read_csv("../data/titanic_train.csv")
For our analysis, we will be using the training portion of the Titanic Dataset.
df.shape
df.head()
Now that we have loaded the data, we are going to strip all personally identifiable information. The columns ["PassengerId", "Name"] contain such information. Notice that ["PassengerId", "Name"] are unique for every row, so if we built a machine learning model, we would drop them later on anyway. A similar argument can be made about ["Ticket", "Cabin"], which are almost unique for every row.
For the purposes of this demonstration, we will not deal with missing values. We simply disregard all observations that contain missing values.
df.drop(columns=["PassengerId", "Name"], inplace=True) # dropped because unique for every row
df.drop(columns=["Ticket", "Cabin"], inplace=True) # dropped because almost unique for every row
df.dropna(inplace=True)
The result looks like this.
df.shape
df.head()
Next, to strip even more information, and as a preprocessing step for later, we are going to encode Sex and Embarked with numeric values: Sex will be coded as 0,1 and Embarked as 0,1,2. The class LabelEncoder() does most of the work for us.
encoders = [(["Sex"], LabelEncoder()), (["Embarked"], LabelEncoder())]
mapper = DataFrameMapper(encoders, df_out=True)
new_cols = mapper.fit_transform(df.copy())
df = pd.concat([df.drop(columns=["Sex", "Embarked"]), new_cols], axis="columns")
DataFrameMapper comes from the sklearn_pandas package and accepts a list of tuples, where the first item of each tuple is a column name and the second item is a transformer. For our purposes we use LabelEncoder(), but any other transformer would be accepted by the interface as well (MinMaxScaler(), StandardScaler(), FunctionTransformer()).
In the last line, we join the encoded data with the rest of the data. Note that you can also write axis=1, but axis="columns" is way more readable and I encourage everybody to use the latter version.
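By the way, if you do not want to add sklearn_pandas as a dependency, a plain pandas version of the same encoding would look roughly like this (my own sketch, not part of the original workflow):
# A plain-pandas alternative (hypothetical, not the approach used in this post):
# .cat.codes assigns an integer to every category, sorted alphabetically,
# which matches what LabelEncoder does for these two string columns.
for col in ["Sex", "Embarked"]:
    df[col] = df[col].astype("category").cat.codes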
df.shape
df.head()
df.nunique()
Anonymizing by sampling from the same distribution
Above I printed the number of unique values per column. We will go ahead and assume that everything with fewer than 20 unique values is a nominal or categorical variable, and everything with 20 or more unique values is a continuous one. Let's put the nominal/categorical variables in one list and the other ones in another list.
categorical = []
continuous = []
for c in list(df):
    col = df[c]
    nunique = col.nunique()
    if nunique < 20:
        categorical.append(c)
    else:
        continuous.append(c)
for c in list(df): iterates over all columns. Instead of list(df), one could also write df.columns.tolist(). I am still torn over which one I like more.
Here is the core idea of this post: For every categorical variable, we will determine the frequencies of its unique values and then create a discrete probability distribution with the same frequency for each unique value. For every continuous variable, we will determine the best-fitting continuous distribution from a pre-defined list of distributions; how this is done is explained below. Once we have determined all the probability distributions (discrete and continuous), we sample from these distributions to create a new dataset.
Treatment of nominal/categorical variables
This is the simpler case and requires only 3 lines of code.
for c in categorical:
    counts = df[c].value_counts()
    np.random.choice(list(counts.index), p=(counts/len(df)).values, size=5)
First, we determine how often each unique value occurs in a variable. This gives us the empirical probability distribution. Then we pass this distribution to np.random.choice() to create a new random variable that follows the same distribution.
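As a quick sanity check (my own addition, not in the original post), you can compare the empirical frequencies with the frequencies of a larger sample, here using Pclass as an example:
# The frequencies of a large sample should be close to the empirical frequencies.
c = "Pclass"  # any of the categorical columns works as an example
counts = df[c].value_counts()
sample = np.random.choice(list(counts.index), p=(counts/len(df)).values, size=10000)
(counts/len(df)).sort_index()                                # original frequencies
pd.Series(sample).value_counts(normalize=True).sort_index()  # sampled frequencies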
Treatment of continuous variables
Luckily, there is a StackOverflow thread about a similar issue. The main solution is the following. For every continuous variable:
- Create a histogram using a pre-defined number of bins.
- Go through a list of continuous distributions and fit each one to that histogram. The fitting process also yields the parameters of the distribution.
- The distribution with the smallest error (the smallest residual sum of squares) between itself and the histogram is the one we will use to approximate the continuous variable.
The author of that solution put everything neatly into two functions. I created a third one and put everything in a file called utils.py, which we will use from our notebook.
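The exact contents of utils.py are not reproduced in this post, but based on the three steps above, a sketch of best_fit_distribution could look like the following. The list of candidate distributions is my own assumption; the real file follows the StackOverflow answer more closely and loops over many more scipy.stats distributions.
import numpy as np
import scipy.stats

def best_fit_distribution(data, bins=50):
    """Return the name and fitted parameters of the candidate distribution whose
    PDF has the smallest sum of squared errors against the data's histogram."""
    # Density histogram of the observed data; x holds the bin centres.
    y, bin_edges = np.histogram(data, bins=bins, density=True)
    x = (bin_edges[:-1] + bin_edges[1:]) / 2.0

    # Candidate distributions (an illustrative subset, not an exhaustive list).
    candidates = ["norm", "lognorm", "expon", "gamma", "fisk", "halfcauchy"]

    best_name, best_params, best_sse = None, None, np.inf
    for name in candidates:
        dist = getattr(scipy.stats, name)
        try:
            params = dist.fit(data)          # shape parameters + loc + scale
            sse = np.sum((y - dist.pdf(x, *params)) ** 2)
        except Exception:
            continue                         # skip distributions that fail to fit
        if sse < best_sse:
            best_name, best_params, best_sse = name, params, sse
    return best_name, best_params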
best_distributions = []
for c in continuous:
    data = df[c]
    best_fit_name, best_fit_params = best_fit_distribution(data, 50)
    best_distributions.append((best_fit_name, best_fit_params))
# Result
best_distributions = [
    ('fisk', (11.744665309421649, -66.15529969956657, 94.73575225186589)),
    ('halfcauchy', (-5.537941926133496e-09, 17.86796415175786))]
The best distribution for Age is fisk and the best distribution for Fare is halfcauchy. Let's look at that result.
plot_result(df, continuous, best_distributions)
Not too bad, I'd say.
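plot_result also lives in utils.py and is not shown in this post. My assumption is that it overlays the fitted density on a histogram of each continuous variable; a minimal sketch could look like this:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

def plot_result(df, continuous_cols, best_distributions, bins=50):
    """Overlay each fitted density on a histogram of the corresponding column."""
    fig, axes = plt.subplots(1, len(continuous_cols), figsize=(6 * len(continuous_cols), 4))
    for ax, col, (name, params) in zip(np.atleast_1d(axes), continuous_cols, best_distributions):
        data = df[col]
        ax.hist(data, bins=bins, density=True, alpha=0.5, label="data")
        x = np.linspace(data.min(), data.max(), 200)
        dist = getattr(scipy.stats, name)
        ax.plot(x, dist.pdf(x, *params), label=name)
        ax.set_title(col)
        ax.legend()
    plt.show()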
Putting the code into one function
def generate_like_df(df, categorical_cols, continuous_cols, best_distributions, n, seed=0):
    np.random.seed(seed)
    d = {}
    # Categorical columns: sample from the empirical frequencies.
    for c in categorical_cols:
        counts = df[c].value_counts()
        d[c] = np.random.choice(list(counts.index), p=(counts/len(df)).values, size=n)
    # Continuous columns: sample from the best-fitting scipy distribution.
    for c, bd in zip(continuous_cols, best_distributions):
        dist = getattr(scipy.stats, bd[0])
        d[c] = dist.rvs(*bd[1], size=n)
    return pd.DataFrame(d, columns=categorical_cols+continuous_cols)
Now we have a function that we can use to create, say, 100 new observations.
gendf = generate_like_df(df, categorical, continuous, best_distributions, n=100)
gendf.shape
gendf.head()
As a post-processing step, one could also round the continuous variables. I chose not to do that. What I did do, however, was delete all the column names, as they might also leak some information about the dataset. I simply replaced them with 0,1,2,... .
gendf.columns = list(range(gendf.shape[1]))
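For completeness, if you did want to round the continuous variables, one way to do it (my own sketch; after the renaming above, the continuous variables are simply the last columns) would be:
# Optional post-processing, not done in this post: round the continuous columns.
cont_cols = gendf.columns[-len(continuous):]
gendf[cont_cols] = gendf[cont_cols].round()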
Finally, everything is saved to disk.
gendf.to_csv("output.csv", index_label="id")
Final Remarks
One drawback of this approach is that all the interactions between the variables are lost. For example, let's say that in the original dataset, women had a higher chance of surviving (Survived=1) than men. In the generated dataset, this relationship no longer exists. Any other relationships between the variables that might have existed are lost as well.
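A quick way to see this for yourself (my own check, not part of the original post) is to compare the survival rate per sex in the original data and in the generated data:
# Generate a fresh sample, keeping the original column names for readability.
check = generate_like_df(df, categorical, continuous, best_distributions, n=1000)
# Original data: the survival rate differs strongly between the sexes.
df.groupby("Sex")["Survived"].mean()
# Generated data: roughly the same rate for both sexes, because Sex and
# Survived were sampled independently of each other.
check.groupby("Sex")["Survived"].mean()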
I hope you find this blog post helpful and would like to hear your thoughts and comments. All the code shown here is also available on GitHub.