Victor Isaac Oshimua


One-Hot Encoding with DictVectorizer

Introduction

Datasets used for training machine learning models usually contain feature columns of various data types. One of these types is the categorical feature, which holds non-numerical values.
Examples include:

  • Name (with values like "Kelvin", "Jonathan")
  • Gender (with values like "male", "female")
  • Country (with values like "Nigeria", "USA")

In many cases, categorical features are represented as strings, and most machine learning algorithms cannot process strings unless we convert them to numerical values.

There are various methods for dealing with categorical variables, one of which is one-hot encoding.
This article shows how to implement one-hot encoding with DictVectorizer.

Table of contents

  1. Prerequisites
  2. What is One-Hot Encoding?
  3. What is DictVectorizer?
  4. Implementing One-Hot encoding
  5. Conclusion

Prerequisites

  • Basic understanding of Python
  • Basic understanding of data science libraries, e.g., pandas, NumPy, and scikit-learn
  • A Jupyter notebook to try the code yourself
  • Basic understanding of Machine learning

What is One-Hot Encoding?

One-hot encoding is a method used for converting categorical variables to numerical values.

One-hot encoding assigns binary features to unique categorical values. If a value is present in an observation, its corresponding feature is set to 1; otherwise, it is set to 0.
For example:

One hot encoding image by author

In the above diagram, the original data has a column called "Country" that contains the following values: "NIGERIA", "USA", "JAPAN", and "TOGO".

One-hot encoding created four new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not.
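To make the idea concrete, here is a minimal hand-rolled sketch in plain Python that uses the four country values from the diagram. The variable names are made up for illustration, and the sorted column order is an assumption of this sketch, not something the diagram prescribes.

```python
# Hand-rolled one-hot encoding, for illustration only.
countries = ["NIGERIA", "USA", "JAPAN", "TOGO"]

# One output column per unique category (sorted here so the order is predictable).
categories = sorted(set(countries))

# Each row gets a 1 in the column matching its value and a 0 everywhere else.
encoded = [
    [1 if value == category else 0 for category in categories]
    for value in countries
]

print(categories)
for value, row in zip(countries, encoded):
    print(value, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.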

What is DictVectorizer?

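As its name implies, DictVectorizer is a scikit-learn class that transforms lists of feature-value mappings (Python dict objects) into vectors. When a feature value is a string, DictVectorizer one-hot encodes it by creating one binary feature per "feature=value" pair; numerical values are passed through unchanged.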

Implementing one-hot encoding

Now that you have a grasp of one-hot encoding and DictVectorizer, let's put them into practice.

To implement one-hot encoding, we need tabular data with categorical features, so this guide uses a Kaggle dataset (drug200.csv). Follow this link to download the dataset used in this guide.

The following steps show how to use DictVectorizer to implement one-hot encoding.

1. Read and process the data



# Import libraries
import pandas as pd
import numpy as np

# Read the data
data = pd.read_csv("drug200.csv")

# Select the categorical columns
columns = ["Sex", "BP", "Cholesterol"]

# Select the first 11 rows (positions 0 through 10; the end of an .iloc slice is exclusive)
categorical = data[columns].iloc[:11]
categorical


  • The above code imports the data science libraries needed to read and process the data.

  • The code also reads the CSV (comma-separated values) data into a pandas DataFrame and selects the columns with categorical features.

Here is the output of the code

Image description

2. Convert categorical features to a list of dictionaries



categorical_dict = categorical.to_dict(orient="records")
categorical_dict



Here is the output of the code.

Image description
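If you do not have the dataset at hand, the conversion can be sketched on a tiny stand-in DataFrame. The two rows below are made-up values for illustration, not rows copied from drug200.csv.

```python
import pandas as pd

# Hypothetical two-row stand-in for the categorical columns.
sample = pd.DataFrame({
    "Sex": ["F", "M"],
    "BP": ["HIGH", "LOW"],
    "Cholesterol": ["HIGH", "NORMAL"],
})

# orient="records" produces one dict per row, keyed by column name.
sample_records = sample.to_dict(orient="records")
print(sample_records)
```

Each row becomes its own dictionary, which is the input shape DictVectorizer expects.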

3. Create an instance of the DictVectorizer class



from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)


Here's a breakdown of what the code does:

  • from sklearn.feature_extraction import DictVectorizer:
    This line imports the DictVectorizer class from the sklearn.feature_extraction module.

  • dv = DictVectorizer(sparse=False):
    This line creates an instance of the DictVectorizer class and assigns it to the variable dv.

  • The sparse=False argument is passed to the DictVectorizer constructor, indicating that the result should be a dense NumPy array rather than a SciPy sparse matrix, which is the default.
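The difference between the two settings can be sketched as follows. The three records are made up for illustration, and the example assumes NumPy and SciPy are installed alongside scikit-learn.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records with a single categorical feature.
records = [{"BP": "HIGH"}, {"BP": "LOW"}, {"BP": "NORMAL"}]

# Default behaviour (sparse=True): the result is a SciPy sparse matrix.
sparse_result = DictVectorizer().fit_transform(records)

# With sparse=False: the result is a dense NumPy array.
dense_result = DictVectorizer(sparse=False).fit_transform(records)

print(sp.issparse(sparse_result))            # is it a SciPy sparse matrix?
print(isinstance(dense_result, np.ndarray))  # is it a NumPy array?
print(sparse_result.toarray())               # same values as dense_result
```

A dense array is easier to inspect in a notebook, while the sparse form tends to use less memory when there are many categories, since most entries of a one-hot matrix are zero.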

4. Fit and transform the list of dictionaries



# Learn the feature names from the records
dv.fit(categorical_dict)
# Encode the records as a numeric array
transformed_data = dv.transform(categorical_dict)
transformed_data



Here is the result of the above code.

Image description

We have successfully one-hot encoded the categorical data with DictVectorizer.

To explore further, let's check how the categorical data are represented in the transformed data.



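# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# use get_feature_names_out() on recent versions.
dv.get_feature_names_out()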



Here is the output of the code

Image description

Furthermore, here is a diagram of how the feature names map to the columns of the transformed data.

Image description

Conclusion

DictVectorizer is one way of performing one-hot encoding on categorical data. It takes a list of dictionaries and transforms it into a NumPy array (or a SciPy sparse matrix by default).

DictVectorizer is easy to use, and because a fitted instance can encode new records with the same columns, it can make machine learning model deployment simpler.

I recommend using DictVectorizer to transform your categorical data into numerical representations in your future machine learning projects.
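One possible reading of the deployment point is that the fitted vectorizer can be saved alongside the model and reused on incoming records. The sketch below assumes Python's standard pickle module suits your setup and uses made-up records; a real service would usually write the bytes to a file and load them at startup.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer

# Fit on hypothetical training records.
train_records = [
    {"Sex": "F", "BP": "HIGH"},
    {"Sex": "M", "BP": "LOW"},
]
dv = DictVectorizer(sparse=False)
dv.fit(train_records)

# Serialize the fitted vectorizer to bytes, then restore it.
saved_bytes = pickle.dumps(dv)
restored_dv = pickle.loads(saved_bytes)

# The restored vectorizer encodes an incoming record with the same column layout.
incoming = [{"Sex": "M", "BP": "HIGH"}]
print(restored_dv.transform(incoming))
```

Because incoming data arrives naturally as dictionaries (for example, parsed JSON), no extra DataFrame handling is needed before encoding.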

Thanks for reading this article. If you have any further questions or would like to connect, feel free to reach out to me on Twitter and on LinkedIn. I appreciate your engagement and look forward to staying connected.
