DEV Community

myrabel

Feature Engineering: The Ultimate Guide

What Is Feature Engineering
This is the process of selecting, manipulating and transforming raw data into features that can be used in supervised learning. A feature is a measurable input that can be used in a predictive model. Features can be numerical, categorical or text based, since they represent different aspects of the data that are relevant to the problem being solved.


Benefits:

  • Models become more efficient
  • Algorithms fit the data and detect patterns more easily
  • Features become more flexible and reusable

What is the need for feature engineering?
Feature engineering helps to:

  1. Improve user experience – the primary goal is to make a product or service more effective and efficient, increasing customer satisfaction.
  2. Give a competitive advantage – offering unique and innovative features helps differentiate products in the market.
  3. Meet customer needs – identifying areas where new features can enhance product value and meet customer needs.
  4. Increase revenue – developing new features that provide additional functionality can lead to greater uptake by customers.
  5. Future proofing – anticipating future trends and potential customer needs helps develop features that keep a product relevant.

Feature engineering consists of the following processes:

  • Feature Creation This is the process of generating new features based on domain knowledge. It can reveal hidden patterns and relationships that were not initially apparent.

Methods used include:
Aggregation – combining multiple data points to create a more holistic view. Standard functions include count, sum, average, minimum, maximum, percentile, standard deviation and coefficient of variation.

Differences and Ratios – these are effective methods of representing changes in numeric features for purposes of anomaly detection and prediction.
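As a minimal sketch of both methods, assuming a hypothetical pandas DataFrame of customer transactions (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical transaction data.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [100.0, 250.0, 50.0, 400.0, 300.0],
})

# Aggregation: summarize each customer's transactions into one row.
agg = df.groupby("customer_id")["amount"].agg(
    txn_count="count", txn_sum="sum", txn_mean="mean", txn_max="max"
).reset_index()

# Differences: change between a customer's consecutive transactions.
df["amount_diff"] = df.groupby("customer_id")["amount"].diff()

# Ratios: each transaction relative to that customer's average.
df["amount_ratio"] = df["amount"] / df.groupby("customer_id")["amount"].transform("mean")
```

The aggregated frame gives one holistic row per customer, while the difference and ratio columns expose unusually large jumps that an anomaly detector could flag.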

  • Feature extraction The process of creating new features from existing ones to provide more information to the model.

Dimensionality reduction, for example, reduces the number of features while preserving the most important information.
Types: dimensionality reduction, feature combination, feature aggregation, feature transformation
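A short sketch of dimensionality reduction using PCA from scikit-learn, on synthetic data built so that five correlated columns really carry only two dimensions of information:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 columns that are linear combinations of 2 underlying signals.
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # shape (100, 5)

# Extract 2 new features that preserve almost all of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

Because the five columns are redundant by construction, the two principal components retain essentially all of the variance while shrinking the feature count.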

  • Feature selection
    The process of selecting a subset of relevant features from the dataset to be used in the model. Selecting the most relevant and informative features reduces the complexity of the model and improves its performance by eliminating irrelevant features.
    Types: filter method, wrapper method, embedded method
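As an illustration of the filter method, a sketch using scikit-learn's `SelectKBest` with an ANOVA F-test to keep the two most informative columns of the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Filter method: score each feature against the target, keep the top 2.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
```

Wrapper methods would instead train a model on candidate subsets, and embedded methods (e.g. L1 regularization) select features during training itself.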

  • Feature scaling
    The process of transforming the features so that they have a similar scale. This prevents a single feature from dominating the analysis.

Normalization is the process of scaling the data values so that the value of each feature lies between 0 and 1. This method works well when the data does not follow a Gaussian distribution, though it is sensitive to outliers.

Standardization is the process of scaling the data values so that they gain the properties of a standard normal distribution: the transformed features have a mean of 0 and a standard deviation of 1.
Types: min-max scaling, standard scaling, robust scaling
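Both transforms are one-liners in scikit-learn; a sketch on a single toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])

# Normalization (min-max scaling): values end up in [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (standard scaling): mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)
```

Robust scaling (`RobustScaler`) follows the same pattern but uses the median and interquartile range, which makes it less affected by outliers like the 50.0 above.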

Techniques

  • Cleaning and Imputation - the process of addressing missing values and inconsistencies in the data to ensure the information used to train a model is reliable and consistent.
    The two main types are numerical imputation and categorical imputation.
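A sketch of both imputation types with pandas, on a hypothetical frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25.0, np.nan, 40.0, 35.0],
    "city": ["Nairobi", "Mombasa", None, "Nairobi"],
})

# Numerical imputation: fill missing ages with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical imputation: fill missing cities with the mode (most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

The median is usually preferred over the mean for numerical imputation because it is robust to outliers; scikit-learn's `SimpleImputer` offers the same strategies in pipeline form.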

  • Feature scaling – the process of standardizing the range of numerical features so that they contribute equally to the model’s training process.

  • Encoding
    One hot encoding transforms categorical values into numerical values that can be used by models. Each category is transformed into a binary value indicating presence (1) or absence (0).
    Binning is a technique that transforms continuous variables into categorical variables. Ranges of values are divided into several “bins” and each assigned a categorical value.
    Example: age group bins for ages 18–80: 18–25 young adults, 26–35 middle-aged adults, 36–60 older adults, and 61–80 elderly.
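Both encodings are sketched below with pandas, reusing the age groups from the example above (the `color` column is a made-up categorical feature):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "age": [22, 30, 70]})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["color"])

# Binning: continuous ages become the categorical groups described above.
bins = [18, 25, 35, 60, 80]
labels = ["young adult", "middle-aged adult", "older adult", "elderly"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)
```

Note that `pd.cut` uses right-inclusive intervals by default, so the bin edges above produce the ranges (18, 25], (25, 35], (35, 60] and (60, 80].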
