cflearn - a minimal Automatic Machine Learning (AutoML) solution for tabular datasets based on PyTorch


Introduction video

I've been working on tabular datasets for the past few years, and managed to build a rough AutoML system that beat the 'auto sklearn' solution to some extent. After I met PyTorch, I was deeply attracted by its simplicity and power, but I failed to find a satisfying solution for tabular datasets which was 'carefree' enough. So I decided to take advantage of my knowledge and build one myself, and here comes carefree-learn, which aims to provide out-of-the-box tools to train neural networks on tabular datasets with PyTorch.

carefree-learn provides high level APIs for PyTorch to simplify the training on tabular datasets. It features:

  • A scikit-learn-like interface with much more 'carefree' usage. In fact, carefree-learn provides an end-to-end pipeline on tabular datasets, which AUTOMATICALLY deals with:
    • Detection of redundant feature columns which can be excluded (all SAME, all DIFFERENT, etc).
    • Detection of feature columns types (whether a feature column is string column / numerical column / categorical column).
    • Imputation of missing values.
    • Encoding of string columns and categorical columns (Embedding or One Hot Encoding).
    • Pre-processing of numerical columns (Normalize, Min Max, etc.).
    • And much more...
  • Can either fit / predict directly from numpy arrays, or fit / predict indirectly from files located on your machine.
  • Easy-to-use saving and loading. By default, everything will be wrapped into a zip file!
  • Distributed Training, which means hyper-parameter tuning can be very efficient in carefree-learn.
  • Supports many convenient deep learning features, including:
    • Early stopping.
    • Model persistence.
    • Learning rate schedulers.
    • And more...
  • Some 'translated' machine learning algorithms, including:
    • Trainable (Neural) Naive Bayes
    • Trainable (Neural) Decision Tree
  • Some brand new techniques which may boost vanilla Neural Network (NN) performance on tabular datasets.
  • Highly customizable for developers. Almost every piece of functionality / process is wrapped into its own module (a Python class), and each one can be replaced or enhanced either directly in the source code or from local code, with the help of some pre-defined registration functions provided by carefree-learn.
  • Full utilization of the WIP ecosystem cf*, such as:
    • carefree-toolkit: provides a lot of utility classes & functions which are 'stand alone' and can be leveraged in your own projects.
    • carefree-data: a lightweight tool to read -> convert -> process ANY tabular dataset. It also utilizes Cython to accelerate critical procedures.
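To make the first two 'automatic' steps in the list above (redundant column detection and column type detection) a bit more concrete, here is a rough stand-alone sketch. It is not carefree-learn's actual implementation: `analyze_column`, the `max_categories` threshold and the rule that an empty string marks a missing value are all assumptions made for illustration.

```python
from typing import Sequence


def is_number(value: str) -> bool:
    """Return True if the raw string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def analyze_column(values: Sequence[str], max_categories: int = 10) -> str:
    """Classify one raw column (hypothetical helper, not a carefree-learn API).

    Returns one of: "redundant", "string", "categorical", "numerical".
    """
    # Assumption: an empty string marks a missing value
    present = [v for v in values if v != ""]
    unique = set(present)
    if len(unique) <= 1:
        return "redundant"  # all SAME: carries no information
    if not all(is_number(v) for v in present):
        if len(unique) == len(present):
            return "redundant"  # all DIFFERENT strings: effectively an ID column
        return "string"
    # Numeric columns with only a few distinct integer levels are treated as categories
    if len(unique) <= max_categories and all(float(v).is_integer() for v in unique):
        return "categorical"
    return "numerical"


columns = {
    "constant": ["1", "1", "1", "1"],
    "user_id": ["a1", "b2", "c3", "d4"],
    "color": ["red", "blue", "red", ""],
    "grade": ["0", "1", "2", "1"],
    "height": ["1.72", "1.80", "", "1.65"],
}
kinds = {name: analyze_column(col) for name, col in columns.items()}
print(kinds)
```

Columns flagged as "redundant" would be dropped, and the remaining labels decide which encoding or pre-processing each column gets.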

To try carefree-learn, you can install it with pip install carefree-learn.


I structured the carefree-learn backend in three modules: Model, Pipeline and Wrapper:

  • Model: In carefree-learn, a Model should implement the core algorithms.
    • It assumes that the input data in training process is already 'batched, processed, nice and clean', but not yet 'encoded'.
      • Fortunately, carefree-learn pre-defines some useful methods which can encode categorical columns easily.
    • It does not care about how to train a model; it only focuses on how to make predictions from the input and how to calculate losses from them.
  • Pipeline: In carefree-learn, a Pipeline should implement the high-level parts, as listed below:
    • It assumes that the input data is already 'processed, nice and clean', but it should take care of getting input data into batches, because in real applications batching is essential for performance.
    • It should take care of the training loop, which includes updating parameters with an optimizer, reporting metrics, checkpointing, early stopping, logging, etc.
  • Wrapper: In carefree-learn, a Wrapper should implement the preparation and API part.
    • It should not make any assumptions about the input data: it could already be 'nice and clean', but it could also be 'dirty and messy'. Therefore, it needs to transform the original data into 'nice and clean' data and then feed it to the Pipeline. The data transformations include:
      • Imputation of missing values.
      • Transforming string columns into categorical columns.
      • Processing numerical columns.
      • Processing label column (if needed).
    • It should implement some algorithm-agnostic methods (e.g. predict, save, load, etc.).
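The division of labour between the three modules can be sketched in plain Python. This is only a toy under stated assumptions, not the real carefree-learn classes: the model is a one-feature linear regressor with hand-written gradients instead of a PyTorch network, missing values are marked with `None`, and all method names besides `fit` and `predict` are made up for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple


class Model:
    """Core algorithm only: predictions and loss for a clean batch."""

    def __init__(self) -> None:
        self.w, self.b = 0.0, 0.0

    def forward(self, xs: Sequence[float]) -> List[float]:
        return [self.w * x + self.b for x in xs]

    def loss_and_grads(
        self, xs: Sequence[float], ys: Sequence[float]
    ) -> Tuple[float, float, float]:
        # Mean squared error and its gradients w.r.t. w and b
        errors = [p - y for p, y in zip(self.forward(xs), ys)]
        n = len(xs)
        loss = sum(e * e for e in errors) / n
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        return loss, grad_w, grad_b


class Pipeline:
    """Batching and the training loop; assumes the data is already clean."""

    def __init__(self, model: Model, lr: float = 0.1,
                 batch_size: int = 2, epochs: int = 200) -> None:
        self.model, self.lr = model, lr
        self.batch_size, self.epochs = batch_size, epochs

    def train(self, xs: Sequence[float], ys: Sequence[float]) -> None:
        rng = random.Random(0)
        indices = list(range(len(xs)))
        for _ in range(self.epochs):
            rng.shuffle(indices)
            for start in range(0, len(indices), self.batch_size):
                batch = indices[start:start + self.batch_size]
                bx = [xs[i] for i in batch]
                by = [ys[i] for i in batch]
                _, grad_w, grad_b = self.model.loss_and_grads(bx, by)
                # Plain SGD update; a real pipeline would delegate to an optimizer
                self.model.w -= self.lr * grad_w
                self.model.b -= self.lr * grad_b


class Wrapper:
    """Cleans messy input, then exposes the algorithm-agnostic API."""

    def __init__(self) -> None:
        self.model = Model()
        self.pipeline = Pipeline(self.model)
        self.mean, self.std = 0.0, 1.0

    def _clean(self, xs: Sequence[Optional[float]], fit: bool) -> List[float]:
        if fit:
            present = [x for x in xs if x is not None]
            self.mean = sum(present) / len(present)
            var = sum((x - self.mean) ** 2 for x in present) / len(present)
            self.std = var ** 0.5 or 1.0
        # Impute missing values with the training mean, then normalize
        filled = [self.mean if x is None else x for x in xs]
        return [(x - self.mean) / self.std for x in filled]

    def fit(self, xs: Sequence[Optional[float]], ys: Sequence[float]) -> "Wrapper":
        self.pipeline.train(self._clean(xs, fit=True), ys)
        return self

    def predict(self, xs: Sequence[Optional[float]]) -> List[float]:
        return self.model.forward(self._clean(xs, fit=False))


# 'Dirty' input: one feature value is missing; the targets follow y = 2x + 1
m = Wrapper().fit([0.0, 1.0, None, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
preds = m.predict([2.0, 5.0])
print(preds)
```

Only Wrapper ever sees the `None`, only Pipeline knows about batches and learning rates, and Model stays a pure 'inputs in, predictions and loss out' component.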

It is worth mentioning that carefree-learn uses registrations to manage the code structure.
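For readers who have not met the registration pattern before, here is a minimal sketch of the general idea. The names `model_dict`, `ModelBase.register` and `make` are hypothetical and chosen for illustration; they are not taken from the carefree-learn source.

```python
from typing import Callable, Dict, Type

# Global registry: maps a short name to the class that implements it
model_dict: Dict[str, Type["ModelBase"]] = {}


class ModelBase:
    def predict(self, x: float) -> float:
        raise NotImplementedError

    @classmethod
    def register(cls, name: str) -> Callable[[Type["ModelBase"]], Type["ModelBase"]]:
        """Class decorator that stores the decorated class under `name`."""

        def decorator(model_cls: Type["ModelBase"]) -> Type["ModelBase"]:
            if name in model_dict:
                raise ValueError(f"'{name}' is already registered")
            model_dict[name] = model_cls
            return model_cls

        return decorator


# Local code can plug in a new model without touching the library source
@ModelBase.register("double")
class Double(ModelBase):
    def predict(self, x: float) -> float:
        return 2 * x


def make(name: str) -> ModelBase:
    """Look the model up by name and instantiate it."""
    return model_dict[name]()


doubled = make("double").predict(21)
print(doubled)
```

The benefit is that callers only ever deal with string names, so swapping or extending a component does not require changing any call sites.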

Although the demand for working with tabular datasets is not that large, I'll be very happy if carefree-learn can help someone who needs it.


For detailed information, please visit the documentation.

Quick Start

import cflearn
from cfdata.tabular import TabularDataset

x, y = TabularDataset.iris().xy
m = cflearn.make().fit(x, y)
# Make label predictions
m.predict(x)
# Make probability predictions
m.predict_prob(x)
# Estimate performance
cflearn.estimate(x, y, wrappers=m)

""" Then you will see something like this:

|        metrics         |                       acc                        |                       auc                        |
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
|          fcnn          |    0.946667    |    0.000000    |    0.946667    |    0.993200    |    0.000000    |    0.993200    |

"""


# `carefree-learn` models can be saved easily, into a zip file!
# For example, a `cflearn^_^` file will be created with this line of code:
cflearn.save(m)
# And loading `carefree-learn` models is easy too!
m = cflearn.load()
# You will see exactly the same result as above!
cflearn.estimate(x, y, wrappers=m)

# `carefree-learn` can also easily fit / predict / estimate directly on files!
# `delim` refers to 'delimiter', and `skip_first` refers to skipping first line or not.
# * Please refer to the documentation if you're interested in more details.
""" Suppose we have an 'xor.txt' file with the following contents:

0,0,0
0,1,1
1,0,1
1,1,0

"""

m = cflearn.make(delim=",", skip_first=False).fit("xor.txt", x_cv="xor.txt")
cflearn.estimate("xor.txt", wrappers=m)

""" Then you will see something like this:

|        metrics         |                       acc                        |                       auc                        |
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
|          fcnn          |    1.000000    |    0.000000    |    1.000000    |    1.000000    |    0.000000    |    1.000000    |

"""


# When we fit from files, we can predict on either files or lists:
print(m.predict([[0, 0]]))   # [[0]]
print(m.predict([[0, 1]]))   # [[1]]
print(m.predict("xor.txt"))  # [ [0] [1] [1] [0] ]
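To make the two file options concrete, here is a small stand-alone sketch of what `delim` and `skip_first` control when a delimited file is read. `read_tabular` is a hypothetical helper, not a carefree-learn or carefree-data function; it assumes that the label sits in the last column, and it reads an in-memory XOR table as a stand-in for 'xor.txt'.

```python
import csv
import io
from typing import List, Tuple


def read_tabular(
    text: str, delim: str = ",", skip_first: bool = False
) -> Tuple[List[List[float]], List[int]]:
    """Split delimited text into features and labels (label = last column)."""
    reader = csv.reader(io.StringIO(text), delimiter=delim)
    rows = [row for row in reader if row]  # ignore empty lines
    if skip_first:
        rows = rows[1:]  # drop the header line
    features = [[float(v) for v in row[:-1]] for row in rows]
    labels = [int(row[-1]) for row in rows]
    return features, labels


# No header line, comma separated: the same options as in the example above
xor_text = "0,0,0\n0,1,1\n1,0,1\n1,1,0\n"
x, y = read_tabular(xor_text, delim=",", skip_first=False)
print(x, y)

# A file with a header line and another delimiter needs both options changed
x2, y2 = read_tabular("a;b;label\n1;0;1\n", delim=";", skip_first=True)
print(x2, y2)
```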


In carefree-learn, Distributed Training doesn't mean training your model on multiple GPUs or multiple machines, because carefree-learn focuses on tabular datasets (or, structured datasets) which are often not as large as unstructured datasets. Instead, Distributed Training in carefree-learn means training multiple models at the same time. This is important because:

  • Deep Learning models suffer from randomness, so we need to train multiple models with the same algorithm and calculate the mean / std of the performances to estimate the algorithm's capacity and stability.
  • Ensembling these models (which are trained with the same algorithm) can boost the algorithm's performance without making any changes to the algorithm itself.
  • Parameter searching will be easier & faster.

import cflearn
from cfdata.tabular import TabularDataset

# It is necessary to wrap the code under '__main__' on the Windows platform when running distributed code
if __name__ == '__main__':
    x, y = TabularDataset.iris().xy
    # Notice that 3 fcnn were trained simultaneously with this line of code
    _, patterns = cflearn.repeat_with(x, y, num_repeat=3, num_parallel=3)
    # And it is fairly straightforward to apply stacking ensemble
    ensemble = cflearn.ensemble(patterns)
    patterns_dict = {"fcnn_3": patterns, "fcnn_3_ensemble": ensemble}
    cflearn.estimate(x, y, metrics=["acc", "auc"], other_patterns=patterns_dict)

""" Then you will see something like this:

|        metrics         |                       acc                        |                       auc                        |
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
|         fcnn_3         |    0.937778    |    0.017498    |    0.920280    | -- 0.993911 -- |    0.000274    |    0.993637    |
|    fcnn_3_ensemble     | -- 0.953333 -- | -- 0.000000 -- | -- 0.953333 -- |    0.993867    | -- 0.000000 -- | -- 0.993867 -- |

"""


You might notice that the best result in each column is 'highlighted' with a pair of '--'.
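Judging from the numbers in the table, the 'score' column appears to be the mean minus the (population) standard deviation over the repeated runs. The sketch below checks this on three accuracies that were picked to be consistent with the fcnn_3 row (they are an inference, not logged values), and then mimics 'train several models at once, then ensemble them' on toy data. Everything here is an assumption made for illustration: `train_once` is a fake training run, threads stand in for whatever parallelism carefree-learn really uses, and the ensemble is a plain probability average rather than the stacking performed by cflearn.ensemble.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, population std, mean - std) for a 'higher is better' metric."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    return mean, std, mean - std


# Three accuracies consistent with the fcnn_3 row (iris has 150 samples)
mean, std, score = summarize([143 / 150, 142 / 150, 137 / 150])
print(f"{mean:.6f} {std:.6f} {score:.6f}")


def train_once(seed: int) -> List[float]:
    """Hypothetical stand-in for one training run: noisy class-1 probabilities."""
    rng = random.Random(seed)
    clean = [0.9, 0.2, 0.7, 0.4]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, 0.2))) for p in clean]


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


labels = [1, 0, 1, 0]
# Run the three 'trainings' at the same time, one seed per worker
with ThreadPoolExecutor(max_workers=3) as pool:
    runs = list(pool.map(train_once, range(3)))

# Statistics over the individual runs
per_run = summarize([accuracy(r, labels) for r in runs])
# Ensemble: average the probabilities across runs, then score the average
averaged = [statistics.mean(col) for col in zip(*runs)]
ensemble_acc = accuracy(averaged, labels)
print(per_run, ensemble_acc)
```

Subtracting the std rewards configurations that are both accurate and stable, which is why a row with a zero std (such as a single ensemble) gets a score equal to its mean.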

Hyper Parameter Optimization (HPO)

import cflearn
from cfdata.tabular import *

if __name__ == '__main__':
    x, y = TabularDataset.iris().xy
    # Bayesian Optimization (BO) will be used by default
    hpo = cflearn.tune_with(
        x, y,
        num_repeat=2, num_parallel=0, num_search=10
    )
    # We can further train our model with the best hyper-parameters we've obtained:
    m = cflearn.make(**hpo.best_param).fit(x, y)
    cflearn.estimate(x, y, wrappers=m)

""" Then you will see something like this:

~~~  [ info ] Results
|        metrics         |                       acc                        |                       auc                        |
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
|        0659e09f        |    0.943333    |    0.016667    |    0.926667    |    0.995500    |    0.001967    |    0.993533    |
|        08a0a030        |    0.796667    |    0.130000    |    0.666667    |    0.969333    |    0.012000    |    0.957333    |
|        1962285c        |    0.950000    |    0.003333    |    0.946667    |    0.997467    |    0.000533    |    0.996933    |
|        1eb7f2a0        |    0.933333    |    0.020000    |    0.913333    |    0.994833    |    0.003033    |    0.991800    |
|        4ed5bb3b        |    0.973333    |    0.013333    |    0.960000    |    0.998733    |    0.000467    |    0.998267    |
|        5a652f3c        |    0.953333    | -- 0.000000 -- |    0.953333    |    0.997400    |    0.000133    |    0.997267    |
|        82c35e77        |    0.940000    |    0.020000    |    0.920000    |    0.995467    |    0.002133    |    0.993333    |
|        a9ef52d0        | -- 0.986667 -- |    0.006667    | -- 0.980000 -- | -- 0.999200 -- | -- 0.000000 -- | -- 0.999200 -- |
|        ba2e179a        |    0.946667    |    0.026667    |    0.920000    |    0.995633    |    0.001900    |    0.993733    |
|        ec8c0837        |    0.973333    | -- 0.000000 -- |    0.973333    |    0.998867    |    0.000067    |    0.998800    |

~~~  [ info ] Best Parameters
acc  (a9ef52d0) (0.986667 ± 0.006667)
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}
auc  (a9ef52d0) (0.999200 ± 0.000000)
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}
best (a9ef52d0)
{'optimizer': 'rmsprop', 'optimizer_config': {'lr': 0.005810863965757382}}

~~  [ info ] Results
|        metrics         |                       acc                        |                       auc                        |
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
|          fcnn          |    0.980000    |    0.000000    |    0.980000    |    0.998867    |    0.000000    |    0.998867    |

"""


You might notice that:

  • The final results obtained by HPO are even better than the stacking ensemble results mentioned above.
  • We search for optimizer and lr by default. In fact, we can manually pass params into cflearn.tune_with. If we don't, carefree-learn will execute the following code:
from import *

params = {
    "optimizer": String(Choice(values=["sgd", "rmsprop", "adam"])),
    "optimizer_config": {
        "lr": Float(Exponential(1e-5, 0.1))
    }
}

It is also worth mentioning that we can pass file datasets into cflearn.tune_with as well. See tests/usages/ for more details.
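To give a feel for what searching over optimizer and lr means, here is a tiny stand-alone search over the same two hyper-parameters. It is only a sketch under several assumptions: it uses plain random search instead of Bayesian Optimization, it assumes that Exponential(1e-5, 0.1) means sampling lr uniformly on a log scale, and it scores candidates with a made-up `toy_objective` instead of training anything. `sample_params`, `toy_objective` and `random_search` are hypothetical names, not carefree-learn functions.

```python
import math
import random
from typing import Any, Callable, Dict, Tuple

Params = Dict[str, Any]


def sample_params(rng: random.Random) -> Params:
    """Draw one configuration: a categorical optimizer and a log-uniform lr."""
    log_lr = rng.uniform(math.log(1e-5), math.log(0.1))
    return {
        "optimizer": rng.choice(["sgd", "rmsprop", "adam"]),
        "optimizer_config": {"lr": math.exp(log_lr)},
    }


def toy_objective(params: Params) -> float:
    """Made-up score that peaks near lr = 5e-3; stands in for repeated training."""
    lr = params["optimizer_config"]["lr"]
    bonus = 0.01 if params["optimizer"] == "rmsprop" else 0.0
    return 1.0 - 0.05 * abs(math.log10(lr) - math.log10(5e-3)) + bonus


def random_search(
    objective: Callable[[Params], float], num_search: int, seed: int = 0
) -> Tuple[float, Params]:
    """Evaluate `num_search` random configurations and keep the best one."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(num_search)]
    scored = [(objective(p), p) for p in candidates]
    return max(scored, key=lambda pair: pair[0])


best_score, best_param = random_search(toy_objective, num_search=10)
print(best_score, best_param)
```

The returned dictionary has the same shape as the 'Best Parameters' printed above, so it could be unpacked into a model factory in the same way as hpo.best_param.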

What's next

The next step is to run some benchmarks and optimize carefree-learn's performance. I'm pretty sure it can reach a satisfying level with some tuned default settings.

And, as always, bug fixing XD
