Data Correlation Analysis

#programming #maxbox #data #datascience

This tutorial on data analysis with Python and Python4Delphi uses four powerful libraries: Pandas, Scikit-learn (sklearn), Seaborn and Matplotlib.

The second part puts a spotlight on overfitting a model. That short section shows how to spot overfitting through high variance.

Regarding the difference between sklearn and scikit-learn: the package is called “scikit-learn” and should be installed with pip install scikit-learn, but in your code it is imported with import sklearn. This is a bit confusing, because there is also a small "dummy" PyPI package named sklearn that used to install scikit-learn for you. That dummy package has since been deprecated, so pip install scikit-learn is the reliable choice.

Now, let’s talk about importing real data. In this tutorial, we’ll use the California housing dataset, with the median house value (MedHouseVal) as the target value. You can load it through the sklearn.datasets module. We fetch the dataset from Scikit-Learn and pack it into a DataFrame:

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

# Target column is under ch.target, rest is under ch.data
ch = fetch_california_housing(as_frame=True)
df = pd.DataFrame(data=ch.data, columns=ch.feature_names)
df['MedHouseVal'] = ch.target

df.head()


And the same with Python4Delphi_maXbox5:

execstr('import pandas as pd');
execstr('from sklearn.datasets import fetch_california_housing');
// Target column is under ch.target, the rest is under ch.data
execstr('ch = fetch_california_housing(as_frame=True)');
execstr('df = pd.DataFrame(data=ch.data, columns=ch.feature_names)');
execstr('df["MedHouseVal"] = ch.target; print(df.head())');

Checking for and quantifying correlation is one of the key steps during exploratory data analysis and when forming hypotheses, for example about the correlation between house value and house age.

“Covariance” indicates the direction of the linear relationship between variables. “Correlation” on the other hand measures both the strength and direction of the linear relationship between two variables.
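The difference between the two measures can be sketched with a few lines of NumPy. This is a minimal illustration on synthetic data: the variable names income and value and the generated numbers are assumptions made for the example, not values from the housing dataset. It shows that covariance changes when a variable is rescaled, while correlation stays the same.

```python
import numpy as np

# Synthetic, illustrative data (not the California housing data)
rng = np.random.default_rng(0)
income = rng.normal(4.0, 1.5, size=500)
value = 0.4 * income + rng.normal(0.0, 0.5, size=500)


def cov_and_corr(a, b):
    """Return the sample covariance and the Pearson correlation of a and b."""
    cov = np.cov(a, b)[0, 1]        # off-diagonal entry of the 2x2 covariance matrix
    corr = np.corrcoef(a, b)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    return cov, corr


cov_a, corr_a = cov_and_corr(income, value)
# Same data, but income expressed in a unit that is 1000 times smaller
cov_b, corr_b = cov_and_corr(income * 1000, value)

# Correlation is the covariance divided by both standard deviations
corr_from_cov = cov_a / (income.std(ddof=1) * value.std(ddof=1))

print("covariance :", cov_a, "->", cov_b)
print("correlation:", corr_a, "->", corr_b)
```

The covariance grows by the same factor as the rescaled variable, which is why its magnitude is hard to interpret on its own, whereas the correlation always lies between -1 and 1.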

Now, let’s move on to visualizing the data in a correlation matrix. Matplotlib and Seaborn are great libraries for creating charts and graphs. To get the correlations between all of the numerical features above, we simply call df.corr() (which defaults to the Pearson correlation). Since a table of numbers is not very intuitive, let’s plot it as a heat map with the help of Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

sns.heatmap(df.corr(), ax=ax, annot=True)

// Python4Delphi
execstr('fig, ax = plt.subplots(figsize=(10, 6))             '+LF+
        'plt.title("corr. matrix - california_housing mX5")  '+LF+   
        'sns.heatmap(df.corr(),ax=ax,annot=True); plt.show() ');

The script is available at: https://sourceforge.net/projects/maxbox5/files/examples/1382_data_science_stuff140_1_py_uc.txt/download

One of the fundamental activities in statistics is creating models that can summarize data using a small set of numbers, thus providing a compact description of the data. In this section we will discuss the concept of a statistical model and how it can be used to describe data.

Correlation Matrix of House Prices (MedHouseVal)

We see a strong positive correlation of 0.69 between house prices (MedHouseVal) and income (MedInc) across the 20640 records. Another correlation, of 0.33, exists between income and average rooms. By the way: the top correlation of 0.85 is between average rooms (AveRooms) and average bedrooms (AveBedrms), a typical case of two features that measure nearly the same thing!

We can also pick a single target variable and show which features correlate with it. We’ll calculate the correlations with df.corr() and then subset the resulting DataFrame to include only the target column:

//Get Correlation to Target Variable
     execstr('corr = df.corr()[["MedHouseVal"]]         '+LF+
             'plt.title("Correlation to Target - mX5")  '+LF+  
             'sns.heatmap(corr, annot=True);plt.show()  ');   
//Thanks to the idea from: https://stackabuse.com/bytes/calculate-correlation-of-dataframe-featurescolumns-with-pandas/

Correlation to Target
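As a small variation on the target-column idea, the correlations can also be sorted by strength before plotting or printing. The sketch below uses a made-up DataFrame called df_demo with hypothetical columns income, age and target, so it runs without downloading the housing data. With the real DataFrame the same pattern would apply to the MedHouseVal column.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; column names are assumptions for illustration only
rng = np.random.default_rng(1)
n = 1000
df_demo = pd.DataFrame({
    "income": rng.normal(4.0, 1.5, n),
    "age": rng.uniform(1.0, 52.0, n),
})
df_demo["target"] = (
    0.5 * df_demo["income"] - 0.01 * df_demo["age"] + rng.normal(0.0, 0.5, n)
)

# Correlation of every feature with the target, strongest (by absolute value) first
target_corr = (
    df_demo.corr()["target"]
    .drop("target")  # the target always correlates perfectly with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top, where they would otherwise end up at the bottom of the list.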

How do we calculate a correlation matrix?

The correlation matrix above shows the correlation coefficients between several variables related to house value. Each cell in the matrix shows the correlation between two specific variables (features):

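type DMatrix = array of array of double;

// data[i] holds all values of feature i;
// PearsonCorrelation is assumed to be defined elsewhere in the script
procedure CalculateCorrelationMatrix(data:DMatrix; var matrix:DMatrix);
var i,j,n: Integer;
begin
  n:= Length(data);
  SetLength(matrix, n);              // one row per feature
  for i:= 0 to n - 1 do begin
    SetLength(matrix[i], n);         // one column per feature
    for j:= 0 to n - 1 do 
      matrix[i][j]:= PearsonCorrelation(data[i], data[j]);
  end;
end;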

The funny thing is the type DMatrix: it’s a double array of course, but also an array of double :).
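For readers who want to see what happens inside each cell, here is a minimal pure-Python sketch of the same idea. The function names pearson_correlation and correlation_matrix and the small features list are assumptions made for this example; they mirror the Pascal procedure above but are not taken from the original script.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(data):
    """data[i] holds all values of feature i; returns an n x n matrix."""
    n = len(data)
    return [
        [pearson_correlation(data[i], data[j]) for j in range(n)]
        for i in range(n)
    ]


# Three tiny illustrative features
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],   # rising
    [2.0, 4.0, 6.0, 8.0, 10.0],  # rising twice as fast
    [5.0, 4.0, 3.0, 2.0, 1.0],   # falling
]
matrix = correlation_matrix(features)
for row in matrix:
    print(row)
```

The matrix is symmetric with ones on the diagonal, so a real implementation could compute only the upper triangle and mirror it.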

The second part uses a short simulation to visualize the problem of overfitting (an advanced concept). In statistics, a model is meant to provide a condensed description of data, much as a physical model does for a physical structure. Like physical models, a statistical model is generally much simpler than the data being described.

The basic structure of a statistical model is: data=model+error
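This decomposition can be made concrete with a straight-line fit. The sketch below is an illustration on synthetic data; the true slope 2.0, the intercept 1.0 and the noise level are arbitrary assumptions. It splits the data into the fitted model and the remaining error (the residuals).

```python
import numpy as np

# Synthetic data: a straight line plus random noise (illustrative values)
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 100)
data = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)

# Model: least-squares line through the data
slope, intercept = np.polyfit(x, data, deg=1)
model = slope * x + intercept

# Error: whatever the model does not explain
error = data - model

print("slope:", slope, "intercept:", intercept)
print("mean error:", error.mean())
```

For a least-squares line with an intercept, the errors average out to zero, so the model captures the systematic part and the error term is left with the scatter.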

The code snippet below trains a support vector regression (SVR) model, predicts the target values for the same data it was trained on, and then calculates and prints the R² score and Mean Squared Error (MSE) for the model. Feel free to adapt it to your specific dataset and model, for example a linear regression!

Understanding the fit() Method
The fit() method in Scikit-Learn is used to train a machine learning model. Training a model involves feeding it with data so it can learn the underlying patterns. This method adjusts the parameters of the model based on the provided or simulated data.

execstr('model = SVR(gamma="auto") #=scale              '+LF+ 
        'print(model)                                   '+LF+ 
        'model.fit(x,y)                                 '+LF+ 
        'pred_y = model.predict(x)                      '+LF+ 
        '                                               '+LF+ 
        'for yo, yp in zip(y[1:15,:], pred_y[1:15]):    '+LF+ 
        '  print(yo,yp)                                 ');   

execstr('x_ax=range(Datarange)                                       '+LF+
        'plt.scatter(x_ax, y, s=5,color="blue",label="original")     '+LF+
        'plt.plot(x_ax,pred_y, lw=1.5,color="red",label="predicted") '+LF+
        'plt.title("high variance - be overfitted mX5")              '+LF+
        'plt.legend()                                                '+LF+
        'plt.show()                                                  ');  

execstr('score=model.score(x,y)                   '+LF+
        'print("Score:",score)                    '+LF+
        '                                         '+LF+
        'mse =mean_squared_error(y, pred_y)       '+LF+
        'print("Mean Squared Error:",mse)         ');

Scikit-learn’s model.score(X, y) returns, for regression models, the coefficient of determination R². It is typically called as model.score(X_test, y_test). It doesn’t require the predicted values to be supplied externally; rather, it calculates y_predicted internally and uses it in the calculation.
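To check what score() does, the R² value can be recomputed by hand from the predictions. This is a minimal sketch on synthetic data with a LinearRegression model; the data and the variable names r2_manual and r2_from_score are assumptions for the example. The formula used is the standard definition, R² = 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative regression data
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=50)

model = LinearRegression().fit(X, y)
pred_y = model.predict(X)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - pred_y) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# score() predicts internally and returns the same quantity
r2_from_score = model.score(X, y)

print("manual R^2 :", r2_manual)
print("score() R^2:", r2_from_score)
```

A score of 1.0 means the predictions match the data exactly, and the score can become negative when the model is worse than simply predicting the mean.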

Overfitting a model

As we can see, the variance of the predictions is too high. In machine learning, overfitting occurs when an algorithm fits too closely or even exactly to its training data, resulting in a model that can’t make accurate predictions or conclusions from any data other than the training data.

Source: What is Overfitting? | IBM

Overfitting defeats the purpose of the machine learning model. Generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data.
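One way to make the lack of generalization visible is to hold back part of the data and compare the scores on both parts. The following is a sketch under stated assumptions: the noisy sine data is synthetic, and the SVR settings gamma=500, C=1000 and epsilon=0.01 are deliberately extreme values chosen for illustration, not the settings of the original script.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic data: a sine wave plus noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(0.0, 0.3, size=80)

# Every second point is used for training, the points in between for testing
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

# Very narrow kernel and weak regularization: the model memorizes each point
overfit = SVR(kernel="rbf", gamma=500.0, C=1000.0, epsilon=0.01)
overfit.fit(x_train, y_train)

train_r2 = overfit.score(x_train, y_train)
test_r2 = overfit.score(x_test, y_test)
train_mse = mean_squared_error(y_train, overfit.predict(x_train))
test_mse = mean_squared_error(y_test, overfit.predict(x_test))

print("train R^2:", train_r2, " train MSE:", train_mse)
print("test  R^2:", test_r2, " test  MSE:", test_mse)
```

A large gap between the training score and the test score is the typical signature of an overfitted model. A score computed only on the training data, as in model.score(x,y) above, cannot reveal it.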