Ziaeemehr
Linear regression, quick guide

This is a quick guide to a common data analysis method: a short description of what the method does, followed by some simple examples. I usually do this for the little things I learn every day, documenting them in a simple, clear way so they take less time to find when I need them later.

I follow Chapter 15, "Statistical Modeling", of Numerical Python by Robert Johansson.

Patsy

The patsy library provides features for defining statistical models with a simple formula language inspired by statistical software such as R. It is designed to be a companion library for statistical modeling packages such as statsmodels. For more information about the project and its documentation, see http://patsy.readthedocs.org.

As the simplest example, I am going to fit a line to some one-dimensional data, y = f(x). Knowing the function f(x) would let us compute the value of y for any value of x. If we do not know f(x), but we have access to observations {y_i, x_i}, we can parameterize f(x) and fit the parameter values to the data, typically by minimizing the sum of squared residuals between the model and the observations. An example of a parameterization of f(x) is the linear model f(x) = beta_0 + beta_1 x, where the coefficients beta_0 and beta_1 are the parameters of the model.

Importing the libraries

import patsy
import numpy as np
import pandas as pd
import matplotlib.pylab as pl
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg

Generating some data and plotting it.

x = np.arange(0, 10, 0.1)
y = 5 * np.random.rand(len(x)) + x
# plotting the data
pl.scatter(x, y, s=10)

Putting the data in a pandas DataFrame.

data = {"y": y, "x": x}
df_data = pd.DataFrame(data)

I directly pass the Patsy formula y ~ x to ordinary least squares (ols). Note that we leave out the coefficients in the model formula; it is implicitly assumed that each term in the formula has a model parameter as its coefficient.

model = smf.ols("y ~ x", df_data)
result = model.fit()
print(result.params)

# Intercept    2.230047
# x            1.020011  (the slope)
# dtype: float64
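
Beyond the point estimates, the fit result object can also print a full regression report (R-squared, standard errors, confidence intervals, p-values) via its summary method:

print(result.summary())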

Finally, I plot the fitted values.

pl.plot(x, result.fittedvalues, lw=2, c="k",
        label="y=ax+b, a=%.3f, b=%.3f" % (result.params["x"],
                                          result.params["Intercept"]))
pl.legend()
pl.xlabel("x")
pl.ylabel("y")
pl.show()
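
The fitted model can also be evaluated at new points with predict; the x values below are just for illustration:

x_new = pd.DataFrame({"x": [2.5, 7.5]})
print(result.predict(x_new))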

Instead of this, we could use the lower-level method of least-squares fitting, which gives us the same result.

X = np.vstack([np.ones(len(x)), x]).T
beta, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)
print(beta)
#[2.23004704 1.02001068]
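
As a cross-check, scipy.stats also provides linregress for the one-variable case; it should agree with the estimates above:

res = stats.linregress(x, y)
print(res.intercept, res.slope)
# ~2.230  ~1.020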

If x is a vector in y = beta_0 + beta_1 x (several explanatory variables), we have multiple linear regression, and if y is a vector, it is known as multivariate linear regression.
An example of multiple linear regression:

### From least-squares fitting
y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])

# design matrix: intercept, x1, x2, and the interaction x1 * x2
X = np.vstack([np.ones(5), x1, x2, x1 * x2]).T
# print(X)
beta, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)
# print(beta)
# [-5.55555556e-01  1.88888889e+00 -8.88888889e-01 -1.11022302e-15]
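
As a quick sanity check, the fitted values X @ beta should reproduce the observations up to numerical precision, since y here is an exact linear function of x1:

# fitted values; should match y up to rounding
print(X @ beta)
# approximately [1. 2. 3. 4. 5.]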


### DesignMatrix: build a dictionary that maps the variable
# names to the corresponding data arrays
data = {"y": y, "x1": x1, "x2": x2}
y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1 * x2", data)
# print(X)

### DesignMatrix from data frame
df_data = pd.DataFrame(data)
y, X = patsy.dmatrices("y ~ 1 + x1 + x2 + x1:x2", df_data, return_type="dataframe")
# print(X)
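
With return_type="dataframe", both y and X come back as regular pandas DataFrames, and the columns of X are the model terms:

print(X.columns.tolist())
# ['Intercept', 'x1', 'x2', 'x1:x2']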

Ordinary least squares (OLS)

We can use the OLS class from the statsmodels library instead of the lower-level method np.linalg.lstsq.

model = sm.OLS(y, X)
result = model.fit()
# print(result.params)
# Intercept   -5.555556e-01
# x1           1.888889e+00
# x2          -8.888889e-01
# x1:x2       -7.771561e-16
# dtype: float64
# print(result.params["x1:x2"])

Note that the x1:x2 coefficient is on the order of 1e-16, i.e. effectively zero. We can also pass the Patsy formula for the model directly when we create a model instance, which completely eliminates the need to first create the design matrices.

model = smf.ols("y ~ 1 + x1 + x2 + x1:x2", df_data)
result = model.fit()
# print(result.params)
# Intercept   -5.555556e-01
# x1           1.888889e+00
# x2          -8.888889e-01
# x1:x2       -7.771561e-16
# dtype: float64

As concrete examples of Patsy formulas, consider the following formulas and the resulting right-hand-side terms (which we can extract from the design_info attribute using term_names):

from collections import defaultdict
# empty arrays are enough for patsy to construct the design information
data = defaultdict(lambda: np.array([]))
print(patsy.dmatrices("y ~ a", data=data)[1].design_info.term_names)
# ['Intercept', 'a']
print(patsy.dmatrices("y ~ 1 + a + b", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b']
print(patsy.dmatrices("y ~ -1 + a + b", data=data)[1].design_info.term_names)
# ['a', 'b']
print(patsy.dmatrices("y ~ a * b", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b']
print(patsy.dmatrices("y ~ a * b * c", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c', 'a:b:c']
print(patsy.dmatrices("y ~ a * b * c - a:b:c", data=data)[1].design_info.term_names)
# ['Intercept', 'a', 'b', 'a:b', 'c', 'a:c', 'b:c']
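
Formulas can also embed Python functions and arithmetic expressions protected with I(). A small sketch, reusing the empty-array trick from above:

print(patsy.dmatrices("y ~ np.log(a) + I(a + b)", data=data)[1].design_info.term_names)
# ['Intercept', 'np.log(a)', 'I(a + b)']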

Top comments (3)

Eric Ahnell

This screams to me loud and clear: "You need to be using SciPy!" Thanks for sharing.

Ziaeemehr

You're welcome. Using SciPy in a simple example seems easier, but statsmodels and patsy show their power in more complex cases.

Eric Ahnell

As is so often the case, these libraries build upon each other... and I have a lot of learning to do!