Regression analysis is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It’s widely used in fields such as economics, finance, psychology, and biology.
Imagine there is a dataset containing information about the number of hours students spend studying and their corresponding exam scores. We want to understand if there’s a linear relationship between the number of hours studied and the exam scores.
First, let’s generate some sample data for this example:
# Importing the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generating sample data
np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)  # Mean=5, Std Dev=2
exam_scores = 70 + 5 * hours_studied + np.random.normal(0, 5, 100)  # Linear relationship with noise
# Creating a DataFrame
data = pd.DataFrame({'Hours_Studied': hours_studied, 'Exam_Scores': exam_scores})
# Visualizing the data
plt.scatter(data['Hours_Studied'], data['Exam_Scores'])
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
plt.title('Relationship between Hours Studied and Exam Scores')
plt.show()
Relationship between Hours Studied and Exam Scores (visualized on Google Colab).
Now that the sample data is generated, simple linear regression is performed to model the relationship between hours studied and exam scores:
from sklearn.linear_model import LinearRegression
# Extracting features (independent variable) and target (dependent variable)
X = data[['Hours_Studied']]  # Features (independent variable)
y = data['Exam_Scores']  # Target (dependent variable)
# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)
# Getting the coefficients
intercept = model.intercept_
slope = model.coef_[0]
print("Intercept:", intercept)
print("Slope:", slope)
Output:
Intercept: 68.94203500593281
Slope: 5.28674608386595
Once the regression model is trained using our data, we can plot the data points on a graph as a scatter plot and also plot the regression line that the model has estimated based on the relationship between the independent and dependent variables.
In the context of simple linear regression, the regression line represents the best-fitting straight line through the data points.
# Visualizing the regression line
plt.scatter(data['Hours_Studied'], data['Exam_Scores'])
plt.plot(data['Hours_Studied'], model.predict(X), color='red')  # Regression line
plt.xlabel('Hours Studied')
plt.ylabel('Exam Scores')
plt.title('Relationship between Hours Studied and Exam Scores')
plt.show()
Regression Line (visualized on Google Colab).
In this example, the intercept represents the predicted exam score when the number of hours studied is zero, and the slope represents the change in exam score associated with a one-unit increase in hours studied.
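As a quick worked example, the fitted line can be used to compute a prediction by hand. The sketch below plugs the intercept and slope from the printed output into the line equation. The value of 6 hours is an arbitrary input chosen for illustration and is not taken from the dataset.

```python
# Intercept and slope as printed in the output above
intercept = 68.94203500593281
slope = 5.28674608386595

# Hypothetical input chosen for illustration: a student who studies 6 hours
hours = 6

# Equation of the fitted line: score = intercept + slope * hours
predicted_score = intercept + slope * hours
print(f"Predicted exam score for {hours} hours: {predicted_score:.2f}")
```

The result is roughly 100.66. It exceeds 100 only because the synthetic data was generated without capping scores at 100, which is a property of this toy dataset rather than of regression itself.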
Both regression analysis and correlation examine relationships between variables, but they answer different questions. Regression analysis focuses on predicting the value of a dependent variable from one or more independent variables, whereas correlation measures the strength and direction of the linear relationship between two variables without implying causality or prediction.
Causality refers to the relationship between cause and effect, where a change in one variable (the cause) leads to a change in another variable (the effect).
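To make the distinction between correlation and regression concrete, here is a minimal sketch that computes both on the same synthetic data. It uses `np.corrcoef` and `np.polyfit` as stand-ins chosen for brevity (they are not the tools used elsewhere in this post), and it also shows how the two quantities are linked in simple linear regression.

```python
import numpy as np

# Same synthetic data as in the example above
np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)
exam_scores = 70 + 5 * hours_studied + np.random.normal(0, 5, 100)

# Correlation: a single unitless number between -1 and 1
r = np.corrcoef(hours_studied, exam_scores)[0, 1]

# Regression: a slope and intercept that can be used for prediction
slope, intercept = np.polyfit(hours_studied, exam_scores, 1)

# In simple linear regression the two are linked:
# slope = r * (std of y / std of x)
slope_from_r = r * exam_scores.std() / hours_studied.std()

print("Correlation coefficient r:", r)
print("Regression slope:", slope)
print("Slope rebuilt from r:", slope_from_r)
```

The correlation coefficient describes how tightly the points follow a line, while the slope describes how much the exam score changes per extra hour studied. Neither number, on its own, shows that studying causes higher scores.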
Types of Regression:
• Simple Linear Regression: Involves one independent variable and one dependent variable. The relationship between them is modeled as a straight line.
• Multiple Linear Regression: Includes multiple independent variables and one dependent variable. The relationship is modeled as a linear combination of the independent variables.
• Polynomial Regression: Used when the relationship between variables isn’t linear but can be approximated by a polynomial function.
• Logistic Regression: Used when the dependent variable is binary (e.g., yes/no, 0/1). It models the probability of occurrence of an event.
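The regression types listed above, other than simple linear regression, can be sketched with scikit-learn as follows. All data here is made up for illustration, and the variable names, coefficients, and the pass/fail threshold are assumptions rather than part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Multiple linear regression: two independent variables (made-up data)
X_multi = rng.normal(size=(200, 2))
y_multi = 3 + 2 * X_multi[:, 0] - 1 * X_multi[:, 1] + rng.normal(0, 0.1, 200)
multi_model = LinearRegression().fit(X_multi, y_multi)
print("Multiple regression coefficients:", multi_model.coef_)

# Polynomial regression: expand x into [x, x^2], then fit a linear model
x = rng.uniform(-3, 3, size=(200, 1))
y_poly = 1 + 2 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.1, 200)
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly_model = LinearRegression().fit(x_poly, y_poly)
print("Polynomial regression coefficients:", poly_model.coef_)

# Logistic regression: binary outcome (1 = pass, 0 = fail)
hours = rng.uniform(0, 10, size=(200, 1))
passed = (hours[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)
logit_model = LogisticRegression().fit(hours, passed)
prob_pass = logit_model.predict_proba([[8.0]])[0, 1]
print("Estimated probability of passing after 8 hours:", prob_pass)
```

Note that polynomial regression is still fitted with `LinearRegression`: the model is nonlinear in x but linear in its coefficients, which is why expanding the features is enough.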
Assumptions:
• Linearity: The relationship between independent and dependent variables is linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: Residuals (the differences between observed and predicted values) have constant variance. In simpler terms, it means that as the values of the independent variable change, the variability of the residuals around the regression line remains the same.
For example, suppose we’re examining the relationship between hours studied and exam scores. In a homoscedastic scenario, regardless of whether a student studies for a few or many hours, the differences between their actual exam scores and the scores predicted by the regression model would have a consistent spread or variance.
• Normality of Residuals: Residuals are normally distributed. If we were to plot a histogram or a density plot of the residuals, it would resemble a bell-shaped curve characteristic of a normal distribution. This implies that:
   • The majority of the residuals cluster around zero, indicating that the model’s predictions are, on average, close to the observed values.
   • As we move away from zero, the frequency of residuals decreases symmetrically, following the shape of a normal distribution.
Let’s visualize this:
# seaborn is needed for the plots below
import seaborn as sns
# Calculating residuals
residuals = y - model.predict(X)
# Plotting histogram of residuals
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True, color='skyblue')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')
plt.show()
# Plotting density plot of residuals
plt.figure(figsize=(8, 6))
sns.kdeplot(residuals, color='orange', fill=True)
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.title('Density Plot of Residuals')
plt.show()
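Both the histogram and the density plot are roughly bell-shaped, which suggests that the residuals are approximately normally distributed. Note that the assumption concerns the residuals, not the raw data, and that a visual check is indicative rather than conclusive.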
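Visual checks can be complemented with simple numerical ones. The following is a minimal sketch, not a full diagnostic workflow: it uses the Shapiro-Wilk test from SciPy for normality and a rough split-in-half comparison of residual spread for homoscedasticity. Both choices are assumptions made for illustration, and `np.polyfit` is used here in place of the `LinearRegression` model so that the block is self-contained.

```python
import numpy as np
from scipy import stats

# Rebuild the same synthetic data and fit so the sketch is self-contained
np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)
exam_scores = 70 + 5 * hours_studied + np.random.normal(0, 5, 100)
slope, intercept = np.polyfit(hours_studied, exam_scores, 1)
fitted = intercept + slope * hours_studied
residuals = exam_scores - fitted

# Normality: Shapiro-Wilk test (null hypothesis: residuals are normal)
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", shapiro_p)

# Homoscedasticity (rough check): compare residual spread for the
# lower and upper halves of the fitted values
order = np.argsort(fitted)
low_half = residuals[order[:50]]
high_half = residuals[order[50:]]
print("Residual std, low fitted values:", low_half.std(ddof=1))
print("Residual std, high fitted values:", high_half.std(ddof=1))
```

A large Shapiro-Wilk p-value means the test found no evidence against normality, and similar standard deviations in the two halves are consistent with constant variance. Neither result proves that the assumption holds.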
Process:
A typical process for any regression analysis includes the following key steps:
• Data Collection: Gather relevant data on the variables of interest.
• Data Preparation: Clean the data, handle missing values, and transform variables if necessary.
• Model Specification: Decide which independent variables to include in the model and choose the appropriate type of regression.
• Model Estimation: Use statistical software to estimate the parameters of the regression model.
• Model Evaluation: Assess the goodness of fit of the model, check for violations of assumptions, and interpret the results.
• Prediction and Inference: Use the model to make predictions or infer relationships between variables.
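The steps above can be sketched end to end as follows. This is one possible arrangement under stated assumptions: the data is the same synthetic dataset used earlier, the 80/20 train/test split is an addition that the original example does not use, and the 4-hour student is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Data collection: stand-in synthetic data, as in the example above
np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)
exam_scores = 70 + 5 * hours_studied + np.random.normal(0, 5, 100)
data = pd.DataFrame({'Hours_Studied': hours_studied, 'Exam_Scores': exam_scores})

# Data preparation: drop rows with missing values (none here, shown for completeness)
data = data.dropna()

# Model specification: one independent variable, simple linear regression
X = data[['Hours_Studied']]
y = data['Exam_Scores']

# Hold out 20% of the rows to evaluate on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model estimation
model = LinearRegression().fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print("Test R-squared:", test_r2)
print("Test mean squared error:", test_mse)

# Prediction for a new, hypothetical student who studies 4 hours
new_student = pd.DataFrame({'Hours_Studied': [4.0]})
print("Predicted score:", model.predict(new_student)[0])
```

Evaluating on held-out rows gives a more honest picture of predictive performance than scoring the model on the same data it was fitted to.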
Interpretations:
• Coefficient Estimates: Coefficients represent the change in the dependent variable associated with a one-unit change in the independent variable, holding other variables constant.
• R-squared: Indicates the proportion of variance in the dependent variable explained by the independent variables.
• P-values: Measure the statistical significance of the coefficients. Lower p-values indicate stronger evidence against the null hypothesis (that the coefficient is zero).
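These quantities can be read off a fitted model directly. The sketch below uses `scipy.stats.linregress`, chosen here as a compact option for simple linear regression (the original example uses scikit-learn, which does not report p-values), on the same synthetic data.

```python
import numpy as np
from scipy import stats

# Same synthetic data as in the example above
np.random.seed(0)
hours_studied = np.random.normal(5, 2, 100)
exam_scores = 70 + 5 * hours_studied + np.random.normal(0, 5, 100)

# Fit a simple linear regression and collect the summary statistics
result = stats.linregress(hours_studied, exam_scores)

print("Slope (coefficient estimate):", result.slope)
print("Intercept:", result.intercept)
print("R-squared:", result.rvalue ** 2)
print("P-value for the slope:", result.pvalue)
print("Standard error of the slope:", result.stderr)
```

With a single independent variable, R-squared is simply the square of the correlation coefficient, which is why it is computed from `result.rvalue` here. For models with several independent variables, a package such as statsmodels reports a coefficient, standard error, and p-value per variable.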
Applications:
• Predictive modeling
• Causal inference
• Forecasting
• Trend analysis
• Risk assessment
Software:
• Popular statistical software packages for regression analysis include R, Python (with libraries like NumPy, pandas, scikit-learn, and statsmodels), SAS, and SPSS.
Understanding the basics of regression analysis is therefore fundamental to conducting rigorous data analysis and making informed decisions across many fields.