Machine Learning (ML) can seem intimidating at first, but the best way to learn is by doing. In this article, we’ll walk through a beginner-friendly ML project: predicting house prices using the Boston Housing dataset. By the end of this guide, you’ll have built your first ML model using Python and Scikit-learn. Let’s get started!
What is the Boston Housing Dataset?
The Boston Housing dataset is a classic dataset used for regression problems. It contains 506 records of housing prices in the Boston area, along with 13 features that might influence those prices. Some of the columns we’ll refer to are:
- CRIM: Per capita crime rate by town.
- RM: Average number of rooms per dwelling.
- AGE: Proportion of owner-occupied units built before 1940.
- DIS: Weighted distances to five Boston employment centers.
- LSTAT: Percentage of lower status of the population.
- MEDV: Median value of owner-occupied homes in $1000s (the target variable we want to predict).
Our goal is to build a model that predicts the median house price (MEDV) based on these features.
Step 1: Set Up Your Environment
Before we start, make sure you have the necessary libraries installed. You can install them using pip:
pip install numpy pandas scikit-learn matplotlib
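To confirm that the installation worked, a quick check like the one below prints the version of each library. This is only a sketch; the versions you see depend on your environment, and the versions dictionary is a name introduced here for illustration.

```python
import matplotlib
import numpy
import pandas
import sklearn

# Collect the installed version of each library used in this guide
# ("versions" is an illustrative name, not part of any library)
versions = {
    "numpy": numpy.__version__,
    "pandas": pandas.__version__,
    "scikit-learn": sklearn.__version__,
    "matplotlib": matplotlib.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any import fails, re-run the pip command above inside the same Python environment you use to run the code.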
Step 2: Load the Dataset
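Scikit-learn used to ship the Boston Housing dataset through load_boston, but that function was deprecated in version 1.0 and removed in version 1.2. On current versions, we can load the data from its original source instead; the parsing code below follows the snippet suggested in scikit-learn's deprecation notice. Let’s load it and explore the data.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the raw file; each record is split across two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Stitch the two lines of each record back together (13 features + target)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
# Column names in the order used by the original dataset
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# Convert it to a Pandas DataFrame for easier manipulation
data = pd.DataFrame(features, columns=feature_names)
data['MEDV'] = target # Add the target variable to the DataFrame
# Display the first few rows
print(data.head())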
Step 3: Explore the Data
Before building a model, it’s important to understand the data. Let’s perform some basic exploratory data analysis (EDA).
# Check for missing values
print(data.isnull().sum())
# Get basic statistics
print(data.describe())
# Visualize the relationship between features and the target variable
plt.scatter(data['RM'], data['MEDV'])
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Median House Price (MEDV)')
plt.title('Rooms vs. Price')
plt.show()
From the scatter plot, you can see that houses with more rooms tend to have higher prices. This suggests that RM is a useful predictor of MEDV, though a single plot says nothing about the other features.
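A scatter plot covers one feature at a time. One way to get a quick overview of all features is to compute each column's correlation with the target. The sketch below uses a small synthetic DataFrame as a stand-in for the housing data (the values and the demo name are made up for illustration), so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, size=200)      # rooms per dwelling
lstat = rng.normal(12.0, 5.0, size=200)  # percentage of lower-status population
noise = rng.normal(0.0, 2.0, size=200)

demo = pd.DataFrame({"RM": rm, "LSTAT": lstat})
demo["MEDV"] = 9.0 * rm - 0.6 * lstat - 25.0 + noise

# Pearson correlation of every feature with the target, highest first
correlations = demo.corr()["MEDV"].drop("MEDV").sort_values(ascending=False)
print(correlations)
```

On the real DataFrame, the same data.corr()['MEDV'] call would rank all 13 features. Keep in mind that correlation only captures linear relationships.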
Step 4: Prepare the Data
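Next, we’ll separate the features (X) from the target (y), and then split both into training and testing sets. The model learns from the training set only; the testing set is held back to measure how well it generalizes to unseen data.
from sklearn.model_selection import train_test_split
# Features (X) and target (y)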
X = data.drop('MEDV', axis=1) # All columns except 'MEDV'
y = data['MEDV'] # Only the 'MEDV' column
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
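To see what test_size and random_state do, here is a minimal sketch on placeholder arrays with the same shape as the housing data (506 rows, 13 features). The X_demo and y_demo names are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the housing data (not the real values)
X_demo = np.zeros((506, 13))
y_demo = np.arange(506)

# test_size=0.2 holds back 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)

# Repeating the split with the same random_state selects the same rows
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(y_te, y_te_again))  # True
```

Fixing random_state makes your results reproducible, which helps when comparing models or sharing your numbers with others.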
Step 5: Build and Train the Model
We’ll use a Linear Regression model, which predicts the target as a weighted sum of the features plus an intercept. It is simple, fast to train, and a sensible first baseline for regression problems.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
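It can help to see what fitting actually learns. The sketch below builds synthetic data from a known formula (the coefficients 3, -2 and the intercept 5 are chosen here for illustration) and checks that LinearRegression recovers them through its coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear formula (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# One learned weight per feature, plus the intercept
print(demo_model.coef_)       # approximately [ 3. -2.]
print(demo_model.intercept_)  # approximately 5.0
```

For the housing model, pd.Series(model.coef_, index=X.columns) would pair each weight with its feature name. Because the features are on different scales, the raw weights are not directly comparable to each other.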
Step 6: Evaluate the Model
To see how well our model performs, we’ll calculate two common metrics: Mean Squared Error (MSE) and R-squared (R²).
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")
- MSE: Measures the average squared difference between the predicted and actual values. Lower is better.
- R²: Represents the proportion of variance in the target variable that’s explained by the model. Closer to 1 is better.
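Both metrics are short formulas. As a sketch, the code below computes them by hand with NumPy on four made-up values and checks the results against scikit-learn's functions. The arrays and variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Four made-up actual values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f}")  # MSE: 0.375
print(f"R²:  {r2_manual:.3f}")   # R²:  0.949

# The hand-computed values match scikit-learn's
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))             # True
```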
Step 7: Interpret the Results
Let’s interpret the results:
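- A low MSE indicates that the model’s predictions are close to the actual values. MSE is in squared units of the target (here, squared thousands of dollars), so its square root (RMSE) is easier to read: it is a typical prediction error in the same units as MEDV.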
- An R² value close to 1 suggests that the model explains a large portion of the variance in house prices.
For example, if your R² is 0.75, it means that the model explains 75% of the variability in the test set’s house prices. An R² of 0 would mean the model does no better than always predicting the average price.
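Whether an error is "low" is easier to judge next to a baseline. The sketch below is one possible approach, not part of the original walkthrough: it compares a Linear Regression model with scikit-learn's DummyRegressor, which always predicts the training mean, and reports RMSE alongside R². The data is synthetic, and names such as results are introduced here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical, standing in for the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 22.0 + X_demo @ np.array([4.0, -3.0, 1.0]) + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

estimators = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}

results = {}
for name, estimator in estimators.items():
    pred = estimator.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))  # same units as the target
    results[name] = {"rmse": rmse, "r2": r2_score(y_te, pred)}
    print(f"{name}: RMSE = {rmse:.2f}, R² = {results[name]['r2']:.2f}")
```

If your real model's RMSE is not clearly below the baseline's, the features are adding little predictive value.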
Step 8: Make Predictions
Now that the model is trained, you can use it to make predictions on new data. For example, let’s predict the price of a house with the following features:
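# Example input: one row with the 13 features in the same column order as X
# (replace with your own values). Using a DataFrame with X's column names
# avoids the feature-name warning you would get from a plain NumPy array.
new_house = pd.DataFrame([[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]], columns=X.columns)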
# Predict the price
predicted_price = model.predict(new_house)
print(f"Predicted Price: ${predicted_price[0] * 1000:.2f}")
Real-World Applications
Predicting house prices is just one example of how ML can be applied in the real world. Here are some other applications of regression models:
- Stock Price Prediction: Predicting the future price of stocks based on historical data.
- Sales Forecasting: Estimating future sales based on past trends and external factors.
- Healthcare: Predicting patient outcomes based on medical data.
Conclusion
You just built your first ML project using the Boston Housing dataset. Here’s a quick recap of what we covered:
- Loaded and explored the dataset.
- Prepared the data for training.
- Built and trained a Linear Regression model.
- Evaluated the model’s performance.
- Made predictions on new data.
This is just the beginning of your ML journey. As you continue learning, you can explore more advanced algorithms, work with larger datasets, and tackle real-world problems.
If you have any questions or want to share your results, feel free to leave a comment below.
Follow Me
For more updates, check out my Mastodon blog: @prezaei@mastodon.social.