Pejman Rezaei

Posted on Feb 1

Predicting House Prices as Your First ML Project

#machinelearning #ai #datascience #python

Machine Learning (ML) can seem intimidating at first, but the best way to learn is by doing. In this article, we’ll walk through a beginner-friendly ML project: predicting house prices using the Boston Housing dataset. By the end of this guide, you’ll have built your first ML model using Python and Scikit-learn. Let’s get started!

What is the Boston Housing Dataset?

The Boston Housing dataset is a classic dataset used for regression problems. It contains information about housing prices in the Boston area, along with features that might influence those prices, such as:

CRIM: Per capita crime rate by town.
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
LSTAT: Percentage of lower status of the population.
MEDV: Median value of owner-occupied homes in $1000s (the target variable we want to predict).

Our goal is to build a model that predicts the median house price (MEDV) based on these features.

Step 1: Set Up Your Environment

Before we start, make sure you have the necessary libraries installed. You can install them using pip:

pip install numpy pandas scikit-learn matplotlib

Step 2: Load the Dataset

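Older versions of Scikit-learn shipped this dataset as load_boston(), but that function was deprecated in version 1.0 and removed in version 1.2, so the import fails on a current install. Instead, we’ll read the raw file from its original source with Pandas and explore it from there.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the raw file: the first 22 lines are a text header,
# and each record is wrapped across two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Convert it to a Pandas DataFrame for easier manipulation
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
data = pd.DataFrame(features, columns=feature_names)
data['MEDV'] = target  # Add the target variable to the DataFrame

# Display the first few rows
print(data.head())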

Step 3: Explore the Data

Before building a model, it’s important to understand the data. Let’s perform some basic exploratory data analysis (EDA).

# Check for missing values
print(data.isnull().sum())

# Get basic statistics
print(data.describe())

# Visualize the relationship between features and the target variable
plt.scatter(data['RM'], data['MEDV'])
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Median House Price (MEDV)')
plt.title('Rooms vs. Price')
plt.show()

From the scatter plot, you can see that houses with more rooms tend to have higher prices. This is a good sign that RM, at least, carries useful signal for predicting MEDV.
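To put a number on the trend a scatter plot shows, one option is the Pearson correlation coefficient. The sketch below uses synthetic data rather than the real dataset (the `rooms` and `price` arrays, and the slope used to generate them, are made up for illustration), so it runs without any download.

```python
import numpy as np

# Synthetic stand-in for RM and MEDV; NOT the real Boston data
rng = np.random.default_rng(0)
rooms = rng.uniform(4.0, 9.0, size=200)
price = 9.0 * rooms - 34.0 + rng.normal(0.0, 3.0, size=200)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend
corr = np.corrcoef(rooms, price)[0, 1]
print(f"Correlation between rooms and price: {corr:.2f}")
```

On the real DataFrame, `data.corr()['MEDV']` reports the same measure for every feature at once.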

Step 4: Prepare the Data

Next, we’ll split the data into features (X) and the target (y), and then split it into training and testing sets so that the model can be evaluated on rows it has never seen.

from sklearn.model_selection import train_test_split

# Features (X) and labels (y)
X = data.drop('MEDV', axis=1)  # All columns except 'MEDV'
y = data['MEDV']  # Only the 'MEDV' column

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
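To see what the split produces, here is a small sketch using placeholder arrays with the same shape as the dataset (506 rows, 13 features). The `_demo` names are hypothetical and the zeros are only there so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape: 506 rows, 13 features
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# test_size=0.2 holds out 20% of the rows (rounded up) for evaluation
print(X_tr.shape, X_te.shape)  # (404, 13) (102, 13)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same training and testing rows.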

Step 5: Build and Train the Model

We’ll use a Linear Regression model, which is a simple and effective algorithm for regression problems.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
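Linear regression picks an intercept and one coefficient per feature so that the sum of squared errors on the training data is as small as possible. Here is a minimal NumPy sketch of that idea on synthetic data (the "true" coefficients are invented for the demo). Scikit-learn's implementation differs in detail, but it solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
# Invented ground truth: y = 3 + 2*x0 - 1*x1, plus a little noise
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0.0, 0.1, size=100)

# Prepend a column of ones so the first weight acts as the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

intercept, coefs = weights[0], weights[1:]
print(f"intercept={intercept:.2f}, coefficients={np.round(coefs, 2)}")
```

After fitting a Scikit-learn model, `model.intercept_` and `model.coef_` hold the corresponding values.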

Step 6: Evaluate the Model

To see how well our model performs, we’ll calculate two common metrics: Mean Squared Error (MSE) and R-squared (R²).

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Calculate R-squared (R²)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

MSE: Measures the average squared difference between the predicted and actual values. Lower is better.
R²: Represents the proportion of variance in the target variable that’s explained by the model. Closer to 1 is better.
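Both metrics are simple enough to compute by hand, which can help make the definitions concrete. The four values below are made up so the arithmetic is easy to follow.

```python
import numpy as np

# Tiny made-up example: actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual:.2f}")        # MSE: 0.50
print(f"R-squared: {r2_manual:.2f}")   # R-squared: 0.90
```

A model that always predicts the mean of `y_true` would have `ss_res == ss_tot`, which gives an R² of 0.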

Step 7: Interpret the Results

Let’s interpret the results:

A low MSE indicates that the model’s predictions are close to the actual values. MSE is expressed in squared units of the target, so judge it against the spread of MEDV reported by data.describe() rather than in isolation.
An R² value close to 1 suggests that the model explains a large portion of the variance in house prices.

For example, if your R² is 0.75, it means that the model accounts for 75% of the variability in the test-set house prices.
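One way to make the MSE easier to read is to take its square root (RMSE), which is back in the target's units, thousands of dollars here. The MSE value below is hypothetical and chosen only to show the conversion. It is not a result from this model.

```python
import math

# Hypothetical MSE value, used only to illustrate the conversion
example_mse = 25.0

# RMSE is in the same units as MEDV (thousands of dollars)
rmse = math.sqrt(example_mse)
print(f"RMSE: {rmse:.1f} (about ${rmse * 1000:,.0f} per prediction)")
```

With your own results, replace `example_mse` with the `mse` value computed in Step 6.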

Step 8: Make Predictions

Now that the model is trained, you can use it to make predictions on new data. For example, let’s predict the price of a house with the following features:

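# Example input (replace with your own values), in the same column order as X.
# Using a DataFrame with X's column names avoids a feature-name warning.
new_house = pd.DataFrame(
    [[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]],
    columns=X.columns,
)

# Predict the price (MEDV is in $1000s, so multiply by 1000)
predicted_price = model.predict(new_house)
print(f"Predicted Price: ${predicted_price[0] * 1000:.2f}")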
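For a linear model, a prediction is just the intercept plus each feature value multiplied by its learned coefficient. The sketch below checks this on a small synthetic model (the data, the coefficients used to generate it, and the `demo_model` and `x_new` names are all invented for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data with invented coefficients, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -0.5, 0.8]) + rng.normal(0.0, 0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# One new sample with three made-up feature values
x_new = np.array([[0.2, -1.0, 0.5]])

# predict() amounts to: intercept + sum(coefficient * feature value)
manual = demo_model.intercept_ + x_new @ demo_model.coef_
print(np.allclose(manual, demo_model.predict(x_new)))
```

The same check works on the housing model by substituting `model` and a row of `X_test`.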

Real-World Applications

Predicting house prices is just one example of how ML can be applied in the real world. Here are some other applications of regression models:

Stock Price Prediction: Predicting the future price of stocks based on historical data.
Sales Forecasting: Estimating future sales based on past trends and external factors.
Healthcare: Predicting patient outcomes based on medical data.

Conclusion

We just built our first ML project using the Boston Housing dataset. Here’s a quick recap of what we covered:

Loaded and explored the dataset.
Prepared the data for training.
Built and trained a Linear Regression model.
Evaluated the model’s performance.
Made predictions on new data.

This is just the beginning of your ML journey. As you continue learning, you can explore more advanced algorithms, work with larger datasets, and tackle real-world problems.

If you have any questions or want to share your results, feel free to leave a comment below.

Follow Me

For more updates, check out my Mastodon blog: @prezaei@mastodon.social.
