Attention: This project is not affiliated with Netflix; it's just a fun experiment.
Why We Need a Better Netflix
Netflix is the undisputed leader in streaming, with over 200 million subscribers. It offers a library of more than 17,000 movies and TV shows from across the globe, and this catalog continues to expand.
One of the major features of the platform is its recommendation system. According to Netflix, the platform suggests new content based on your viewing history, similar shows, trends from other users, and popular titles. But what if we could replicate this system without relying too heavily on extensive data sets or complex algorithms?
In this project, we'll try to build a simple movie recommendation system that includes direct search, a vector-based recommendation engine, and recommendations driven by AI-generated metadata.
The Tech Stack
For this application, the foundation is built on Next.js, TailwindCSS, and Shadcn UI. The idea for the project came from a talk by Tejas Kumar, where he developed a movie recommendation system using Datastax. Our tech stack, however, draws significant inspiration from the book inventory project by Lee Robinson.
To ensure fast performance, we'll lean on Next.js's server-side rendering, its Image component for optimized poster images, and aggressive prefetching of each page.
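As a small illustration (the ShowCard component and its props are hypothetical), a poster card can combine the Image component with a prefetched link:

import Image from "next/image";
import Link from "next/link";

// Hypothetical card component: Next.js optimizes the poster image and
// prefetches the linked detail page (assuming the poster's domain is
// allowed in next.config).
export function ShowCard({ id, title, poster }: { id: number; title: string; poster: string }) {
  return (
    <Link href={`/shows/${id}`} prefetch={true}>
      <Image src={poster} alt={title} width={300} height={450} />
    </Link>
  );
}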
Data storage will be handled by a Neon cloud PostgreSQL database, while the Vercel AI SDK will power our embeddings and search queries. Finally, Drizzle will serve as the ORM to interact with the database.
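For reference, here is a minimal sketch of the database client that the later scripts import from @/lib/db (the file path and the DATABASE_URL variable name are assumptions):

// lib/db/index.ts
import { neon } from "@neondatabase/serverless";
import { drizzle } from "drizzle-orm/neon-http";

// Neon exposes the connection string as an environment variable
const sql = neon(process.env.DATABASE_URL!);

export const db = drizzle(sql);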
Feeding the Data
We’ll start by using an open-source dataset containing over 8,000 movies and TV shows, available on Kaggle. The dataset is in CSV format, which we can import directly into the database to get started. Additionally, the OMDB API is used to fetch poster images and IMDb scores for each title.
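Fetching those extra fields can be done with one OMDB request per title. A sketch, assuming an OMDB_API_KEY environment variable; Poster, imdbRating, and imdbID are fields of the OMDB JSON response:

// Look up poster and IMDb data for a single title via the OMDB API
async function fetchOmdbData(title: string) {
  const url = `https://www.omdbapi.com/?apikey=${process.env.OMDB_API_KEY}&t=${encodeURIComponent(title)}`;
  const res = await fetch(url);
  const data = await res.json();
  return {
    poster: data.Poster,
    imdbScore: data.imdbRating,
    imdbId: data.imdbID,
  };
}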
The demo CSV file uses cast as a column name, but in the database we'll use _cast instead, to avoid conflicts with the cast function.
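The COPY command below needs a staging table to load into. A minimal sketch of such a table (the column types are assumptions; plain text is enough for a throwaway import table):

CREATE TABLE tmp_titles (
  show_id text,
  title text,
  description text,
  director text,
  genres text,
  _cast text,
  production_country text,
  release_date text,
  rating text,
  duration text
);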
COPY tmp_titles(show_id, title, description, director, genres, _cast, production_country, release_date, rating, duration)
FROM '/data/netflix_titles.csv'
DELIMITER ','
CSV HEADER;
We'll need to establish relationships between our tables, so we'll create a few additional tables alongside a shows table, which serves as the main source of data for our application. Here's how we can define the shows table:
CREATE TABLE "shows" (
"id" serial PRIMARY KEY NOT NULL,
"title" text NOT NULL,
"poster" text, -- url to the poster image
"description" text, -- short description of the show
"release_year" integer, -- year of release
"duration" text, -- duration of the show
"imdb_score" numeric, -- imdb score of the show
"imdb_id" text, -- imdb id of the show
"content_type" text -- type of the show (movie or tv show)
);
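Because Drizzle is our ORM, the shows table also needs a matching schema definition on the TypeScript side. Here is a sketch of what that could look like (the lib/db/schema.ts path is an assumption; the denormalized columns used later, such as search_text, actors, directors, categories, and embedding, can be added the same way):

// lib/db/schema.ts
import { pgTable, serial, text, integer, numeric } from "drizzle-orm/pg-core";

export const shows = pgTable("shows", {
  id: serial("id").primaryKey(),
  title: text("title").notNull(),
  poster: text("poster"),
  description: text("description"),
  releaseYear: integer("release_year"),
  duration: text("duration"),
  imdbScore: numeric("imdb_score"),
  imdbId: text("imdb_id"),
  contentType: text("content_type"),
});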
Next, we'll create a table for actors and set up a many-to-many relationship between the shows and actors tables:
CREATE TABLE "actors" (
"id" serial PRIMARY KEY NOT NULL,
"name" text NOT NULL,
CONSTRAINT "actors_name_unique" UNIQUE("name")
);
CREATE TABLE "show_actors" (
"show_id" integer NOT NULL,
"actor_id" integer NOT NULL,
CONSTRAINT "show_actors_show_id_actor_id_pk" PRIMARY KEY("show_id", "actor_id")
);
This allows us to establish connections between shows and their actors. We can then link each actor to the shows they've appeared in by matching rows from the tmp_titles table against the shows table and splitting the _cast column into individual names.
-- Insert actors from tmp_titles._cast into actors table
WITH split_cast AS (
SELECT DISTINCT -- Trim whitespace and split the _cast string on commas
trim(unnest(string_to_array(_cast, ','))) as actor_name
FROM tmp_titles
WHERE _cast IS NOT NULL
)
INSERT INTO actors (name)
SELECT actor_name
FROM split_cast
WHERE actor_name != '' -- Skip empty strings
ON CONFLICT (name) DO NOTHING;
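To actually associate actors with their shows, a second insert can split the _cast column again and join on the names we just stored. A sketch, assuming titles are distinctive enough to match rows between tmp_titles and shows:

-- Link each show to its actors via the join table
INSERT INTO show_actors (show_id, actor_id)
SELECT DISTINCT s.id, a.id
FROM tmp_titles t
JOIN shows s ON s.title = t.title
CROSS JOIN LATERAL unnest(string_to_array(t._cast, ',')) AS actor_name
JOIN actors a ON a.name = trim(actor_name)
WHERE t._cast IS NOT NULL
ON CONFLICT DO NOTHING;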
This same process can be applied to the categories, directors, and countries tables.
Full Text Search
To improve search capabilities, we'll add a search_text column to the shows table. This column stores a combination of text from the title, description, content_type, actors, directors, and categories columns, making it easier to perform full-text search queries.
We'll also enable PostgreSQL's pg_trgm extension, which breaks text into three-character sequences (trigrams) for fuzzy matching, and pair it with a GIN index over a tsvector of the search_text column to speed up full-text queries.
-- Create the extension for text search
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Add the search_text column
ALTER TABLE "shows"
ADD COLUMN "search_text" text;
-- Create the index for the search_text column
CREATE INDEX "idx_shows_search_text_tsv" ON "shows" USING gin (to_tsvector('english', "search_text"));
Once this migration is completed, we can populate the search_text column using a simple script:
import { eq } from "drizzle-orm";
import { db } from "@/lib/db";
import { shows } from "@/lib/db/schema";

const allShows = await db.select().from(shows);

for (const show of allShows) {
  // Combine the searchable fields into one normalized, lowercase string
  const searchText = `
    ${show.title}
    ${show.contentType || ""}
    ${show.actors || ""}
    ${show.directors || ""}
    ${show.categories || ""}
  `
    .toLowerCase()
    .replace(/\n/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  await db.update(shows).set({ searchText }).where(eq(shows.id, show.id));
}
Now, with everything set up, we can perform full-text searches against the search_text column on the server side:
import { desc, sql } from "drizzle-orm";

const exactMatches = await db
  .select({
    id: shows.id,
    title: shows.title,
    poster: shows.poster,
    imdbScore: shows.imdbScore,
  })
  .from(shows)
  .where(
    sql`to_tsvector('english', search_text) @@ to_tsquery('english', ${freeTextQuery})`
  )
  .orderBy(desc(shows.imdbScore))
  .limit(10);
The to_tsvector and to_tsquery functions convert the stored search_text and the user's query into normalized lexemes, which PostgreSQL can then match against each other.
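Note that to_tsquery expects operator syntax and throws an error on plain multi-word input, so the freeTextQuery above needs to be built from the user's raw search string. One possible sketch (the toTsQuery helper is hypothetical):

// Turn free-form input into a tsquery string where every term must match
function toTsQuery(input: string): string {
  return input
    .trim()
    .split(/\s+/)
    .map((term) => term.replace(/[^\w]/g, "")) // strip tsquery operator characters
    .filter(Boolean)
    .join(" & ");
}

const freeTextQuery = toTsQuery("strong female lead drama");
// => "strong & female & lead & drama"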
Vector Search with AI Embeddings
While full-text search provides powerful functionality, it doesn’t fully meet our needs. We also want to support semantic searches, where users can find shows that are similar to a given title or based on abstract criteria.
This is where the Vercel AI SDK comes in. With the AI SDK, we can easily create a client that generates embeddings—vector representations of the metadata—enabling semantic search capabilities.
For the structured object generation model (used with the AI SDK's generateObject), we'll use Mistral:
import { mistral } from "@ai-sdk/mistral";
export const objectModel = mistral("mistral-large-latest", {});
Next, we define a Zod schema for the show metadata that the AI model will generate:
import { z } from "zod";

const showMetadataSchema = z.object({
mood: z.array(z.string()).optional(),
targetAudience: z.array(z.string()),
themes: z.array(z.string()),
characters: z.array(z.string()),
setting: z.array(z.string()),
viewerEngagement: z.array(z.string()),
plotComplexity: z.array(z.string()).optional(),
plot: z.string().optional(),
eras: z.array(z.string()).optional(),
});
Using this schema, we can ask the AI model to analyze a show’s information and generate metadata in the required format.
import { generateObject } from "ai";

const result = await generateObject({
// @ts-ignore
model: objectModel,
schemaName: "showMetadata",
schemaDescription: "Information for semantic search",
schema: showMetadataSchema,
prompt: `
Analyze the following show information and generate metadata, return the metadata as a JSON object.
plotComplexity should be a list of keywords that describe the plot complexity of the show.
characters should be a list of character names.
setting should be a list of keywords that describe the setting of the show. Use countries, places, cities or other locations that are relevant to the show.
eras should be a list of keywords that describe the eras of the show.
viewerEngagement should be a list of single keywords that describe the viewer engagement of the show, how the show sticks with the viewer.
targetAudience should be a list of single keywords that describe the target audience of the show.
mood should be a list of single keywords that describe the mood of the show.
plot should be a single sentence that describes the plot of the show.
Here is the show information:
Title: ${show.title}
Description: ${show.description || "N/A"}
IMDB Rating: ${show.imdbScore || "N/A"}
Type: ${show.contentType || "N/A"}
Actors: ${show.actors || "N/A"}
Categories: ${show.categories || "N/A"}
Directors: ${show.directors || "N/A"}
Duration: ${show.duration || "N/A"}
Genres: ${show.genres || "N/A"}
Based on this information, provide the metadata as specified in the schema.
`,
});
The result could look like this for Breaking Bad:
{
"mood": [
"Dark",
"Intense",
"Drama"
],
"targetAudience": [
"Adult",
"Crime Enthusiasts",
"Drama Lovers"
],
"themes": [
"Crime",
"Drugs",
"Family",
"Morality"
],
"characters": [
"Walter White",
"Jesse Pinkman",
"Skyler White",
"Hank Schrader",
"Saul Goodman",
"Gustavo Fring",
"Mike Ehrmantraut",
"Walter White Jr.",
"Marie Schrader",
"Steven Gomez"
],
"setting": [
"Albuquerque",
"New Mexico",
"United States"
],
"viewerEngagement": [
"Binge-worthy",
"Edge of your seat",
"Thought-provoking"
],
"plotComplexity": [
"Character-driven",
"Moral dilemmas",
"Twists and turns"
],
"plot": "A high school chemistry teacher dying of cancer teams with a former student to secure his family's future by manufacturing and selling crystal meth.",
"eras": [
"2000s",
"Contemporary"
]
}
With this metadata in hand, we can store an embedding for each show in the shows table and enable vector search. First, we create the necessary extension and add a new column to the table. The embedding column is a vector of size 1024, which matches the output dimension of the mistral-embed model.
-- Create the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Add the embedding column
ALTER TABLE "shows"
ADD COLUMN "embedding" vector(1024);
-- Create the index for the embedding column
CREATE INDEX "embeddingIndex" ON "shows" USING ivfflat ("embedding" vector_cosine_ops);
Once the extension is set up, we generate and store the embeddings using the AI SDK:
import { embed } from "ai";
import { eq } from "drizzle-orm";
import { mistral } from "@ai-sdk/mistral";
import { db } from "@/lib/db";
import { shows } from "@/lib/db/schema";

const embeddingModel = mistral.textEmbeddingModel("mistral-embed");

// Embed a single show first to verify the pipeline
const [show] = await db.select().from(shows).limit(1);

const { embedding } = await embed({
  model: embeddingModel,
  value: `
    Title: ${show.title}
    Description: ${show.description || "N/A"}
    IMDB Rating: ${show.imdbScore || "N/A"}
    Type: ${show.contentType || "N/A"}
    Actors: ${show.actors || "N/A"}
    Categories: ${show.categories || "N/A"}
    Directors: ${show.directors || "N/A"}
  `,
});

await db.update(shows).set({ embedding }).where(eq(shows.id, show.id));
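The metadata generated in the previous section can also be folded into the text that gets embedded, so the vector picks up moods, themes, and settings as well. A sketch, where result is the generateObject response from earlier:

// Flatten the AI-generated metadata into extra lines for the embedding input
const metadata = result.object;
const metadataText = [
  `Mood: ${(metadata.mood || []).join(", ")}`,
  `Themes: ${metadata.themes.join(", ")}`,
  `Setting: ${metadata.setting.join(", ")}`,
  `Target audience: ${metadata.targetAudience.join(", ")}`,
  `Plot: ${metadata.plot || ""}`,
].join("\n");
// ...then append metadataText to the `value` passed to embed() above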
With the embeddings in place, we can query the nearest neighbors for any given show:
const query = "A show with strong female characters";
const embedding = await embed(query);
const results = await db
.select({
id: shows.id,
title: shows.title,
poster: shows.poster,
imdbScore: shows.imdbScore,
})
.from(shows)
.where(sql`vector_cosine_ops(embedding, vector(${embedding})) > 0.5`);
// Returns Borgen, Gilmore Girls, La Femme etc.
Moving Forward
If you want to try this out for yourself, you can clone the repository and run the project locally. You can find the repository here and the live demo here.
Try adding a new feature to the project: for example, a new type of recommendation, or improved search using different models such as OpenAI or Gemini. You can also expand the dataset with more shows and movies, or enrich each show with additional metadata.
Final Thoughts
In conclusion, by leveraging simple yet effective technologies like PostgreSQL, Next.js, and AI-driven embeddings, we can build a highly functional and scalable recommendation engine for any data-heavy platform.
The combination of full-text search and semantic AI search provides a more robust, personalized experience, allowing users to discover content that resonates with them based on both explicit and inferred preferences.