I'm sure you have heard of Perplexity by now. It's an AI chatbot-style search engine that combines traditional search engines with Large Language Models (LLMs) to answer your questions. Think of it as ChatGPT but entirely focused on searching the web and answering your questions.
It's an interesting tool and I just recently gave it a try after hearing about it all over the place. My initial reaction was that "this is pretty cool" but my engineering brain got the best of me and I started thinking about how it works and how I could build something similar for myself.
As it turns out, it's actually quite easy to build something similar yourself. Of course, Perplexity has a lot of things going for it given the size of their team and the resources they have access to, but the core functionality? There's no moat there.
In this post, I'm going to walk you through the process of building a very simple version of Perplexity for yourself. Keep in mind that this is going to be a very trivial implementation for the sake of simplicity and that you can take each step and make it as complex as you want.
Installing Dependencies
Before we can start coding, we need to install the necessary dependencies. These are the tools and libraries that our AI agent will use to search the web and convert the results to markdown.
We'll need to install the following packages:
- `duck-duck-scrape`: This package allows our agent to search the web using DuckDuckGo.
- `agentmarkdown`: This package helps convert HTML to markdown format.
- `ai`: This package provides tools for generating text and working with AI models.
- `zod`: This package is used for runtime type checking and validation.
- `@ai-sdk/google`: This package allows us to use Google's Generative AI models.
To install these dependencies, we'll use npm, the package manager for JavaScript. We'll run the following command in our project directory:
npm install duck-duck-scrape agentmarkdown ai zod @ai-sdk/google
Once the installation is complete, we can start writing our code.
Breaking Down the Code
Now, let's walk through the code step by step.
Importing Dependencies
First, we need to import the necessary dependencies. Duh.
import * as DDG from "duck-duck-scrape";
import { AgentMarkdown } from "agentmarkdown";
import { generateText, tool } from "ai";
import { z } from "zod";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
Here, we're importing all the functions and classes we'll need from our dependencies. This includes the `DDG` object for web searching via DuckDuckGo, `AgentMarkdown` for converting HTML to markdown, and various functions from the `ai` package for generating text and working with AI models.
Setting Up Google Generative AI
Next, we set up Google's Generative AI model:
const google = createGoogleGenerativeAI({
apiKey: process.env.GEMINI_API_KEY,
});
This code creates a Google Generative AI provider instance using an API key stored in an environment variable. We'll use it later to pick the model that generates text based on the information our agent finds. Make sure to set the `GEMINI_API_KEY` environment variable in your system before running the code.
Fetching and Converting to Markdown
Next, we need a function that will fetch a web page and convert it to markdown. This is a very simple function that uses the `fetch` function to get the HTML content of a web page and then uses the `AgentMarkdown` class to convert the HTML to markdown. If there's an error during this process, it's logged to the console and an empty string is returned.
async function fetchAndConvertToMarkdown(url: string): Promise<string> {
try {
const response = await fetch(url);
const html = await response.text();
const markdown = await AgentMarkdown.render({ html });
return markdown.markdown;
} catch (error) {
console.error(`Error fetching ${url}:`, error);
return "";
}
}
As I said earlier, this is a trivial implementation and will not handle all the edge cases, such as rendering client-side JavaScript or handling redirects. You can always improve this, either by adding a headless browser or by using an external service to fetch the web page. Anyway, this will do for now.
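If you did want to handle JavaScript-heavy pages, swapping the plain `fetch` call for a headless browser is one option. Here's a rough sketch of what that could look like with Puppeteer. Note that this is just an illustration and assumes you've installed the `puppeteer` package, which isn't part of the code in this post:

```typescript
import puppeteer from "puppeteer";
import { AgentMarkdown } from "agentmarkdown";

// Hypothetical alternative to fetchAndConvertToMarkdown that renders
// client-side JavaScript before converting the page to markdown.
async function fetchWithBrowser(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-rendered content is present
    await page.goto(url, { waitUntil: "networkidle0", timeout: 30_000 });
    const html = await page.content();
    const markdown = await AgentMarkdown.render({ html });
    return markdown.markdown;
  } catch (error) {
    console.error(`Error fetching ${url}:`, error);
    return "";
  } finally {
    await browser.close();
  }
}
```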
Once we have the function to fetch and convert the web page to markdown, we can define the search tool.
Defining the Search Tool
Tools are a way to add custom functionality to an AI agent. In this case, we want to give our agent the ability to search the web and get the results in markdown so it can answer our questions.
To define a tool, we're using the `tool` function from the `ai` package. We pass in a description of the tool and the parameters it expects. In this case, we're expecting a query string.
const search = tool({
description: "Search the web for information",
parameters: z.object({
query: z.string(),
}),
execute: async ({ query }) => {
const count = 3;
console.log("Searching for", query);
const searchResults = await DDG.search(query, {
safeSearch: DDG.SafeSearchType.STRICT,
});
const result = searchResults.results.slice(0, count);
const urls = result.map((r) => r.url);
const markdownResults = await Promise.all(
urls.map(async (url) => {
const markdown = await fetchAndConvertToMarkdown(url);
return markdown;
})
);
const markdown = markdownResults.join("\n\n");
return markdown;
},
});
This tool takes a query as input and uses the `DDG` object to search the web. It retrieves the top three results, extracts their URLs, and then uses the `fetchAndConvertToMarkdown` function to convert each page to markdown. The resulting markdown strings are joined together and returned as the final output.
The Main Function
Next, we can define our AI agent and put it all together. As you can see, we're using the `generateText` function from the `ai` package to define the agent. The `tools` parameter is where we define the additional functionality we want to give the agent. In this case, we're adding the `search` tool we defined earlier.
I'm also going to use Gemini 2.0 Flash as my Large Language Model (LLM) for this example, mainly because of its massive context window compared to other models such as GPT-4o or Claude 3.5 Sonnet.
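That said, the AI SDK makes swapping models easy. For example, if you'd rather use an OpenAI model, something like the following should work. This is just a sketch and assumes you've installed `@ai-sdk/openai` and set an `OPENAI_API_KEY` environment variable, neither of which is covered in this post:

```typescript
import { createOpenAI } from "@ai-sdk/openai";

// Assumes OPENAI_API_KEY is set in your environment.
const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Then pass openai("gpt-4o") as the model
// instead of google("gemini-2.0-flash-exp").
```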
const main = async () => {
const { text } = await generateText({
model: google("gemini-2.0-flash-exp"),
tools: { search },
maxSteps: 10,
system:
"You are a helpful assistant that can search the web for information. When you need to search the web, use the `search` tool to search the web for information. Once you have the information, return it in markdown format.",
prompt: "What's the current weather in Tokyo?",
});
console.log(text);
};
main();
Finally, we define the agent's system prompt and pass our question in as the prompt parameter. Now we can run the code and see our agent in action.
Searching for What's the current weather in Tokyo?
# Current Weather in Tokyo
🌡️ **Temperature:** 20°C (68°F)
☀️ **Conditions:** Clear skies
Final Code
Now that we've gone through the code step by step, here's the complete program with everything in one place:
import * as DDG from "duck-duck-scrape";
import { AgentMarkdown } from "agentmarkdown";
import { generateText, tool } from "ai";
import { z } from "zod";
import { createGoogleGenerativeAI } from "@ai-sdk/google";
const google = createGoogleGenerativeAI({
apiKey: process.env.GEMINI_API_KEY,
});
async function fetchAndConvertToMarkdown(url: string): Promise<string> {
try {
const response = await fetch(url);
const html = await response.text();
const markdown = await AgentMarkdown.render({ html });
return markdown.markdown;
} catch (error) {
console.error(`Error fetching ${url}:`, error);
return "";
}
}
const search = tool({
description: "Search the web for information",
parameters: z.object({
query: z.string(),
}),
execute: async ({ query }) => {
const count = 3;
console.log("Searching for", query);
const searchResults = await DDG.search(query, {
safeSearch: DDG.SafeSearchType.STRICT,
});
const result = searchResults.results.slice(0, count);
const urls = result.map((r) => r.url);
const markdownResults = await Promise.all(
urls.map(async (url) => {
const markdown = await fetchAndConvertToMarkdown(url);
return markdown;
})
);
const markdown = markdownResults.join("\n\n");
return markdown;
},
});
const main = async () => {
const { text } = await generateText({
model: google("gemini-2.0-flash-exp"),
tools: { search },
maxSteps: 10,
system:
"You are a helpful assistant that can search the web for information. When you need to search the web, use the `search` tool to search the web for information. Once you have the information, return it in markdown format.",
prompt: "What's the current weather in Tokyo?",
});
console.log(text);
};
main();
Conclusion
As you can see, building your own simple version of Perplexity is actually quite easy! All we needed was a way to search the web, read the results, and pass them as markdown to our LLM to answer our questions.
This, of course, is a very simplified version of Perplexity, and there are a lot of things we could add to make it more useful. For example, we could start by improving the web crawler to handle edge cases. We could break the agent down into multiple agents and have it select which of the search results to use instead of just taking the first three (see the sketch below). From there, we could add RAG (Retrieval Augmented Generation) to the mix, add a database to store search results, and a lot more.
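As one concrete example of letting the agent pick its own sources, you could split the work into two tools: one that only returns titles and URLs, and a second that fetches a single page on demand. The sketch below reuses `fetchAndConvertToMarkdown` from above; the `fetchPage` tool name and its wiring are my own illustration, not something from the code in this post:

```typescript
// A hypothetical second tool that lets the agent fetch a specific page itself.
const fetchPage = tool({
  description: "Fetch a web page and return its content as markdown",
  parameters: z.object({
    url: z.string().describe("The URL of the page to fetch"),
  }),
  execute: async ({ url }) => {
    return fetchAndConvertToMarkdown(url);
  },
});

// The search tool would then return only titles and URLs,
// and both tools get passed to generateText:
// tools: { search, fetchPage }
```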
If you're interested in learning more about building similar systems, I've written some other posts that might help:
Setting up Postgres and pgvector with Docker for building RAG applications - This post will guide you through setting up Postgres and pgvector using Docker, which is useful for RAG applications.
Creating AI Agents in Node Using the AI SDK - Here, you'll find out how to develop AI agents in Node with the help of the AI SDK.
How to Enrich Customer Data with LLMs and Web Crawling - This article explains how to use LLMs and web crawling to improve customer data.
If you have any questions or feedback, please let me know in the comments below.
Top comments (2)
Nice! Did you make some napkin math on how much that would cost compared to Perplexity?
Both Gemini and DuckDuckGo are free, so running this code costs nothing, at least at personal usage levels.