AIRabbit

Posted on Oct 22, 2024 • Edited on Nov 26, 2024

Crawl and Chat with any Website using Apify and GPT or LLAMA

#medium #repost

Navigating the vast amount of information on the web can be overwhelming, especially when you're looking for specific details. This challenge is exacerbated when a website offers limited search capabilities and relies solely on basic keyword searches. Fortunately, there's a transformative approach that can improve your interaction with online content.

What's more, many websites have vast amounts of content but lack the powerful search and AI-powered capabilities we've seen from ChatGPT, Perplexity, and others. If you are wondering why not use Perplexity or GPT search in the first place, check out my previous blog post about the shortcomings of these search tools when it comes to accuracy and level of detail (remember, they have to make the entire web searchable, so a low level of granularity is almost inevitable).

Sounds abstract? Let's take an example:

Imagine you want to understand how to build a solution in n8n (a workflow tool) and you do not know how to do it. You could spend some time navigating through the documentation (which is definitely useful to get familiar with the framework), but if you are short on time, maybe a short answer would suffice.

This is where AI and chatting with your website can be a game changer.

How does "Chatting with" work?
In a nutshell, the site's content is converted into embeddings (vector representations of the text), the passages most similar to your very specific question are retrieved, and an LLM answers from those passages. It works the way we know from GPT, but this time the answer is grounded in the complete documentation rather than in whatever the model happens to remember. This can become an incredibly powerful tool where data quality matters.
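To make the retrieval idea concrete, here is a minimal sketch. It is not how any particular tool implements it: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the sample passages are made up for illustration. What carries over to real systems is the ranking step, where passages are ordered by cosine similarity to the question.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


# Made-up passages standing in for crawled documentation pages.
passages = [
    "To create a canvas, open the command palette and choose Create new canvas.",
    "Sync keeps your notes up to date across devices.",
    "Plugins extend the editor with extra features.",
]
best = top_k("How do I create a new canvas?", passages, k=1)
```

A real embedding model would also match passages that share meaning but not words with the question, which is exactly what plain keyword search misses.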

Why Not Just Use the Site's Search Function?

Traditional site searches often fall short, especially when the information is buried deep within the site's structure. They are typically limited to keyword-based queries, which may not effectively guide you to the information you need. With the "Chat With" approach, you can interact with the website's content using natural language. This means asking detailed questions and receiving precise answers, enhancing your ability to find exactly what you're looking for.

Why Not Use ChatGPT Directly Without Crawling?

Directly asking an AI model without providing it with specific documentation can lead to inaccuracies or "hallucinations," where the model generates incorrect or misleading information. By crawling the website and feeding the actual content to the model, we reduce this risk, because the model answers from the real content rather than from memory. This grounding makes the responses more trustworthy and lets you check them against the original pages.

The Solution: Crawl + Embed + Chat

To effectively query unstructured but valuable information on the web, we use a combination of crawling, embedding, and chatting with an LLM (Large Language Model). While there are guides on how to do this with a file or folder, applying it to the complex structure of the web—complete with bot protection and intricate link navigation—requires a more sophisticated approach.

Key Considerations When Crawling Beyond a Single Web Page:

Handling JavaScript Rendering: Modern websites often rely heavily on JavaScript, which standard crawlers might not process correctly.
Bypassing Bot Protection: Utilizing proxies can help navigate around anti-bot mechanisms.
Depth of Link Navigation: Controlling how deep the crawler explores the site's link hierarchy.
Automatic Scrolling: Capturing content that loads upon scrolling, which is common in dynamic websites.
Extracting Relevant Text: Focusing on main content while skipping irrelevant elements like navigation menus.
And More: Each website may present unique challenges that require tailored solutions.
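The "depth of link navigation" point can be sketched without touching the network. The example below is only an illustration under stated assumptions: the `SITE` dictionary is a hypothetical link graph standing in for pages a real crawler would fetch, and `crawl_order` is a name I made up. It shows a breadth-first walk that stays on the starting host and stops following links at a chosen depth.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first walk that stays on the start host and stops at max_depth."""
    host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for nxt in links.get(url, []):
            # Skip pages already queued and links that leave the site.
            if nxt not in seen and urlparse(nxt).netloc == host:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical link graph standing in for real fetched pages.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/a/b"],
    "https://docs.example.com/a/b": ["https://docs.example.com/a/b/c"],
}
visited_pages = crawl_order("https://docs.example.com/", SITE, max_depth=2)
```

A hosted crawler handles the hard parts (JavaScript rendering, proxies, scrolling) for you, but the depth and scope settings you configure there follow the same idea.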

Implementing a Complete Solution

To build an effective system, we need to:

Crawl the Website: Gather all the relevant data from the site.
Extract and Clean Up Data: Process the data to remove unnecessary elements.
Embed for Vector Search: Convert the text into vectors (embeddings) so that passages can be retrieved by semantic similarity.
Conduct QA with LLM: Interact with the data using an AI model to get accurate answers.

While some commercial solutions offer AI search features for entire sites, they often lack the control needed for reliable results. You might not have control over the crawler, and the data processing steps like chunking and embedding may be limited. For critical applications, it's advisable to use specialized tools for each part of the process, avoiding reliance on a single vendor and gaining full control over your workflow.
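Chunking is one of the steps you may want to control yourself. As a rough sketch (the function name and the default sizes are my own choices, not taken from any tool), splitting text into fixed-size windows with some overlap could look like this:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("abcdefghij", size=4, overlap=2)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks. Splitting on headings or paragraphs instead of raw character counts is often a better fit for documentation, and owning this step is what lets you make that choice.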

An Example: Creating a RAG for Chatting with Product Documentation

Let's consider a practical example: creating a Retrieval-Augmented Generation (RAG) system to chat with a product's documentation—specifically when the website's search function is inadequate, but you need information that's present yet not easily accessible.

Imagine you're using Obsidian, a powerful note-taking application with extensive documentation. However, finding specific information through its website can be challenging due to limited search capabilities. By crawling the entire documentation site and feeding it into an LLM, you can interact with the content in natural language, asking specific questions and receiving precise answers.

Our Pipeline Consists of Two Main Parts

Part I: Gathering the Data

Crawl ➔ Extract/Cleanup

For this step, you can use crawlers like Zyte or Apify. I chose Apify because it allows quick crawling and data download within minutes. Platforms like Apify specialize in handling complex crawling tasks, including navigating JavaScript-rendered pages and bypassing bot protections.

Steps:

Set Up the Crawler on Apify:

Create a New Task: Open the Website Content Crawler Actor on Apify and create a new task, naming it (e.g., "Obsidian Documentation").
Configure the Start URL: Input https://help.obsidian.md as the starting point.
Set Crawling Patterns: Use patterns like https://help.obsidian.md/** to include all relevant pages within the documentation.
Select Output Format: Choose to save the output in Markdown format for easier processing later.
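If you prefer a script over the web console, the same settings can be expressed as Actor input. This is a sketch and not the route I took above: the field names (`startUrls`, `includeUrlGlobs`, `maxCrawlDepth`, `saveMarkdown`) and the Actor ID are written from my recollection of the Actor's input schema, so treat them as assumptions and check them against the Actor's input tab. The default depth of 20 is an arbitrary placeholder, and `run_crawl` needs the `apify-client` package and your own API token.

```python
def build_crawler_input(start_url: str, include_glob: str, max_depth: int = 20) -> dict:
    """Assemble input for the Website Content Crawler Actor.

    Field names are assumed from the Actor's input schema; verify before use.
    """
    return {
        "startUrls": [{"url": start_url}],
        "includeUrlGlobs": [{"glob": include_glob}],
        "maxCrawlDepth": max_depth,
        "saveMarkdown": True,  # ask for Markdown output alongside plain text
    }


def run_crawl(token: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    # Actor ID assumed; copy the exact one from the Actor's page.
    run = client.actor("apify/website-content-crawler").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


crawler_input = build_crawler_input(
    "https://help.obsidian.md",
    "https://help.obsidian.md/**",
)
```

Calling `run_crawl("<YOUR_APIFY_TOKEN>", crawler_input)` would start a real, billable crawl, so it is left out of the example.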

Run the Crawler:

Start the Task: Initiate the crawling process.
Monitor Progress: Watch as the crawler navigates through the website, capturing the necessary data.
Export the Data: Once completed, export the data in JSON format, which includes the Markdown content of each page.
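Before uploading, it can be worth a quick look at what the export contains. The sketch below assumes the export is a JSON array whose items carry `url` and `markdown` keys; those key names are my assumption about the export format, and the sample data is made up. It keeps only pages that actually have Markdown content.

```python
import json


def parse_export(raw_json: str) -> list[dict]:
    """Keep the URL and Markdown of every exported page that has content."""
    items = json.loads(raw_json)
    return [
        {"url": item.get("url", ""), "markdown": item["markdown"]}
        for item in items
        if item.get("markdown")  # skip pages with missing or empty Markdown
    ]


# Made-up sample standing in for the exported file.
sample = json.dumps([
    {"url": "https://docs.example.com/a", "markdown": "# Page A\n\nSome text."},
    {"url": "https://docs.example.com/empty", "markdown": ""},
])
docs = parse_export(sample)
```

For a real export you would read the downloaded file into a string first, for example with `pathlib.Path(...).read_text(encoding="utf-8")`, and pass that to `parse_export`.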

Part II: Analyzing and Utilizing the Data

Embedding ➔ Retrieval / QA with LLM

For this part, we'll use Open WebUI for embedding the data and interacting with it through an LLM.

If you do not have Open WebUI, you can use Obsidian with a GPT plugin instead. More on this in my previous blog post, where I introduce the Obsidian GPT plugins. This is just one of many ways to use the data you have just collected: any tool that supports RAG or embeddings will do.

Steps:

Prepare the Data in Open WebUI:

Access the Knowledge Section: Navigate to Workspace ➔ Knowledge in Open WebUI.
Create a New Knowledge Base: Click on the plus sign to add a new knowledge base, naming it (e.g., "Obsidian Doc").
Upload the Data: Import the JSON file exported from Apify.
Wait for Embedding: Allow the system to process and embed the data. This might take a few minutes.

Interact with the Knowledge Base:

Start a New Chat: Open a new chat session in Open WebUI.
Select the Knowledge Base: Use the command #Obsidian Doc to select your knowledge base.
Ask Specific Questions: Pose your questions in natural language (e.g., "How can I add a canvas in Obsidian?").
Receive Accurate Answers: The LLM provides detailed responses with references to the original documentation.
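Under the hood, answers with references usually come from a prompt that places the retrieved passages, together with their source URLs, in front of the question. The sketch below is a generic illustration of that pattern, not Open WebUI's actual prompt template; the function name, the wording of the instructions, and the sample passage are all hypothetical.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: numbered sources first, then the question."""
    sources = "\n\n".join(
        f"[{i}] {c['url']}\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n]. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunk standing in for real search results.
retrieved = [
    {"url": "https://docs.example.com/canvas", "text": "Hypothetical passage about canvases."},
]
prompt = build_prompt("How can I add a canvas?", retrieved)
```

Because every passage carries its URL, the model can point back to the page it drew from, which is what makes the answers checkable.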

Benefits of This Approach:

Accuracy: By providing the LLM with the actual website content, you reduce the risk of inaccuracies and can trace each answer back to its source page.
Efficiency: Quickly find specific information without manually searching through numerous pages.
Control: Customize the crawling and data processing to suit your specific needs.

Conclusion

In this tutorial, we've demonstrated how to:

Crawl an Entire Website: Using Apify to gather all relevant data from a site with limited search functionality.
Create a Knowledge Base: Processing and embedding the data in Open WebUI for easy access.
Interact Using Natural Language: Leveraging an LLM to ask questions and receive accurate, detailed answers grounded in the actual content.

This method empowers you to chat with any website, combining the wealth of information available online with the advanced capabilities of AI models like GPT. It's especially useful for accessing information that's present but not easily discoverable through traditional search functions.

Happy Crawling!

If you found this guide helpful and would like to support my work, feel free to buy me a coffee ☕ ☕☕ 😉.
