In part 1 I went over a basic approach to scraping data from any website into any desired format using AI. This part covers the next steps: improving performance and reducing costs.
This part is a bit shorter because I no longer have the time to dig deep into these subjects, but if you're on a similar path and enjoyed reading part 1, I hope these 'field notes' can serve as inspiration for your own research. If you have any tips or questions, feel free to leave a comment below.
Reduce costs & optimize performance
As mentioned in part 1, using OpenAI's models as-is can get quite expensive. This section covers the different strategies I've found to reduce costs.
Convert HTML to LLM-friendly text
HTML elements contain quite a lot of 'bloat' by design. All of that extra markup isn't very useful to the model, but it does count towards our token spend. To avoid that, we can convert the HTML into a different format (like Markdown) that is not only smaller but also easier for the model to parse.
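To get a feel for the savings, you can count tokens before and after conversion. Here's a rough sketch using tiktoken; the HTML and Markdown snippets are placeholders, and cl100k_base is just one encoding to estimate with.

```python
# Rough token-count comparison: raw HTML vs. the same content as Markdown.
# The strings below are placeholders; swap in your own page and its converted output.
import tiktoken

raw_html = '<div class="product"><span class="price">€19.99</span></div>'  # placeholder HTML
markdown = "€19.99"  # the same information after conversion (placeholder)

enc = tiktoken.get_encoding("cl100k_base")
print("HTML tokens:    ", len(enc.encode(raw_html)))
print("Markdown tokens:", len(enc.encode(markdown)))
```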
Jina Reader
Using the Jina Reader API (or by self-hosting its models), we can convert the HTML to Markdown:
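Here's a minimal sketch of the hosted API: you prefix the target URL with https://r.jina.ai/ and get Markdown back. The target URL below is a placeholder, and auth and rate limits depend on your plan, so check the Jina docs for specifics.

```python
# Minimal sketch: fetch an LLM-friendly Markdown version of a page via Jina Reader.
import requests

target_url = "https://example.com/some-product-page"  # placeholder page to scrape
response = requests.get(f"https://r.jina.ai/{target_url}", timeout=30)
response.raise_for_status()

markdown = response.text  # Markdown instead of raw HTML
print(markdown[:500])
```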
As you can see, there's way less bloat in there compared to the raw HTML output. Pass this to the model and it should still work while costing you fewer tokens.
Firecrawl
Another solution is Firecrawl, an open-source project to crawl, scrape and clean your data. They offer a hosted paid version, but its core features are free and open source on GitHub, so you could set up your own instance for free.
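Below is a rough sketch against the hosted scrape endpoint; if you self-host, you'd point the request at your own instance instead. The exact request and response fields can differ between Firecrawl versions, so treat this as illustrative and check the docs for your setup.

```python
# Rough sketch: ask Firecrawl to scrape a page and return cleaned Markdown.
import requests

FIRECRAWL_API_KEY = "fc-..."  # placeholder key

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}"},
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()

# Response shape may vary by version; here we expect the Markdown under data.markdown.
markdown = resp.json().get("data", {}).get("markdown")
print(markdown)
```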
Crawl4ai
If speed is your top priority, or Firecrawl simply isn't your thing, the completely free and open-source crawl4ai project might be a better option to look into.
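Here's a minimal sketch of its async crawler, based on crawl4ai's quickstart; API details can shift between releases, so take it as a starting point rather than a reference.

```python
# Minimal sketch: crawl a page with crawl4ai and get Markdown back for the LLM.
import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")  # placeholder URL
        print(result.markdown)  # Markdown ready to feed to the model


asyncio.run(main())
```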
Different models
OpenAI isn't the only player in the AI game; there are more affordable options available from other providers and open models.
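Many of these providers (and local runtimes) expose an OpenAI-compatible API, so switching is often just a matter of pointing the client at a different base URL and model. The provider URL, key and model name below are placeholders, not recommendations, and JSON mode only works if the provider supports it.

```python
# Sketch: reuse the OpenAI client against a different, OpenAI-compatible provider.
from openai import OpenAI

markdown_page = "€19.99 ..."  # placeholder: the cleaned Markdown from earlier

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical provider endpoint
    api_key="YOUR_PROVIDER_KEY",  # placeholder key
)

completion = client.chat.completions.create(
    model="some-cheaper-model",  # placeholder model name
    response_format={"type": "json_object"},  # requires the provider to support JSON mode
    messages=[
        {"role": "system", "content": "Extract the product name and price as JSON."},
        {"role": "user", "content": markdown_page},
    ],
)
print(completion.choices[0].message.content)
```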
There are plenty of options to choose from, but not all of them work well with structured/JSON output mode (yet). Let me know in the comments which provider ended up working well for you!
Other challenges
The real challenges that come with a project like this are the following:
- Scraping websites that are protected by a WAF (Web Application Firewall), dealing with "anti bot" challenges, captchas etc.
- Avoiding bans by rotating proxies
- Keeping AI/LLM costs low
- Handling bad data or AI hallucinations (see the validation sketch below this list)
- Scaling the infrastructure
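For the bad-data and hallucination point, one simple safeguard is validating the model's JSON output against a schema before storing it. Here's a minimal sketch with pydantic; the Product fields are just an example of what you might extract.

```python
# Sketch: validate LLM output against a schema instead of trusting it blindly.
import json

from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: float
    currency: str


raw_output = '{"name": "Example widget", "price": "19.99", "currency": "EUR"}'  # placeholder LLM output

try:
    product = Product.model_validate(json.loads(raw_output))
    print(product)
except (json.JSONDecodeError, ValidationError) as err:
    # Reject (or retry) the extraction instead of storing garbage.
    print(f"Bad LLM output: {err}")
```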
So please think twice before you start yet another business around this (very saturated) idea ;)
Hi 👋 thanks for reading! If you enjoyed reading my content, consider following me on Twitter to stay in the loop ❤️.