I created my own search engine

Hello everyone,

I'm excited to share some new features and improvements in my custom search engine project. This search engine is designed to work seamlessly with my web crawler, providing efficient and accurate search results. Let's dive into the latest updates!

@aminnairi previously asked why I don't use a NoSQL database. Both the search engine and the web crawler now use MongoDB as a NoSQL database, which leads to faster search results.
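For illustration, what typically makes such lookups fast in MongoDB is a text index on the crawled pages. A minimal pymongo sketch (the collection and field names are my assumptions, not necessarily the project's):

```python
from pymongo import MongoClient, TEXT

# Hypothetical collection layout: one document per crawled page.
collection = MongoClient("mongodb://localhost:27017")["search_engine"]["pages"]

# A text index lets MongoDB answer $text queries without scanning every document.
collection.create_index([("title", TEXT), ("content", TEXT)])
```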

The AI is still Llama-based, now llama-3.3-70b.
In the search results, you can right-click a result to display a preview of the website. In addition, favicons are only loaded once all search results have loaded successfully, and they are cached locally for a short time so they do not have to be fetched again on every search.
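To sketch how such a favicon cache can work (the cache directory and the /favicon.ico fallback are my assumptions, not the project's exact code):

```python
import hashlib
import os
from urllib.parse import urlparse

import requests

CACHE_DIR = "favicon_cache"  # hypothetical local cache directory
os.makedirs(CACHE_DIR, exist_ok=True)

def get_favicon(page_url: str) -> bytes:
    """Return a site's favicon, fetching it over the network only on a cache miss."""
    host = urlparse(page_url).netloc
    cache_path = os.path.join(CACHE_DIR, hashlib.sha256(host.encode()).hexdigest() + ".ico")

    # Serve from the local cache if this site's favicon was already fetched.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()

    # Otherwise fetch the conventional /favicon.ico location and cache it.
    response = requests.get(f"https://{host}/favicon.ico", timeout=5)
    response.raise_for_status()
    with open(cache_path, "wb") as f:
        f.write(response.content)
    return response.content
```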


GitHub: SchBenedikt / search-engine

The matching search engine to my web crawler.

Note: the Docker image on Docker Hub (https://hub.docker.com/r/schbenedikt/search) is currently not working.

Features

  • Display of the search speed (see the timing sketch after this list).
  • Ask AI for help.
  • Uses MongoDB for database operations.
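As a rough illustration of the first feature, the search speed can be measured by timing the database query and passing the elapsed time to the template. A minimal Flask sketch (the route, template, and collection names are assumptions):

```python
import time

from flask import Flask, render_template, request
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient("mongodb://localhost:27017")["search_engine"]["pages"]

@app.route("/search")
def search():
    query = request.args.get("q", "")
    start = time.perf_counter()
    # Assumes a text index exists on the collection (see the index sketch above).
    results = list(collection.find({"$text": {"$search": query}}))
    elapsed = time.perf_counter() - start
    # The template can then show e.g. "12 results in 0.013 s".
    return render_template("results.html", results=results, elapsed=elapsed)
```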

Docker Instructions

Building the Docker Image

To build the Docker image, run the following command in the root directory of the repository:

```
docker build -t ghcr.io/schbenedikt/search-engine:latest .
```

Running the Docker Container

To run the Docker container, use the following command:

```
docker run -p 5560:5560 ghcr.io/schbenedikt/search-engine:latest
```

This will start the Flask application using Gunicorn as the WSGI server, and it will be accessible at http://localhost:5560.

Pulling the Docker Image

The Docker image is publicly accessible. To pull the Docker image from GitHub Container Registry, use the following command:

```
docker pull ghcr.io/schbenedikt/search-engine:latest
```

Note

Ensure that the tags field in the GitHub Actions workflow is correctly set to ghcr.io/schbenedikt/search-engine:latest to avoid multiple packages.

Running with Docker Compose

To run…

Databases can now be managed via the settings page, and it is now possible to add multiple databases at the same time. When a search is made, the system checks for websites that are stored more than once, so the same website is not displayed multiple times.
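A possible sketch of this deduplication across several configured databases (the database names and the url field are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
databases = ["search_engine", "search_engine_2"]  # hypothetical configured databases

def search_all(query: str) -> list[dict]:
    """Query every configured database and drop duplicate URLs."""
    seen_urls = set()
    merged = []
    for db_name in databases:
        collection = client[db_name]["pages"]
        for doc in collection.find({"$text": {"$search": query}}):
            # Skip pages that are stored in more than one database.
            if doc["url"] not in seen_urls:
                seen_urls.add(doc["url"])
                merged.append(doc)
    return merged
```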

This brings us to the filter functions:
The metadata is used to retrieve the various website types, which can then be used for filtering. However, since the same type can appear in slightly different spellings, for example "website" and "Website", these variants are combined into an "all websites" type.
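One way to fold such spelling variants together (a sketch; the exact mapping is an assumption):

```python
def normalize_type(page_type: str | None) -> str:
    """Collapse spelling variants of the type metadata into one filter entry."""
    if page_type is None:
        return "all websites"
    cleaned = page_type.strip().lower()
    # "website", "Website", "WEBSITE", ... all end up in the same filter.
    return "all websites" if cleaned == "website" else cleaned
```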

GitHub: SchBenedikt / web-crawler

A simple web crawler using Python that stores the metadata and main content of each web page in a database.

Purpose and Functionality

The web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, type, and main content, and store this information in a MongoDB database. The crawler can handle multiple levels of depth and respects the robots.txt rules of the websites it visits.
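In outline, one crawl step might look like this (a simplified sketch of the described behavior, not the project's exact code):

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

collection = MongoClient("mongodb://localhost:27017")["search_engine"]["pages"]

def crawl_page(url: str) -> None:
    """Fetch one page, extract its metadata and main content, and store it."""
    # Respect robots.txt before fetching the page itself.
    parsed = urlparse(url)
    robots = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch("*", url):
        return

    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def meta(prop: str) -> str | None:
        tag = soup.find("meta", property=prop)
        return tag.get("content") if tag else None

    # Upsert so that re-crawling a page updates it instead of duplicating it.
    collection.update_one(
        {"url": url},
        {"$set": {
            "url": url,
            "title": soup.title.string if soup.title else None,
            "description": meta("og:description"),
            "image": meta("og:image"),
            "locale": meta("og:locale"),
            "type": meta("og:type"),
            "content": soup.get_text(" ", strip=True),
        }},
        upsert=True,
    )
```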

Dependencies

The project requires the following dependencies:

  • requests
  • beautifulsoup4
  • pymongo

You can install the dependencies using the following command:

```
pip install -r requirements.txt
```

Setting Up and Running the Web Crawler

  1. Clone the repository:

```
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
```

  2. Install the dependencies:

```
pip install -r requirements.txt
```

  3. Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine.

  4. Run the web crawler:

```
python
```

Instrument Sans is now used as the default font, and there is also a dark mode that is activated automatically when the system is set to dark mode.


What do you think of the project?
Feel free to let me know in the comments!
