NEW POST:
I have created my own search engine and I'll tell you how it works.
First of all, I created my own web crawler. It's written in Python and stores various metadata in a MySQL database.
SchBenedikt / web-crawler
A simple web crawler using Python that stores the metadata and main content of each web page in a database.
Purpose and Functionality
The web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, type, and main content, and store this information in a MongoDB database. The crawler can handle multiple levels of depth and respects the robots.txt rules of the websites it visits.
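To make that flow concrete, here is a minimal sketch of how such a crawl-and-store step could look with requests, BeautifulSoup and pymongo. The function names, field names and collection name are my assumptions for illustration, not necessarily the repo's actual code.

```python
# Minimal sketch of the crawl-and-store flow described above.
# Names (crawl, get_meta, "pages" collection) are assumptions, not the repo's code.
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient("localhost", 27017)          # MongoDB, as described in the README
collection = client["search_engine"]["pages"]
visited = set()

def get_meta(soup, name):
    # Look for <meta property="..."> (Open Graph) or <meta name="...">
    tag = soup.find("meta", attrs={"property": name}) or soup.find("meta", attrs={"name": name})
    return tag.get("content") if tag else None

def crawl(url, depth=1):
    if depth < 0 or url in visited:
        return
    visited.add(url)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Store metadata (title, description, image, locale, type) and the main content
    collection.insert_one({
        "url": url,
        "title": soup.title.string if soup.title else None,
        "description": get_meta(soup, "og:description") or get_meta(soup, "description"),
        "image": get_meta(soup, "og:image"),
        "locale": get_meta(soup, "og:locale"),
        "type": get_meta(soup, "og:type"),
        "content": soup.get_text(" ", strip=True),
    })

    # Follow links up to the configured depth
    for link in soup.find_all("a", href=True):
        if link["href"].startswith("http"):
            crawl(link["href"], depth - 1)

crawl("https://example.com", depth=1)
```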
Dependencies
The project requires the following dependencies:
- requests
- beautifulsoup4
- pymongo
You can install the dependencies using the following command:
pip install -r requirements.txt
Setting Up and Running the Web Crawler
- Clone the repository:
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
- Install the dependencies:
pip install -r requirements.txt
- Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine.
- Run the web crawler:
python
…

The search engine, which I created with Bootstrap, then retrieves the results from MySQL. The bottom right of the results page always shows how long the query took.
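For illustration, here is a minimal sketch of how a Flask route could time the lookup and hand the elapsed time to the results page. The route, template and helper names are assumptions, and the database query is replaced by a small in-memory stub so the snippet stays self-contained.

```python
# Sketch: timing a search query in Flask and displaying the elapsed time.
# run_search() is a stand-in for the real database lookup in the project.
import time
from flask import Flask, request, render_template_string

app = Flask(__name__)

PAGES = [
    {"title": "Example Domain", "url": "https://example.com", "description": "Example page"},
]

def run_search(query):
    q = query.lower()
    return [p for p in PAGES if q in p["title"].lower() or q in p["description"].lower()]

@app.route("/search")
def search():
    query = request.args.get("q", "")
    start = time.perf_counter()
    results = run_search(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # The elapsed time can then be rendered in the bottom right of the results page.
    return render_template_string(
        "<p>{{ results|length }} results</p>"
        "<p style='text-align:right'>{{ '%.1f'|format(elapsed_ms) }} ms</p>",
        results=results, elapsed_ms=elapsed_ms,
    )

if __name__ == "__main__":
    app.run(port=5560)
```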
SchBenedikt / search-engine
The matching search engine to my web crawler.
The Docker image at https://hub.docker.com/r/schbenedikt/search is currently not working.
Features
- Display of the search speed.
- Ask AI for help.
- Uses MongoDB for database operations (see the query sketch below).
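Given the MongoDB feature above, the lookup could be as simple as a case-insensitive regex query over the stored fields. This is only a hedged sketch with guessed collection and field names, not the project's actual query logic.

```python
# Sketch: a case-insensitive search over the crawled pages stored in MongoDB.
# Collection and field names ("pages", title/description/content) are guesses.
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["search_engine"]["pages"]

def search_pages(query, limit=20):
    # Match the query against title, description or page content.
    pattern = {"$regex": query, "$options": "i"}
    cursor = collection.find(
        {"$or": [{"title": pattern}, {"description": pattern}, {"content": pattern}]},
        {"url": 1, "title": 1, "description": 1},
    ).limit(limit)
    return list(cursor)

print(search_pages("python"))
```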
Docker Instructions
Building the Docker Image
To build the Docker image, run the following command in the root directory of the repository:
docker build -t ghcr.io/schbenedikt/search-engine:latest .
Running the Docker Container
To run the Docker container, use the following command:
docker run -p 5560:5560 ghcr.io/schbenedikt/search-engine:latest
This will start the Flask application using Gunicorn as the WSGI server, and it will be accessible at http://localhost:5560.
Pulling the Docker Image
The Docker image is publicly accessible. To pull the Docker image from GitHub Container Registry, use the following command:
docker pull ghcr.io/schbenedikt/search-engine:latest
Note
Ensure that the tags field in the GitHub Actions workflow is correctly set to ghcr.io/schbenedikt/search-engine:latest to avoid multiple packages.
Running with Docker Compose
To run…
Please note that robots.txt files are not currently taken into account, which is why you shouldn't simply crawl every page.
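If robots.txt support gets added later, Python's built-in urllib.robotparser would be one straightforward option. The snippet below is only a sketch of that idea, not code from the project.

```python
# Sketch: checking robots.txt before crawling a URL, using the standard library.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(url, user_agent="web-crawler"):
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)

print(allowed_to_crawl("https://example.com/"))
```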
What do you think of the project?
Feel free to let me know in the comments!
Top comments (7)
Pretty daunting task to create a search engine but I welcome the effort and wish you the best.
Any reasons not to use a NoSQL or a Graph database for this kind of project? Not criticizing your work, just curious.
I don't have any experience with the other database systems yet, so it's easiest to use this one. Do you have any other recommendations?
I think you might like MongoDB, plus it is a great fit for fast read/write access and you get to host your database for free on MongoAtlas in the cloud (their free tier is very generous).
Good luck with your project!
Since I have my own server and domain, I am not currently looking for a free provider. If MongoDB is really faster, I will definitely come back to it. At the moment, however, I would like to improve the search algorithm.
Cool👍👍
How many rows of data have you stored so far?
It's about 44000, but I'm not going to expand the list because it's still under development at the moment.