Hello everyone,
I'm excited to share some new features and improvements in my custom search engine project. This search engine is designed to work seamlessly with my web crawler, providing efficient and accurate search results. Let's dive into the latest updates!
@aminnairi has already asked why I don't use a NoSQL database. The search engine and the web crawler now use MongoDB as a NoSQL database, which leads to faster search results.
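Roughly, the idea looks like this (a simplified sketch in Python with pymongo; the `pages` collection and the `title`/`description`/`url` fields are illustrative names, not necessarily the actual schema):

```python
# Sketch of a MongoDB-backed search query (pymongo).
# Collection and field names are illustrative, not the project's real schema.
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
pages = client["search_engine"]["pages"]

# A text index lets MongoDB rank matches by relevance score.
pages.create_index([("title", TEXT), ("description", TEXT)])

def search(query, limit=10):
    cursor = (
        pages.find(
            {"$text": {"$search": query}},
            {"score": {"$meta": "textScore"}, "url": 1, "title": 1, "description": 1},
        )
        .sort([("score", {"$meta": "textScore"})])
        .limit(limit)
    )
    return list(cursor)
```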
The AI is still based on Llama, now llama-3.3-70b.
In the search results, you can right-click a result to display a preview of the website. In addition, the favicons are only loaded once all search results have been loaded successfully. They are temporarily cached locally so that they do not have to be retrieved again each time.
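Conceptually, the favicon caching works roughly like this (a simplified sketch; the route name, the cache directory, and proxying the icons through Flask are illustrative choices, not the exact implementation):

```python
# Sketch of a server-side favicon cache (Flask + requests).
# Route name, cache directory, and the proxy approach are illustrative.
import hashlib
import os

import requests
from flask import Flask, abort, send_file

app = Flask(__name__)
CACHE_DIR = "favicon_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

@app.route("/favicon/<path:domain>")
def favicon(domain):
    # Cache key derived from the domain so each icon is fetched only once.
    path = os.path.join(CACHE_DIR, hashlib.sha256(domain.encode()).hexdigest() + ".ico")
    if not os.path.exists(path):
        resp = requests.get(f"https://{domain}/favicon.ico", timeout=5)
        if resp.status_code != 200:
            abort(404)
        with open(path, "wb") as f:
            f.write(resp.content)
    return send_file(path, mimetype="image/x-icon")
```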
SchBenedikt / search-engine
The matching search engine to my web crawler.
The Docker image at https://hub.docker.com/r/schbenedikt/search is currently not working.
Features
- Display of the search speed.
- Ask AI for help.
- Uses MongoDB for database operations.
Docker Instructions
Building the Docker Image
To build the Docker image, run the following command in the root directory of the repository:
docker build -t ghcr.io/schbenedikt/search-engine:latest .
Running the Docker Container
To run the Docker container, use the following command:
docker run -p 5560:5560 ghcr.io/schbenedikt/search-engine:latest
This will start the Flask application using Gunicorn as the WSGI server, and it will be accessible at http://localhost:5560.
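A Gunicorn start command for a Flask app on port 5560 typically looks like this (the module path app:app and the worker count are illustrative, not necessarily what the image uses):
gunicorn --bind 0.0.0.0:5560 --workers 2 app:app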
Pulling the Docker Image
The Docker image is publicly accessible. To pull the Docker image from GitHub Container Registry, use the following command:
docker pull ghcr.io/schbenedikt/search-engine:latest
Note
Ensure that the tags field in the GitHub Actions workflow is correctly set to ghcr.io/schbenedikt/search-engine:latest to avoid multiple packages.
Running with Docker Compose
To run…
The databases can now be managed via the settings page, and there is now also the option to add multiple databases at the same time. When a search is made, the system checks whether any websites are saved in more than one database so that the same website is not displayed multiple times.
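The deduplication idea, as a simplified sketch (the database names, the `pages` collection, and using the URL as the key are illustrative assumptions):

```python
# Sketch: query several MongoDB databases and deduplicate results by URL.
# Database list, collection name, and the URL-based key are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
DATABASES = ["search_engine", "search_engine_backup"]  # hypothetical names

def search_all(query, limit=10):
    seen_urls = set()
    results = []
    for db_name in DATABASES:
        pages = client[db_name]["pages"]
        for doc in pages.find({"$text": {"$search": query}}).limit(limit):
            url = doc.get("url")
            if url in seen_urls:
                continue  # same website already found in another database
            seen_urls.add(url)
            results.append(doc)
    return results[:limit]
```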
This brings us to the filter functions:
The metadata is used to retrieve the various website types, which can then be used for filtering. However, since essentially identical type values can appear in slightly different variants (for example "website" written in different ways), these can be combined into an "all websites" type.
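As a simplified sketch, the type normalization could look like this (the exact rules and the "all websites" bucket name are illustrative):

```python
# Sketch: normalize og:type metadata values so near-duplicates share one filter.
# The normalization rules and the "all websites" bucket are illustrative.
def normalize_type(raw_type):
    if not raw_type:
        return "all websites"
    t = raw_type.strip().lower()
    # Treat all "website"-like variants as one combined filter value.
    if t.startswith("website"):
        return "all websites"
    return t

def filter_results(results, selected_type):
    return [r for r in results if normalize_type(r.get("type")) == selected_type]
```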
SchBenedikt / web-crawler
A simple web crawler using Python that stores the metadata and main content of each web page in a database.
Purpose and Functionality
The web crawler is designed to crawl web pages starting from a base URL, extract metadata such as title, description, image, locale, type, and main content, and store this information in a MongoDB database. The crawler can handle multiple levels of depth and respects the robots.txt rules of the websites it visits.
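In simplified form, that crawl loop looks roughly like this (function names, the collection name, and the exact metadata fields are illustrative, not the project's actual code):

```python
# Sketch of the crawl loop described above (requests + BeautifulSoup + pymongo).
# Names and fields are illustrative, not the project's actual code.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["search_engine"]["pages"]

def allowed_by_robots(url, user_agent="web-crawler"):
    # Respect the site's robots.txt before fetching the page.
    parsed = urlparse(url)
    rp = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # if robots.txt is unreachable, fall back to crawling
    return rp.can_fetch(user_agent, url)

def crawl(url, depth=2):
    if depth < 0 or not allowed_by_robots(url):
        return
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    def meta(prop):
        tag = soup.find("meta", property=prop)
        return tag["content"] if tag and tag.has_attr("content") else None

    # Store (or update) the page's metadata and main text content.
    pages.update_one(
        {"url": url},
        {"$set": {
            "url": url,
            "title": soup.title.string if soup.title else None,
            "description": meta("og:description"),
            "image": meta("og:image"),
            "locale": meta("og:locale"),
            "type": meta("og:type"),
            "content": soup.get_text(" ", strip=True),
        }},
        upsert=True,
    )
    # Follow links on the page, one level deeper.
    for a in soup.find_all("a", href=True):
        crawl(urljoin(url, a["href"]), depth - 1)
```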
Dependencies
The project requires the following dependencies:
- requests
- beautifulsoup4
- pymongo
You can install the dependencies using the following command:
pip install -r requirements.txt
Setting Up and Running the Web Crawler
- Clone the repository:
git clone https://github.com/schBenedikt/web-crawler.git
cd web-crawler
- Install the dependencies:
pip install -r requirements.txt
- Ensure that MongoDB is running on your local machine. The web crawler connects to MongoDB at localhost:27017 and uses a database named search_engine.
- Run the web crawler:
python
…Instrument Sans is now used as the default font, and there is also a dark mode that is activated automatically when the system is in dark mode.
What do you think of the project?
Feel free to let me know in the comments!