What is Database Indexing?
Database indexing is a technique that makes searching and retrieving data from a database faster. It works like the index at the back of a large book: instead of reading every page, you look up a term and jump straight to where it appears.
Indexing speeds up SELECT queries and WHERE clauses. On the other hand, it slows down INSERT, UPDATE, and DELETE queries, because every index must be updated whenever the underlying data changes.
Fig: Database Index Data Structure
Why Indexing?
Imagine you have a database of books, and you want to find all the books that have the word "programming" in the title. Without an index, the database would have to scan every row in the table to find the books that match the search criteria. This could take a long time, especially if there are a lot of books in the table.
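However, if you create an index on the title column, the database can quickly find the rows that match the search criteria. The index is a separate data structure that stores the values of the title column in sorted order, together with pointers to the matching rows. Note that a plain sorted index helps most with exact or prefix matches (for example, titles that start with "Programming"); finding a word anywhere inside the title usually calls for a specialized index such as a full-text index.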
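To make the difference between a full scan and an index lookup concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a small stand-in for a real database server, and the books table, its contents, and the index name idx_books_title are all made up for illustration. It asks the query planner how it would run the same exact-match query before and after the index exists.

```python
import sqlite3

# In-memory SQLite database: a stand-in chosen so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [(f"Book {i:05d}",) for i in range(10_000)] + [("Programming in C",)],
)

def plan(sql):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text

query = "SELECT id FROM books WHERE title = 'Programming in C'"

plan_before = plan(query)  # no index yet: the table has to be scanned
conn.execute("CREATE INDEX idx_books_title ON books (title)")
plan_after = plan(query)   # the planner can now search the index instead

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of the table and the second should mention idx_books_title.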
Indexing A Table With 50 Million Rows
For this example, we will create a database pg-million in PostgreSQL containing a table customers with columns first_name, last_name, mobile_no, and country.
Insert 50 million rows of random data
CREATE TABLE customers (first_name VARCHAR(50), last_name VARCHAR(50), mobile_no INTEGER, country VARCHAR(50));
-- Generate 50 million rows: random 10-character names, a random number, and one of two countries.
INSERT INTO customers (first_name, last_name, mobile_no, country)
SELECT substr(md5(random()::text), 1, 10),
       substr(md5(random()::text), 1, 10),
       (random() * 70 + 10)::integer,
       (CASE WHEN random() < 0.5 THEN 'India' ELSE 'United Kingdom' END)
FROM generate_series(1, 50000000);
Create an index on the country column
We create an index on the country column to have a well-organized list that lets us quickly locate all the customers from a particular country without searching through the entire list.
CREATE INDEX idx_partial_country ON customers (country) WHERE country IN ('India', 'United Kingdom');
Time to create index: 2m 2s
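For this example, we are using a partial index. A partial index is created with a condition that selects which rows are indexed. This allows the database to index only the relevant rows, reducing the index size and improving query performance for those specific values. Note that in this dataset every row has a country of either India or United Kingdom, so the condition matches the whole table; the size benefit appears when the condition excludes a meaningful share of the rows.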
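The idea of a partial index can be tried out on a small scale. The sketch below again uses Python's sqlite3 module as a stand-in (SQLite also supports a WHERE clause on CREATE INDEX); the three-country dataset and the index name idx_partial_uk are assumptions made for illustration. Unlike the table above, only a third of the rows satisfy the index condition, so only those rows receive index entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, country TEXT)")
countries = ["India", "United Kingdom", "France"]
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(f"name{i}", countries[i % 3]) for i in range(3000)],
)

# Partial index: only rows with country = 'United Kingdom' get an index entry.
conn.execute(
    "CREATE INDEX idx_partial_uk ON customers (country) "
    "WHERE country = 'United Kingdom'"
)

# PRAGMA index_list returns (seq, name, unique, origin, partial) for each index.
index_flags = {row[1]: row[4] for row in conn.execute("PRAGMA index_list('customers')")}
is_partial = bool(index_flags["idx_partial_uk"])

total_rows = conn.execute("SELECT count(*) FROM customers").fetchone()[0]
covered_rows = conn.execute(
    "SELECT count(*) FROM customers WHERE country = 'United Kingdom'"
).fetchone()[0]

print(f"partial index: {is_partial}, covers {covered_rows} of {total_rows} rows")

# Show how the planner intends to run a query whose condition matches the index.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT first_name FROM customers WHERE country = 'United Kingdom'"
):
    print(row[3])
```

A query can only use a partial index when its own condition guarantees that all matching rows are inside the index, which is why the example query repeats the same country filter.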
Note: The syntax for creating indexes and types of indexes differs among different databases. You should use appropriate syntax and index type depending on your database and use-case.
Measuring Query Execution Time Before and After Indexing
Consider the following query
SELECT * FROM customers WHERE country='United Kingdom';
Query Execution Time without index: 41836.270 ms
Query Execution Time with index: 24254.644 ms
Improvement in query execution time: ~42.02%
(For better understanding, you can find all the code here.)
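For reference, the improvement figure can be derived from the two timings reported above. This short sketch only does the arithmetic; the variable names are illustrative.

```python
# Timings reported above, in milliseconds.
before_ms = 41836.270  # without index
after_ms = 24254.644   # with index

# Share of the original execution time that was saved.
improvement_pct = (before_ms - after_ms) / before_ms * 100

# The same result expressed as a speedup factor.
speedup = before_ms / after_ms

print(f"{improvement_pct:.2f}% less time, {speedup:.2f}x faster")
# prints "42.02% less time, 1.72x faster"
```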
How Well Are The Indexes Performing?
It is important to gain insights into index effectiveness. A few helpful metrics include:
Index Usage Statistics: Monitor the usage of indexes to understand which indexes are actively contributing to query performance. (Ex: Track how often each index is scanned and how large it is, as rarely used or very large indexes cost disk space and I/O without a matching benefit.)
Query Performance Metrics: Monitor query execution times and response times for queries that involve indexed columns. (Ex: A sudden increase in query execution time may indicate index-related issues.)
Index Maintenance Metrics: Regularly assess the health of indexes and their impact on database operations. (Ex: Track index bloat, which occurs when indexes become inefficient due to excessive insertions, updates, or deletions.)
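As a rough illustration of the usage metric above, the sketch below checks which indexes a set of representative queries would actually use. It relies on Python's sqlite3 module and query plans as a stand-in; the table, the index names, and the workload list are assumptions for illustration. In PostgreSQL, the pg_stat_user_indexes statistics view serves a similar purpose by reporting, among other things, a scan count (idx_scan) per index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_country ON customers (country)")
conn.execute("CREATE INDEX idx_last_name ON customers (last_name)")

# Hypothetical workload: the queries the application runs most often.
workload = [
    "SELECT * FROM customers WHERE country = 'India'",
    "SELECT count(*) FROM customers WHERE country = 'United Kingdom'",
]

index_names = [row[1] for row in conn.execute("PRAGMA index_list('customers')")]
usage = {name: 0 for name in index_names}

for sql in workload:
    # Join the plan's detail column and count every index it mentions.
    details = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))
    for name in index_names:
        if name in details:
            usage[name] += 1

unused = sorted(name for name, hits in usage.items() if hits == 0)
print("usage per index:", usage)
print("never used by this workload:", unused)
```

An index that no frequent query uses still has to be maintained on every write, so it is a candidate for review.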
When To Use Indexing?
Frequent Search Queries: Use indexing when you frequently search for specific data in a large dataset. It helps to find the desired information quickly.
Performance Improvement: Indexing can improve the speed of data retrieval operations, especially for complex queries, by avoiding scanning the entire dataset.
Large Data Volumes: Indexing is used when dealing with sizable amounts of data, as it helps maintain efficient query performance even as the dataset grows.
When To Not Use Indexing?
Frequent Write Operations: Avoid excessive indexing if your database experiences frequent insert, update, or delete operations, as each index must be updated on every write, which slows down these operations and consumes additional storage space. In particular, avoid indexing columns whose values are modified frequently.
Small Datasets: For relatively small datasets, indexing may not provide significant performance gains and can introduce unnecessary overhead. In such cases, the benefits may not outweigh the costs.
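The write overhead described above can be observed with a small experiment. This is a sketch under stated assumptions: it uses Python's sqlite3 module with an in-memory database, a made-up helper called time_inserts, and 50,000 generated rows, so the absolute numbers say nothing about PostgreSQL or about a 50 million row table. It only shows how to compare insert time with and without indexes.

```python
import sqlite3
import time

def time_inserts(with_indexes, n_rows=50_000):
    """Insert n_rows into a fresh table; return (seconds taken, rows stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT, country TEXT)")
    if with_indexes:
        # One index per column, so every insert also updates three indexes.
        for col in ("first_name", "last_name", "country"):
            conn.execute(f"CREATE INDEX idx_{col} ON customers ({col})")
    rows = [
        (f"f{(i * 7919) % n_rows}", f"l{(i * 104729) % n_rows}",
         "India" if i % 2 else "United Kingdom")
        for i in range(n_rows)
    ]
    start = time.perf_counter()
    with conn:  # run all inserts in a single transaction
        conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT count(*) FROM customers").fetchone()[0]
    conn.close()
    return elapsed, count

plain_secs, plain_count = time_inserts(with_indexes=False)
indexed_secs, indexed_count = time_inserts(with_indexes=True)

print(f"no indexes:    {plain_secs:.3f} s for {plain_count} rows")
print(f"three indexes: {indexed_secs:.3f} s for {indexed_count} rows")
```

Timings vary from run to run and from machine to machine, so it is worth repeating the measurement a few times before drawing conclusions.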
Conclusion
If you are looking for ways to improve the performance of your database, then database indexing is a good place to start. By creating indexes on the columns that are frequently used in queries, you can significantly improve the performance of your database and make your queries faster. However, it is important to weigh the benefits and drawbacks of indexing before making a decision.
Top comments (14)
Amazing post!
Now I will try this with 500 million rows instead 😎
🙌
This is awesome. Your content is great.
Thanks Karthik
Wow. This is really intuitive. Thanks
Thank you
good content
Thanks 🙌
Great post
Thank you
Thank you for your post.
On my machine the numbers were 6.348 s and 1.945 s.
To measure this kind of performance I used the aggregate function COUNT, because DBeaver automatically limits the number of rows returned:
SELECT count(*) FROM customers WHERE country='United Kingdom';
Query Patterns: Understand which queries are frequently executed on the large tables. Analyze the SELECT, JOIN, and WHERE clauses to determine the most common access patterns.
Choose Appropriate Index Columns: Select columns that are frequently used in WHERE clauses and JOIN conditions. Focus on columns with high selectivity (many unique values), as indexing on these columns will yield better results.
Understand Index Types: Different database systems offer various index types, such as B-tree, Bitmap, or Hash indexes. Understand the strengths and limitations of each type and choose the most suitable for your scenario....
Hmm I wonder how this would work with larger data, like the content of a post or a blog perhaps... maybe we would need to use some natural language processing to extract keywords and then indexing based on that? I'm not sure though just a thought...
Database indexing accelerates queries, yet impacts write operations. Balance its benefits for better database performance. Consider data volume and query frequency.