shallowdepth

Posted on Jan 23, 2022 • Originally published at shallowdepth.online

5 ways to implement case-insensitive search in SQLite with full Unicode support

#database #python #sql #sqlite

Recently I needed a case-insensitive search in SQLite to check if an item with the same name already exists in one of my projects – listOK. At first, it looked like a simple task, but upon deeper dive, it turned out to be easy, but not simple at all, with many twists and turns.

Built-in SQLite capabilities and their drawbacks

In SQLite you can get a case-insensitive search in three ways:

-- 1. Use a NOCASE collation
-- (we will look at other ways for applying collations later):
SELECT * 
    FROM items 
    WHERE text = "String in AnY case" COLLATE NOCASE;

-- 2. Normalize all strings to the same case,
-- does not matter lower or upper:
SELECT * 
    FROM items 
    WHERE LOWER(text) = "string in lower case";

-- 3. Use LIKE operator which is case insensitive by default:
SELECT * 
    FROM items 
    WHERE text LIKE "String in AnY case";

If you use SQLAlchemy and its ORM these approaches will look as follows:

from sqlalchemy import func
from sqlalchemy.orm.query import Query

from package.models import YourModel


text_to_find = "Text in AnY case"

# NOCASE collation
Query(YourModel)
.filter(
    YourModel.field_name.collate("NOCASE") == text_to_find
)

# Normalizing text to the same case
Query(YourModel)
.filter(
    func.lower(YourModel.field_name) == text_to_find.lower()
).all()

# LIKE operator. No need to use SQLAlchemy's ilike
# since SQLite LIKE is already case-insensitive.
Query(YourModel)
.filter(YourModel.field_name.like(text_to_find))

All these approaches are not ideal. First, without special considerations they do not make use of indexes on the field they are working on, with LIKE being the worst offender: in most cases it is incapable of using indexes. More on the use of indexes for case-insensitive queries is below.

Second, and more importantly, they have a rather limited understanding of what case-insensitive mean:

SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.

It is not a problem if you plan to work with strings that contain only English alphabet letters, numbers, etc. I needed the full Unicode spectrum, so a better solution was in order.

Below I summarize five ways to achieve case insensitive search/comparison in SQLite for all Unicode symbols. Some of these solutions can be adapted to other databases and for implementing Unicode-aware LIKE, REGEXP, MATCH, and other functions, although these topics are out of the scope of this post.

We will look at the pros and cons of each approach, implementation details, and, finally, at indexes and performance considerations.

Solutions

1. ICU extension

Official SQLite documentation mentions the ICU extension as a way to add complete support for Unicode in SQLite. ICU stands for International Components for Unicode.

ICU solves the problems of both case-insensitive LIKE and comparison/search, plus adds support for different collations for a good measure. It may even be faster than some of the later solutions since it is written in C and is more tightly integrated with SQLite.

However, it comes with its challenges:

It is a new type of dependency: not a Python library, but an extension that should be distributed together with the application.
ICU needs to be compiled before use, potentially for different OS and platforms (not tested).
ICU does not itself implement Unicode conversions, but relies on the underline operating system – I have seen multiple mentions of OS-specific issues, especially with Windows and macOS.

All other solutions will depend on your Python code to perform the comparison, so it is important to choose the right approach to converting and comparing strings.

Choosing the right python function for case-insensitive comparison

To perform case-insensitive comparison and search we need to normalize strings to one case. My first instinct was to use str.lower() for this. It will work in most circumstances, but it is not the proper way. Better to use str.casefold() (docs):

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

Therefore, below we will use the str.casefold() function for all conversions and comparisons.

2. Application-defined collation

To perform a case-insensitive search for all Unicode symbols we need to define a new collation in the application after connecting to the database (documentation). Here you have a choice – overload the built-in NOCASE or create your own – we will discuss the pros and cons below. For the sake of an example we will use a new name:

import sqlite3

# Custom collation, maybe it is more efficient
# to store strings
def unicode_nocase_collation(a: str, b: str):
    if a.casefold() == b.casefold():
        return 0
    if a.casefold() < b.casefold():
        return -1
    return 1

connection.create_collation(
    "UNICODE_NOCASE", unicode_nocase_collation
)

# Connect to the DB and register the function
connection = sqlite3.connect("your_db_path")
connection.create_collation(
    "UNICODE_NOCASE", unicode_nocase_collation
)

# Or, if you use SQLAlchemy you need to register
# the collation via an event
@sa.event.listens_for(sa.engine.Engine, 'connect')
def sqlite_engine_connect(connection, _):
    connection.create_collation(
    "UNICODE_NOCASE", unicode_nocase_collation
)

Collations have several advantages compared to the next solutions:

They are easy to use. You can specify collation in the table schema and it will be automatically applied to all queries and indexes on this field unless you specify otherwise:

CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);

For the sake of completeness, let's look at two more ways to use collations:

-- In a particular query:
SELECT * FROM items
    WHERE text = "Text in AnY case" COLLATE UNICODE_NOCASE;

-- In an index:
CREATE INDEX IF NOT EXISTS idx1 
    ON test (text COLLATE UNICODE_NOCASE);

-- Word of caution: your query and index 
-- must match exactly,including collation, 
-- otherwise, SQLite will perform a full table scan.
-- More on indexes below.
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'something';
-- Output: SCAN TABLE test
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'something' COLLATE NOCASE;
-- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)

Collation provides case-insensitive sorting with ORDER BY out of the box. It is especially easy to get if you define the collation in the table schema.

Performance-wise collations have some peculiarities, which we will discuss further.

3. Application-defined SQL function

Another way to achieve case-insensitive search is to create an application-defined SQL function (documentation):

import sqlite3

# Custom function
def casefold(s: str):
    return s.casefold()

# Connect to the DB and register the function
connection = sqlite3.connect("your_db_path")
connection.create_function("CASEFOLD", 1, casefold)

# Or, if you use SQLAlchemy you need to register 
# the function via an event
@sa.event.listens_for(sa.engine.Engine, 'connect')
def sqlite_engine_connect(connection, _):
    connection.create_function("CASEFOLD", 1, casefold)

In both cases create_function accepts up to four arguments:

name of the function as it will be used in the SQL queries
number of arguments the function accepts
the function itself
optional bool deterministic, default False (added in Python 3.8) – it is important for indexes, which we will discuss below.

As with collations, you have a choice – overload built-in function (for instance, LOWER) or create new. We will look at it in more detail later.

4. Compare in the application

Another way of case-insensitive search would be comparing in the app itself, especially if you could narrow down the search by using an index on other fields. For instance, in listOK case-insensitive comparison is needed for items in a particular list. Therefore, I could select all items in the list, normalize them to one case and compare them with the normalized new item.

Depending on your circumstances it is not a bad solution, especially if the subset you will be comparing with is small. However, you will not be able to utilize database indexes on the text, only on other parameters you will be using to narrow down the scope.

The advantage of this approach is its flexibility: in the application you can check not only equality but, for instance, implement "fuzzy" comparison to take into account possible misprints, singular/plural forms, etc. This is the route I chose for listOK since the bot needed fuzzy comparison for the "smart" item creation.

In addition, it eliminates any coupling with the database – it is simple storage that does not know anything about the data.

5. Store normalized field separately

There is one more solution: create a separate column in the database and keep there normalized text you will be searching on. For instance, the table may have this structure (only relevant fields):

id	name	name_normalized
1	Sentence capitalization	sentence capitalization
2	CAPITAL LETTERS	capital letters
3	Non-ASCII symbols: Найди Меня	non-ascii symbols: найди меня

This may look excessive at first: you always need to keep the normalized version updated and effectively doubling the size of the name field. However, with ORMs or even manually it is easy to do and the disk space plus RAM is relativity cheap.

Advantages of this approach:

It completely decouples the application and database – you can easily switch.
You can pre-process normalized filed if your queries require it (trim, remove punctuation or spaces, etc.).

Should you overload built-in functions and collations?

When using application-defined SQL functions and collations you often have a choice: use a unique name or overload built-in functionality. Both approaches have their pros and cons in two main dimensions:

First, reliability/predictability when for some reason (a one-off mistake, bug, or intentionally) you do not register these functions or collations:

Overloading: the database will still work, but the results may not be correct:
- the built-in function/collation will behave differently than their custom counterparts;
- if you used now absent collation in an index it will appear to work, but results may be wrong even when reading;
- if the table with index and index using custom function/collation gets updated, the index may get corrupted (updated using built-in implementation), but continue working as if nothing happened.
Not overloading: the database will not work in any respects where the absent functions or collations are used:
- if you use an index on an absent function you will be able to use it for reading, but not for updates;
- indexes with application-defined collation will not work at all, since they use the collation while searching in the index.

Second, accessibility outside of the main application: migrations, analytics, etc.:

Overloading: you will be able to modify the database without problem, keeping in mind the risk of corrupting indexes.
Not overloading: in many cases you will need to register these functions or collations or take extra steps to avoid parts of the database that depend on it.

If you decide to overload, it may be a good idea to rebuild indexes based on custom functions or collations in case they get wrong data recorded there, for example:

-- Rebuild all indexes using this collation
REINDEX YOUR_COLLATION_NAME;

-- Rebuild particular index
REINDEX index_name;

-- Rebuild all indexes
REINDEX;

Performance of application-defined functions and collations

Custom functions or collation are much slower than built-in functions: SQLite "returns" to your application each time it calls the function. You can easily check it by adding a global counter to the function:

counter = 0

def casefold(a: str):
    global counter
    counter += 1
    return a.casefold()

# Work with the database

print(counter)
# Number of times the function has been called

If you are querying rarely or your database is small, you will not see any meaningful difference. However, if you do not use an index on this function/collation, the database may perform a complete table scan applying the function/collation on each row. Depending on the size of the table, hardware, and the number of requests, the low performance may be surprising. Later I will publish a review of application-defined functions and collations performance.

Strictly speaking, collations are a bit slower than SQL functions since for each comparison they need to casefold two strings, instead of one. Although this difference is very small: in my tests, the casefold function was faster than similar collation for around 25% which amounted to a difference of 10 seconds after 100 million iterations.

Indexes and case-insensitive search

Indexes and functions

Let's start with the basics: if you define an index on any field, it will not be used in queries on a function applied to this field:

CREATE TABLE table_name (id INTEGER, name VARCHAR);
CREATE INDEX idx1 ON table_name (name);
EXPLAIN QUERY PLAN
    SELECT id, name FROM table_name WHERE LOWER(name) = 'test';
-- Output: SCAN TABLE table_name

For such queries you need a separate index with the function itself:

CREATE INDEX idx1 ON table_name (LOWER(name));
EXPLAIN QUERY PLAN
    SELECT id, name 
        FROM table_name WHERE LOWER(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)

In SQLite, it can be done on a custom function as well, but it must be marked as deterministic (meaning that with the same inputs it returns the same result):

connection.create_function(
    "CASEFOLD", 1, casefold, deterministic=True
)

After that you can create an index on a custom SQL function:

CREATE INDEX idx1 
    ON table_name (CASEFOLD(name));
EXPLAIN QUERY PLAN
    SELECT id, name 
        FROM table_name WHERE CASEFOLD(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)

Indexes and collations

The situation with collations and indexes is similar: for a query to utilize an index they must use the same collation (implied or provided expressly), otherwise, it will not work.

-- Table without specified collation will use BINARY
CREATE TABLE test (id INTEGER, text VARCHAR);

-- Create an index with a different collation
CREATE INDEX IF NOT EXISTS idx1 ON test (text COLLATE NOCASE);


-- Query will use default column collation -- BINARY
-- and the index will not be used
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'test';
-- Output: SCAN TABLE test


-- Now collations match and index is used
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'test' COLLATE NOCASE;
-- Output: SEARCH TABLE test USING INDEX idx1 (text=?)

As noted above, collation can be specified for a column in the table schema. This is the most convenient way – it will be applied to all queries and indexes on the respective field automatically unless you specify otherwise:

-- Using application defined collation UNICODE_NOCASE from above
CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);

-- Index will be built using the collation
CREATE INDEX idx1 ON test (text);

-- Query will utilize index and collation automatically
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'something';
-- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)

Which solution to choose?

To choose a solution we need some criteria for comparison:

Simplicity – how difficult it is to implement and maintain it
Performance – how fast your queries will be
Extra space – how much additional database space the solution requires
Coupling – how much your solution intertwines the code and storage

Solution	Simplicity	Performance (relative, without index)	Extra space	Coupling
ICU extension	Difficult: requires a new type of dependency and compiling	Medium to high	No	Yes
Custom collation	Simple: allows to set collation in the table schema and apply it automatically to any query on the field	Low	No	Yes
Custom SQL function	Medium: requires either building an index based on it or using in all relevant queries	Low	No	Yes
Comparing in the app	Simple	Depends on the use case	No	No
Storing normalized string	Medium: you need to keep the normalized string updated	Low to Medium	x2	No

As usual, the choice of the solution will depend on your use case and performance demands. Personally, I would go either with custom collation, comparing in the app, or storing a normalized string. For example, in listOK, I first used a collation and moved to comparing in the app when added fuzzy search.

Top comments (3)

Rasmus Schultz • Dec 27 '24

Having spent more than a day on this, I came up with a 6th way 😅

Install sqlpkg
Run sqlpkg init (only once in a new project)
Run sqlpkg install nalgeon/text

Load the extension from .sqlpkg/nalgeon/text/text.so, and start using COLLATE TEXT_NOCASE in your SQL. 🙂👍

So just to explain:

webinstall.dev is this wonderful initiative to provide simple, cross-platform installers, without requiring a package manager. You can learn more in this 4-minute video.

sqlpkg (sqlpkg.org) is a package manager for SQLite extensions. Note that this gets installed for the user, meaning one shared install for your projects.

The sqlpkg init command creates a .sqlpkg folder for installed SQLite extensions in your project folder - so make sure you run it from your project root. (This is optional - sqlpkg will install to your user folder by default, but you most likely want to control the installed extensions per project, and this makes it easier to locate the installed extension in code.)

The sqlpkg install nalgeon/text command installs the nalgeon/text extension, which includes a text_nocase collation, which supports Unicode.

I don't know if this is better or worse than the official icu extension from the SQLite team, but the docs for the official extension includes a list of disconcerting known bugs and caveats - and it's also not listed in the sqlpkg registry. this extension avoids replacing the like operator, or any standard functions, and instead provides a text_like function, text_upper and text_lower etc. so as not to conflict with SQL queries that rely on the standard behavior. this sounds like an overall better approach, and the developer is apparently an experienced author of many SQLite extensions.

I don't use Python, but I tested this with Deno and jsr:@db/sqlite, and it seems to work very nicely - for example, run SELECT ${"cómo estás"} = ${"CÓMO ESTÁS"} COLLATE TEXT_NOCASE and you should get 1, meaning the Unicode comparison works. ✨

@shallowdepth I would love to hear your analysis of this approach? I would say we're good on simplicity, at least, and I'd assume the performance is good - based on those criteria, I think this may be the best solution I've seen so far? It does require a new type of dependency, but it is very easy to install, and doesn't require compilation. What do you think? 🙂

Rasmus Schultz • Dec 26 '24

Thanks for this very long article about something that should be totally trivial and built-into any serious database engine. 😔

Seriously though, with all the hype around using SQLite in production for smaller web apps microservices etc., I had decided to give this a try - and instead, I ended up spending an entire day researching and trying the most basic things, ultimately with the conclusion that, basically, nothing works out of the box with SQLite. "batteries not included."

As you table in the article nicely sums up, there are no good options for Unicode support in SQLite - the only option that is reasonably within reach for the average JS developer is custom collations, which lead to poor performance.

I just don't even understand... all these clients for Node, Deno, Bun, etc. - none of them ship with Unicode support. Just to make things even more frustrating, they do ship with things like full-text search - which is futile without Unicode support. So what exactly are these clients used for? The entire internet is Unicode.

I do understand why SQLite itself does not ship with Unicode support enabled by default - their priority, being an embedded database first of all, is to be small and fast. Cool for operating system registries, cell phones, devices, etc. I guess.

But I do not understand any of the hype around "using SQLite in production" - by default, with any of the available clients, you can't even do a case insensitive substring search, or sort a table of names correctly, and forget getting anything useful out of the full-text search features, which for some reason are included.

Seriously, what am I missing here? 😅

(and for the record, none of this is a comment on your excellent article, which provides a good overview of all the bad options we have available to actually build anything on the internet with SQLite. lol)

Marcin Szolke/Scholke • Aug 24 '23

SQLIte database in SharedFolder
What is your experience with such a solution, how many people can use this application in parallel ?

DEV Community