mehmet akar

Posted on Jan 22

Postgresql Vector Database: Pgvector Tutorial

#vectordatabase #postgres #postgresqlvectordatabase #pgvector

"Postgresql Vector Database: Pgvector" is an open-source extension for PostgreSQL that enables vector similarity searches, supporting exact and approximate nearest neighbor search. With pgvector, you can store vector embeddings alongside your other relational data, ensuring seamless integration into your existing database workflows.

Postgresql Vector Database: Pgvector Key Features

Similarity Metrics: Supports L2 distance, inner product, cosine similarity, L1 distance, Hamming distance, and Jaccard distance.
Vector Types: Single-precision, half-precision, binary, and sparse vectors.
Indexes: Includes HNSW and IVFFlat for approximate nearest neighbor search.
PostgreSQL Features: Leverages PostgreSQL’s ACID compliance, joins, and point-in-time recovery.

Postgresql Vector Database: Pgvector Installation

On Linux and macOS

cd /tmp
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install

Alternative Methods

Docker: docker pull pgvector/pgvector:pg17
Homebrew: brew install pgvector
PGXN: pgxn install vector
APT (Ubuntu/Debian): Follow PostgreSQL APT repository instructions, then sudo apt install postgresql-17-pgvector.

On Windows

Install Visual Studio C++.
Build with nmake:

set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install

Getting Started: Postgresql Vector Database: Pgvector

Enabling the Extension

Run the following in your PostgreSQL database:

CREATE EXTENSION vector;

Creating Tables with Vector Columns

Create a Table:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Add a Vector Column to an Existing Table:

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert Vector Data:

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Query Nearest Neighbors:

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Querying Vectors

Supported Distance Metrics

L2 Distance: <->
Inner Product: <#>
Cosine Distance: <=>
L1 Distance: <+>
Hamming Distance (binary vectors): <~>
Jaccard Distance (binary vectors): <%>

Examples

Find Nearest Neighbors:

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Filter by Distance:

SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;

Distance Aggregates:

SELECT AVG(embedding) FROM items;

Indexing for Performance

pgvector supports two types of indexes for approximate nearest neighbor searches: HNSW and IVFFlat.

HNSW Index

Create an Index:

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

Configure Query Options:

SET hnsw.ef_search = 100;

IVFFlat Index

Create an Index:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Adjust Query Options:

SET ivfflat.probes = 10;

Advanced Features

Half-Precision and Sparse Vectors

Half-Precision:

CREATE TABLE items (embedding halfvec(3));

Sparse Vectors:

CREATE TABLE items (embedding sparsevec(5));
INSERT INTO items (embedding) VALUES ('{1:1,3:2,5:3}/5');

Hybrid Search

Combine pgvector with PostgreSQL’s full-text search for hybrid queries:

SELECT id, content FROM items, plainto_tsquery('search term') query
WHERE textsearch @@ query
ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;

Performance Optimization

Bulk Loading

Use COPY for efficient bulk loading:

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Tuning Parameters

Increase memory for index builds:

SET maintenance_work_mem = '8GB';

Adjust parallel workers for faster indexing:

SET max_parallel_maintenance_workers = 7;

Query Analysis

Use EXPLAIN ANALYZE to debug query performance:

EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Scaling and Monitoring

Scaling Options

Vertical Scaling: Increase server resources.
Horizontal Scaling: Use replicas or sharding with tools like Citus.

Monitoring

Monitor queries using pg_stat_statements:

CREATE EXTENSION pg_stat_statements;
SELECT query, calls, avg_time_ms FROM pg_stat_statements ORDER BY avg_time_ms DESC LIMIT 5;

Postgresql Vector Database: Pgvector empowers PostgreSQL to handle vector similarity searches efficiently, integrating seamlessly into existing data infrastructures. With support for various vector types, advanced indexing, and PostgreSQL’s robust feature set, pgvector is a powerful choice for embedding-based applications.

DEV Community