Introduction
If you're just getting into the new wave of generative AI and natural language processing tools, you won't get far before you start hearing about vector databases like Pinecone or the PostgreSQL vector extension pgvector. These tools generally bill themselves as the missing middleware or 'long term memory' for your custom AI applications. There are quite a few example applications floating around, but they tend to jump right into building an AI app, and if you're like me (totally rusty on vector math because it's been a while since you took it in school) you might prefer a more introductory approach. We'll eventually build a very simple AI-based application, but for the moment we're just going to examine WTF a vector is, how it works, and why it's useful. For this we're using the open-source PostgreSQL extension pgvector, which is straightforward to install, or you can grab a Docker image to play with via the project site. For the rest of this tutorial I'll assume you have a running copy of PostgreSQL that you can access via psql and that already has the pgvector extension installed.
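One small note before we start: depending on how you installed it, you may also need to enable the extension in your database before the VECTOR type shows up. In psql that's typically a one-liner (the extension itself is named 'vector' even though the project is called pgvector):
-- enable the pgvector extension in the current database
CREATE EXTENSION IF NOT EXISTS vector;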
What is a Vector
Ok, so I'm not a math expert, but the way I see it is this: you have a set of things that are related in some manner, for example a list of cities. Given that list, you can describe each city along various dimensions. One dimension might be population, another could be country, another could be average yearly temperature. The labeling is the important bit, but also the most tedious to code up, so in practice you'll only create dimensions that are meaningful to whatever you're doing.
So let's say you have a list of cities with those dimensions filled in (population, country, average yearly temperature). Once you have that, you could visualize a three-dimensional graph with a dot for each city plotted along each of the three axes. In that cube you'll be able to see that some cities sit closer together than others, and any two cities can be connected with a line. For now, think of each city's coordinates as a vector, and the length of the line between two cities as the distance between their vectors. Cities with similar properties will have a shorter distance between them than cities with dissimilar ones. Now that you have that graph, wouldn't it be nice if you could, for example, get a list of cities that are similar in average yearly temperature and population? Well, that's exactly what you can do with pgvector. You can create a table with the data encoded just like I said and run such a similarity query.
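Just to make that concrete, here's a rough sketch of what the cities version might look like. The table, column names and values here are hypothetical, just to show the shape of the thing; we'll build the real version with colors in a moment.
CREATE TABLE cities (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
-- [population, country code, average yearly temperature]
city_vector VECTOR(3)
);
INSERT INTO cities (name, city_vector) VALUES
('Lisbon', '[545000, 351, 17.4]'),
('Porto', '[232000, 351, 15.2]');
-- find the five cities closest to Lisbon across all three dimensions
-- (<-> is pgvector's Euclidean distance operator)
SELECT name
FROM cities
ORDER BY city_vector <-> (SELECT city_vector FROM cities WHERE name = 'Lisbon')
LIMIT 5;
In a real version you'd want to scale the values so population doesn't completely swamp temperature, but you get the idea: similarity becomes something you can ORDER BY.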
Example
The cities example is a great thought experiment, but for actually playing around with SQL let's make it even simpler, because I really don't feel like gathering up all that data. Let's use colors. Colors are a great example set because we already have a nice simple way of labeling them: the RGB color space. If you've ever done CSS you'll know that every color you can draw on the screen can be represented by its amount of red, green and blue, each a value from zero to 255. That gives us a nice three-dimensional graph to play with. Ok, so you're logged into a modern instance of PostgreSQL with pgvector installed. Let's make a table to hold the data:
CREATE TABLE colors (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
color_vector VECTOR(3)
);
So we're declaring a vector with 3 dimensions, one for each value of RGB. Let's insert some colors. I asked ChatGPT to generate this for me so I could save time :). Here's a sample so you get the idea; if you're following along at home, the full file is gisted here.
INSERT INTO colors (name, color_vector) VALUES
-- Primary colors
('Red', '[255, 0, 0]'),
('Green', '[0, 255, 0]'),
('Blue', '[0, 0, 255]'),
-- Various shades of Red
('Crimson', '[220, 20, 60]'),
('Maroon', '[128, 0, 0]'),
('Indian Red', '[205, 92, 92]'),
('Fire Brick', '[178, 34, 34]'),
('Dark Red', '[139, 0, 0]'),
-- Various shades of Green
('Lime Green', '[50, 205, 50]'),
('Forest Green', '[34, 139, 34]'),
('Olive Drab', '[107, 142, 35]'),
('Sea Green', '[46, 139, 87]'),
('Dark Green', '[0, 100, 0]'),
-- Various shades of Blue
('Dodger Blue', '[30, 144, 255]'),
('Sky Blue', '[135, 206, 235]'),
('Navy', '[0, 0, 128]'),
('Midnight Blue', '[25, 25, 112]'),
('Slate Blue', '[106, 90, 205]');
Ok, so now you have a nice table of various colors and shades, with their RGB values encoded as 3-dimensional vectors. How do you use this? Well, let's say you want to get the top five colors most similar to Green (but not Green itself). Here's the SQL:
SELECT id, name, color_vector
FROM colors
WHERE name <> 'Green'
ORDER BY color_vector <=> (
SELECT color_vector
FROM colors WHERE name = 'Green'
)
LIMIT 5;
id | name | color_vector
----+--------------+--------------
13 | Dark Green | [0,100,0]
9 | Lime Green | [50,205,50]
10 | Forest Green | [34,139,34]
56 | Lawn Green | [124,252,0]
57 | Chartreuse | [127,255,0]
(5 rows)
Ok, so that's pretty much as expected: from the list of about 60 colors of various shades, we got the five closest to (but not exactly) Green.
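A quick note on that <=> operator: pgvector gives you a few distance operators to ORDER BY, and which one you pick changes what 'similar' means. <=> is cosine distance (it compares the direction of the vectors and ignores their length), <-> is plain Euclidean distance, and <#> is negative inner product. For RGB values, Euclidean distance is arguably closer to what you'd intuitively call a similar color, and swapping it in is a one-character change:
SELECT id, name, color_vector
FROM colors
WHERE name <> 'Green'
-- <-> orders by Euclidean (straight line) distance instead of cosine distance
ORDER BY color_vector <-> (
SELECT color_vector
FROM colors WHERE name = 'Green'
)
LIMIT 5;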
How about colors similar to both red and blue?
SELECT
id,
name,
color_vector
FROM colors
ORDER BY color_vector <=> (
SELECT AVG(color_vector)
FROM colors
WHERE name='Red' OR name='Blue'
)
LIMIT 5;
id | name | color_vector
----+---------------+--------------
26 | Magenta | [255,0,255]
20 | Purple | [128,0,128]
55 | Dark Orchid | [153,50,204]
38 | Indigo | [75,0,130]
52 | Medium Orchid | [186,85,211]
Again, pretty much as expected: various shades of purple.
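One more handy trick: you don't have to hide the distance inside the ORDER BY, you can return it too, which makes it easier to see how close each match really is. For example, here are the five colors closest to pure green, with the cosine distance returned alongside (the target vector written as a literal this time):
SELECT name,
color_vector <=> '[0, 255, 0]' AS cosine_distance -- 0 means identical direction
FROM colors
ORDER BY cosine_distance
LIMIT 5;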
Summary
As you can see, pgvector has great applications outside of AI. For example, you can make good use of this technique for various kinds of reporting, particularly when you need to group things by their overall similarity. The hard part is always going to be choosing the dimensions and coding up the values; most cases won't be as easy as colors, where you have a ready-to-go set of labeled data. Generative AI can help with this!
Things we didn't cover
We left out a few bits, like adding an index, which you'll want once you have a larger table (there's a quick example below). I really recommend reading the pgvector docs, because there are all sorts of settings for choosing your preferred balance of speed and accuracy. And you still might be wondering how this relates to generative AI tools like ChatGPT. We'll cover that in depth in the next post, but just to give you a taste: you can think of a large language model as a huge set of vector encodings. You'll see what I mean!
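Here's that quick index example. For our colors table, an approximate nearest neighbor index for the cosine distance operator we've been using looks something like this (HNSW needs a reasonably recent pgvector; older versions offer ivfflat instead):
-- index for the <=> (cosine distance) operator
CREATE INDEX ON colors USING hnsw (color_vector vector_cosine_ops);
With only ~60 rows it won't matter, but on a big table it's the difference between scanning every row and an approximate index lookup.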