In the very beginning at least, Python reigned supreme in terms of tooling for AI development. However, the Spring community has recently delivered its answer, and it's called Spring AI! This means that if you're a Spring developer with a working knowledge of concepts such as beans, auto-configurations and starters, you're covered: you can write your AI apps following the standard patterns you're already familiar with.
In this post, I want to share an exploration that started with the goal of taking Spring AI for a little spin and trying out the capabilities of open-source LLMs (large language models). I got curious along the way and ended up also looking at some pretty low-level details about data storage for AI applications.
To support the exploration, I've developed a simple Retrieval Augmented Generation (RAG) workflow that runs completely locally on the laptop, for free. If you're interested, you can find the code itself here. Basically, I've used Testcontainers to create a Postgres database container with the pgvector extension to store text embeddings, and an open-source LLM to which I send requests: Meta's llama3 (8B), served through ollama.
Alright, let's have a look at what I'm building with this setup.
Fun with sayings and essays
The RAG workflow works like this. I first ask the LLM to generate a couple of sayings; for example, this is one of the ones I got back:
A bad excuse is better than none.
Then, I ask it to "elaborate" and generate an essay (of under 200 words) about the saying it provided me earlier. Here is one such essay generated by llama3 for the saying above:
The saying suggests that while an inadequate justification may not be satisfactory, it's still preferable to having no explanation at all. This phrase highlights the importance of accountability and acknowledging responsibility for one's actions or decisions.
In essence, a bad excuse implies that some effort has been made to justify or rationalize one's behavior, even if the reasoning is flawed or unconvincing. On the other hand, not offering any excuse whatsoever can come across as evasive or dishonest, leading to mistrust and resentment.
This saying encourages people to take ownership of their mistakes and apologize for any wrongdoing, rather than remaining silent and avoiding accountability. It also acknowledges that everyone makes errors and that a bad excuse is better than none because it shows a willingness to acknowledge and learn from those mistakes.
Ultimately, the saying promotes honesty, responsibility, and personal growth by emphasizing the value of taking ownership of one's actions, no matter how imperfect the explanation may be.
Then, I take these essays and create embeddings from them, which I store in Postgres using the pgvector extension, in columns of the vector data type. All of this happens with the help of Spring AI abstractions and the least amount of custom code.
I will skip the part of this process called "chunking". When you are dealing with very large documents, or want to isolate sections in your data (like in e-mails, where you have subject, sender, etc.), you might look into doing that; a sketch of what it could look like follows below.
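If you do need chunking, Spring AI ships a splitter you can put in front of the vector store. Below is a minimal sketch (not used in this demo; largeEssayText is just a placeholder variable) that splits documents into token-based chunks before storing them:

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Sketch: split large documents into smaller chunks before creating embeddings for them.
List<Document> documents = List.of(new Document(largeEssayText));
TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> chunks = splitter.apply(documents);
vectorStore.add(chunks);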
So far so good. At this point, we have stored the data we need in the next steps.
I will then take each saying and do a similarity search on the embeddings to retrieve the corresponding essay for each saying. Next, I will supply the retrieved essays back to the LLM and ask it to guess the original saying from which each essay was generated. Finally, I will check how many it got right.
What do you think, will it manage to correctly guess the saying from just the essay? After all, it has generated the essays from those sayings itself in the first place. A human would have no problem doing this.
But let's first have a look at how the program is set up from a technical perspective. We will look at the results and find out how capable the LLM is a bit later.
The LLM and the vector store in Testcontainers
Testcontainers makes it very easy to integrate services that each play a specific role in use-cases like this. All that is required to set up the database and the LLM is the couple of beans below, and you're good to go!
@TestConfiguration(proxyBeanMethods = false)
class RagDemoApplicationConfiguration {

    private static final String POSTGRES = "postgres";

    @Bean
    @ServiceConnection
    PostgreSQLContainer<?> postgreSQLContainer() {
        return new PostgreSQLContainer<>("pgvector/pgvector:pg16")
                .withUsername(POSTGRES)
                .withPassword(POSTGRES)
                .withDatabaseName(POSTGRES)
                .withInitScript("init-script.sql");
    }

    @Bean
    @ServiceConnection
    OllamaContainer ollamaContainer() {
        return new OllamaContainer("ollama/ollama:latest");
    }
}
I've used the @ServiceConnection annotation, which allows me to type less configuration code. Being able to do this for the ollama container is quite new, thanks to this recent contribution from Eddú Meléndez.
You might have noticed there's an init script there. It's only a single line of code, and its purpose is to install a Postgres extension called pg_buffercache, which lets me inspect the contents of the Postgres shared buffers in RAM. I'm interested in having a look at this in order to better understand the operational characteristics of working with vectors. In other words, what are the memory demands?
create extension pg_buffercache;
Now, to fully initialise our LLM container such that it's ready to actually handle our requests for our sayings and essays, we need to pull the models we want to work with, like so:
ollama.execInContainer("ollama", "pull", "llama3");
ollama.execInContainer("ollama", "pull", "nomic-embed-text");
If you rerun the program, you will see that it pulls the models again every time. To avoid that, you can have a look at this repo and consider using the baked images that already have the models in them.
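If you don't want to bake an image, a small guard like the sketch below could at least skip the download when a reused container already has the model; it simply parses the output of ollama list (the pullIfMissing helper is hypothetical, not part of the demo code):

// Sketch: only pull a model if "ollama list" doesn't already show it.
static void pullIfMissing(OllamaContainer ollama, String model) throws IOException, InterruptedException {
    Container.ExecResult listed = ollama.execInContainer("ollama", "list");
    if (!listed.getStdout().contains(model)) {
        ollama.execInContainer("ollama", "pull", model);
    }
}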
You will notice that besides llama3, which I mentioned before and which takes care of generating text, I am also pulling a so-called embedding model: nomic-embed-text. This is what converts the text into embeddings so that they can be stored.
The ones I'm using are not the only options. New LLM bindings and embedding models are added all the time in both Spring AI and ollama, so refer to the docs for the up-to-date list, as well as the ollama website.
Configuration properties
Let's have a look at the vector store configuration. Here's how that looks:
@DynamicPropertySource
static void pgVectorProperties(DynamicPropertyRegistry registry) {
    registry.add("spring.ai.vectorstore.pgvector.index-type", () -> "HNSW");
    registry.add("spring.ai.vectorstore.pgvector.distance-type", () -> "COSINE_DISTANCE");
    registry.add("spring.ai.vectorstore.pgvector.dimensions", () -> 768);
}
The first one is called index-type. It means that we are creating an index in our vector store. We don't necessarily always need an index - it's a trade-off. With indexing, the idea is that we gain speed (and other things, like uniqueness guarantees, etc.) at the expense of storage space. With indexing vectors, however, the trade-off also includes the relevance aspect. Without an index, the similarity search is based on the kNN algorithm (k-nearest neighbours), which checks all vectors in the table. With an index, it performs an ANN (approximate nearest neighbours) search instead, which is faster but might miss some results. Indeed, it's quite the balancing act.
Let's have a look at the other configuration options for indexing, which I extracted from the Spring AI code:
NONE,
IVFFLAT,
HNSW;
In the beginning, there was only one indexing option in pgvector, namely ivfflat. More recently, HNSW (Hierarchical Navigable Small Worlds) was added, which is based on different construction principles, is more performant, and keeps getting better. The general recommendation as of now is to go for HNSW.
The next configuration option is the distance-type, which is the procedure used to compare vectors in order to determine similarity. Here are our options:
EUCLIDEAN_DISTANCE,
NEGATIVE_INNER_PRODUCT,
COSINE_DISTANCE;
I'll go with the cosine distance, but it might be helpful to have a look at their properties because it might make a difference for your use-case.
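As a quick refresher on what cosine distance actually measures (one minus the cosine of the angle between the two vectors), here is a small illustrative method. pgvector computes this for us through the <=> operator, so this is purely for intuition:

// Illustration only: cosine distance = 1 - (a·b) / (|a| * |b|)
static double cosineDistance(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

A distance of 0 means the vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.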
The last configuration property is called dimensions, which represents the number of components (float values) that each embedding consists of. This number has to match the number of dimensions we set up in our vector store. In our example, nomic-embed-text produces 768, but other models produce more, or fewer. If the model returns embeddings with more dimensions than we have set up in our table, it won't work. Now you might wonder, should you strive for as high a number of dimensions as possible? Apparently the answer is no: this blog from Supabase shows that fewer dimensions are better.
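If you're not sure how many dimensions your embedding model produces, you can ask Spring AI at runtime and compare the result against the configured value. A small sketch (it assumes an auto-configured EmbeddingModel bean is available; check your Spring AI version for the exact API):

// Sketch: sanity-check that the model output matches the vector(768) column definition.
int modelDimensions = embeddingModel.dimensions();   // e.g. 768 for nomic-embed-text
if (modelDimensions != 768) {
    throw new IllegalStateException("Embedding model produces " + modelDimensions
            + " dimensions, but vector_store is defined with vector(768)");
}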
Under the hood - what's created in Postgres?
Let's explore what Spring AI has created for us in Postgres with this configuration. In a production application, however, you might want to take full control and drive the schema through SQL files managed by migration tools such as Flyway. We didn't do this here, for simplicity.
Firstly, we find that it created a table called vector_store with the following structure:
postgres=# \d vector_store;
Table "public.vector_store"
Column | Type | Collation | Nullable | Default
-----------+-------------+-----------+----------+--------------------
id | uuid | | not null | uuid_generate_v4()
content | text | | |
metadata | json | | |
embedding | vector(768) | | |
Indexes:
"vector_store_pkey" PRIMARY KEY, btree (id)
"spring_ai_vector_index" hnsw (embedding vector_cosine_ops)
Nothing surprising here. It's in line with the configuration we saw above in the Java code I showed you earlier. For example, we notice the embedding column of type vector, with 768 dimensions. We also notice the index spring_ai_vector_index and the vector_cosine_ops operator class, which we expected given the "distance-type" setting earlier. The other index, namely vector_store_pkey, is created automatically by Postgres; it creates such an index for every primary key by itself.
The command that Spring AI used to create our index is the following:
CREATE INDEX IF NOT EXISTS %s ON %s USING %s (embedding %s)
This creates an index with the default configuration. It might be good to know that you have a couple of options if you'd like to tweak the index configuration for potentially better results (depending on your use-case):
- m - the max number of connections per layer
- ef_construction - the size of the dynamic candidate list for constructing the graph
These are the boundaries you can pick from for these settings:
Setting | default | min | max |
---|---|---|---|
m | 16 | 2 | 100 |
ef_construction | 64 | 4 | 1000 |
In order to understand the internals of this index and what effect changing the above options might have, here is a link to the original paper. See also this post by J. Katz in which he presents results of experimenting with various combinations of the above settings.
When you know what values you want to set for these settings you can create the index like so:
CREATE INDEX ON vector_store
USING hnsw (embedding vector_cosine_ops)
WITH (m = 42, ef_construction = 42);
In case you get an error when constructing an index, it's worth checking whether Postgres has enough memory to perform this operation. You can adjust the memory available for it through the maintenance_work_mem setting.
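For example, assuming you build the index yourself through a JdbcTemplate (or a migration script), you could raise the limit for that session along these lines; keep both statements on the same connection (e.g. inside one transaction) if you're going through a connection pool:

// Sketch: give the index build more memory for this session, then (re)create the index.
jdbcTemplate.execute("SET maintenance_work_mem = '512MB'");
jdbcTemplate.execute("""
        CREATE INDEX IF NOT EXISTS spring_ai_vector_index
        ON vector_store USING hnsw (embedding vector_cosine_ops)
        """);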
Let's now check how our embedding column is actually stored on disk. We can use the following query for that:
postgres=# select
att.attname,
case
att.attstorage
when 'p' then 'plain'
when 'm' then 'main'
when 'e' then 'external'
when 'x' then 'extended'
end as attstorage
from
pg_attribute att
join
pg_class tbl on tbl.oid = att.attrelid
join
pg_namespace ns on tbl.relnamespace = ns.oid
where
tbl.relname = 'vector_store' and
ns.nspname = 'public' and
att.attname = 'embedding';
Result:
-[ RECORD 1 ]---------
attname | embedding
attstorage | external
Alright, so it uses the external storage type. This means that Postgres stores this column in a separate, so-called TOAST table. It does this when columns are so large that it can't fit at least 4 rows in a page. Interestingly, it will not also attempt to compress the values to shrink them even more; for compressed columns the result above would have said extended instead of external.
Normally, when you update one or multiple columns of a row, Postgres will, instead of overwriting, make a copy of the entire row (it's an MVCC database). But if there are any large TOASTed columns, then during an update it will copy only the other columns. It will copy the TOASTed column only when that is updated. This makes it more efficient by minimising the amount of copying around of large values.
Where are these separate tables though? We haven't created them ourselves, they are managed by Postgres. Let's try to locate this separate TOAST table using this query:
postgres=# select
relname,
oid
from
pg_class,
(select
reltoastrelid
from
pg_class
where
relname = 'vector_store') as vector_store
where
oid = vector_store.reltoastrelid or
oid = (select
indexrelid
from
pg_index
where
indrelid = vector_store.reltoastrelid
);
relname | oid
----------------------+-------
pg_toast_16630 | 16634
pg_toast_16630_index | 16635
So far so good. We now have the TOAST table ID. Let's use it to have a look at the structure of the TOAST table; for example, what columns does it have? Note that these tables live in the pg_toast schema, so to get there we have to set the search_path to pg_toast, like below:
postgres=# set search_path to pg_toast;
SET
postgres=# \d pg_toast_16630;
TOAST table "pg_toast.pg_toast_16630"
Column | Type
------------+---------
chunk_id | oid
chunk_seq | integer
chunk_data | bytea
Owning table: "public.vector_store"
Indexes:
"pg_toast_16630_index" PRIMARY KEY, btree (chunk_id, chunk_seq)
We can learn a couple of things from this. As expected, the large columns in the main table that have to be "TOASTed" are chunked (split up) and each chunk is identified by a sequence, and is always retrieved using an index.
Postgres has a mechanism to avoid "blasting" the entire shared buffer cache when it needs to do large reads, like sequential scans of a large table. When it has to do this, it uses a 32-page ring buffer so that it doesn't evict other data from the cache. This mechanism does not kick in for TOAST tables, however, so vector-based workloads run without this form of protection.
Okay! We had a very good look at the database part. Let's now "resurface" for a moment and have a look at other topics pertaining to the high level workflow of interacting with the LLM.
Template-based prompts
Initially, I had constructed the prompts for the requests to the LLM in the same class where I was using them. However, I found the following approach in the Spring AI repository itself and adopted it, because it's indeed cleaner this way. It's based on externalised resource files, like so:
@Value("classpath:/generate-essay.st")
protected Resource generateEssay;
@Value("classpath:/generate-saying.st")
protected Resource generateSaying;
@Value("classpath:/guess-saying.st")
protected Resource guessSaying;
This is how one of them looks inside:
Write a short essay under 200 words explaining the
meaning of the following saying: {saying}.
As you can see, I have not applied any sophisticated prompt engineering whatsoever, and kept it simple and direct for now.
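For context, this is roughly what a helper like the createPromptFrom method used in the next section might boil down to; a sketch based on Spring AI's PromptTemplate:

import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;

// Sketch: fill the {saying} placeholder of the template resource and build a Prompt from it.
Prompt createPromptFrom(Resource promptTemplate, Map<String, Object> promptTemplateValues) {
    return new PromptTemplate(promptTemplate).create(promptTemplateValues);
}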
Calling the LLM
Alright, the pieces are starting to fit together! The next thing I'd like to show you is how to call the LLM.
chatModel
        .withModel(model)
        .call(createPromptFrom(promptTemplate, promptTemplateValues))
        .getResult()
        .getOutput()
        .getContent();
I am using the so-called Chat Model API, a powerful abstraction over AI models. This design allows us to switch between models with minimal code changes; if you want to work with a different model, you just change the runtime configuration. It's a nice example of the Dependency Inversion Principle: higher-level modules do not depend on low-level modules, both depend on abstractions.
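To make the principle concrete, here is a minimal sketch (the SayingGenerator class is made up for illustration): the workflow code only sees the ChatModel interface, and the concrete implementation, Ollama in our case, is supplied by Spring's auto-configuration.

import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;

// Sketch: depend on the abstraction; which model answers is decided by configuration.
@Service
class SayingGenerator {

    private final ChatModel chatModel;

    SayingGenerator(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    String generate() {
        return chatModel.call("Give me one short, original saying.");
    }
}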
Storing the embeddings
To store the embeddings, I must say that I found it a pretty complicated procedure:
vectorStore.add(documents);
Just kidding, that's it!
This single command does several things. It first converts the documents (our essays) to embeddings with the help of the embedding model, then it runs the following batched insert statement to get the embeddings into our vector_store table:
INSERT INTO vector_store (id, content, metadata, embedding) VALUES (?, ?, ?::jsonb, ?) ON CONFLICT (id) DO UPDATE SET content = ? , metadata = ?::jsonb , embedding = ?
We can see it actually performs an update of the content column in case a row with that ID is already present in the database (taken care of by the ON CONFLICT part of the query).
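For completeness, the documents list itself is just a collection of Spring AI Document objects wrapping the essay text. A sketch of how it might be built (essaysBySaying and the metadata key are made up for illustration):

import org.springframework.ai.document.Document;

// Sketch: wrap each generated essay in a Document, optionally attaching some metadata.
List<Document> documents = essaysBySaying.entrySet().stream()
        .map(entry -> new Document(entry.getValue(), Map.of("saying", entry.getKey())))
        .toList();

vectorStore.add(documents);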
Similarity searches
To do a similarity search on the stored vectors with Spring AI, it's just a matter of:
vectorStore
        .similaritySearch(SearchRequest.query(saying))
        .getFirst()
        .getContent();
Again, you get a couple of things done for you by Spring AI. It takes the parameter you supply ("saying" in our case) and first creates its embedding using the embedding model we talked about before. Then it uses that embedding to retrieve the most similar results, from which we pick only the first one.
With this configuration (cosine similarity), the SQL query that it will run for you is the following:
SELECT *, embedding <=> ? AS distance FROM vector_store WHERE embedding <=> ? < ? AND metadata::jsonb @@ <nativeFilterExpression>::jsonpath ORDER BY distance LIMIT ?
It selects all the columns in the table and adds a column with the calculated distance. The results are ordered by this distance column, and you can also specify a similarity threshold and a native filter expression using Postgres' jsonpath functionality.
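If you want to use the threshold and filtering capabilities from the Java side, the SearchRequest can be tuned as well. A sketch (the withXxx methods match the Spring AI version I used, so check your version's API; the metadata key in the filter is just for illustration):

// Sketch: cap the result set, require a minimum similarity and filter on metadata.
List<Document> results = vectorStore.similaritySearch(
        SearchRequest.query(saying)
                .withTopK(3)
                .withSimilarityThreshold(0.7)
                .withFilterExpression("saying == 'A bad excuse is better than none.'"));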
One thing to note is that if you write the query yourself and run it without letting Spring AI create it for you, you can customise it by supplying different values for the ef_search parameter (default: 40, min: 1, max: 1000), like so:
SET hnsw.ef_search = 42;
With it, you can influence the number of neighbours that are considered during the search. The more that are checked, the better the recall, but at the expense of performance.
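A sketch of what that manual route could look like from Java, assuming a JdbcTemplate; note that SET applies per connection, so with a connection pool keep both statements in the same transaction and use SET LOCAL (the embeddingLiteral parameter, a pgvector text literal like '[0.1,0.2,...]', is an assumption of this sketch):

// Sketch: widen the candidate list for this search only, then run the query on the same connection.
@Transactional
List<String> searchWithWiderCandidateList(String embeddingLiteral) {
    jdbcTemplate.execute("SET LOCAL hnsw.ef_search = 100");
    return jdbcTemplate.queryForList("""
            SELECT content
            FROM vector_store
            ORDER BY embedding <=> ?::vector
            LIMIT 5
            """, String.class, embeddingLiteral);
}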
Now that we know how to perform similarity searches to retrieve semantically close data, we can also make a short incursion into how Postgres uses memory (shared buffers) when performing these retrievals.
How much of the shared buffers got filled up?
Let's now increase the number of essays we're working with to 100, and have a look at what's in the Postgres shared buffers after we run the program. We'll use the pg_buffercache extension that I mentioned before, which was installed in the init script.
But first, let's start with looking at the size of the table and index, just to get some perspective.
postgres=# \dt+ vector_store;
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+--------------+-------+----------+-------------+---------------+--------+-------------
public | vector_store | table | postgres | permanent | heap | 584 kB |
postgres=# \di+ spring_ai_vector_index;
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+------------------------+-------+----------+--------------+-------------+---------------+--------+-------------
public | spring_ai_vector_index | index | postgres | vector_store | permanent | hnsw | 408 kB |
(1 row)
Okay, so the table is 584 kB and the index is 408 kB. The index gets pretty big, close to the size of the table itself. We don't mind that much at such a small scale, but if we assume this proportion will be maintained at large scale too, we will have to take it more seriously.
To contrast with how other indexes behave, I checked a table we have at work that amounts to 40 GB. The corresponding B-tree primary key index is 10 GB, while other indexes of the same type for other columns are just 3 GB.
I'm using the following query to get an overview of what's in the shared buffers:
select
c.relname,
count(*) as buffers
from
pg_buffercache b
inner join
pg_class c on b.relfilenode = pg_relation_filenode(c.oid) and
b.reldatabase in (0, (select
oid
from
pg_database
where
datname = current_database()
)
)
group by
c.relname
order by
2 desc
limit
10;
relname | buffers
--------------------------------+---------
pg_proc | 61
pg_toast_16630 | 53
spring_ai_vector_index | 51
pg_attribute | 35
pg_proc_proname_args_nsp_index | 30
pg_depend | 23
pg_operator | 19
pg_statistic | 19
pg_class | 18
vector_store | 18
(10 rows)
We see that the entire index is in there. We can deduce this because the size of the index is 408 kB, as we saw before, and if we divide that by 8 kB, which is the size of a Postgres page, we get exactly the 51 buffers we see in the table above (third row).
We can draw a conclusion from this: working with vectors in Postgres is going to be pretty demanding in terms of memory. As a reference, vectors with 1536 dimensions (probably the most common case) occupy about 6 kB each. One million of them already gets us to 6 GB. If we have other workloads next to the vectors, they might be affected, in the sense that we start seeing cache evictions because there are no free buffers. This means we might even need to consider separating the vectors from the rest of our data, in separate databases, in order to isolate the workloads in case we notice performance going downhill.
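The 6 kB figure is simple arithmetic; here's a back-of-the-envelope sketch counting only the raw vector payload (4-byte floats plus a small pgvector header), ignoring page and TOAST overhead:

// Rough estimate only: raw vector payload, not counting Postgres page/TOAST overhead.
int dimensions = 1536;
long bytesPerVector = 4L * dimensions + 8;                 // 4-byte floats + ~8 bytes of header
System.out.println(bytesPerVector);                        // 6152 bytes, ~6 kB per vector
System.out.println(1_000_000L * bytesPerVector / 1e9);     // ~6.15, i.e. roughly 6 GB per million vectors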
The @ParameterizedTest JUnit annotation
Alright, one last remark I want to make about this program is that it's set up so that you can experiment with other open-source LLMs. The entrypoint method I'm using to run the workflow is a JUnit parameterized test, where the arguments for each run can be the names of other LLM models distributed with ollama. This is how you set it up to run multiple times with a different LLM for every execution:
@ParameterizedTest
@ValueSource(strings = {"llama3", "llama2", "gemma", "mistral"})
void rag_workflow(String model) {
...
}
Outputs
Finally, it's time to review how well the LLM managed to guess the sayings. With no other help except the initial essays provided in the prompt, it managed to guess the saying perfectly a grand total of... once!
Saying | LLM Guess |
---|---|
Your most powerful moments are born from the ashes of your greatest fears. | What doesn't kill you... |
Every sunrise holds the promise of a new masterpiece. | What lies within is far more important than what lies without. |
Every step forward is a declaration of your willingness to grow. | Any Step Forward |
Your most beautiful moments are waiting just beyond your comfort zone. | What lies within... |
Light reveals itself in the darkness it creates. | The darkness is not the absence of light but the presence of a different kind |
Courage is not the absence of fear, but the willingness to take the next step anyway. | Be brave. |
Small sparks can ignite entire galaxies. | Small sparks can ignite entire galaxies. |
Believe in yourself, take the leap and watch the universe conspire to make your dreams come true. | Take the leap |
Life begins at the edge of what you're willing to let go. | Take the leap. |
Some responses are quite amusing, like when it tries to be "mysterious" or more conversational by not completing the sentence and ending it in three dots ("What doesn't kill you..."), and the ones where it reaches for extreme succinctness ("Take the leap.", "Be brave.").
Let's give it some help now. In the prompt, this time I'll provide all the sayings it initially generated as a list of options to pick from. Will it manage to pick the correct one from the bunch this way?
It turns out that, indeed, when I gave it options to pick from, it picked the right one every time. Quite the difference between with and without RAG!
Conclusion
Spring AI is a well-designed application framework that helps you achieve a lot with little code. You can see it as the "linchpin" that helps you set up and easily evolve your AI use-cases, whether in stand-alone new applications or integrated with your existing Spring ones. It already has many integrations with specialised AI services, and the list keeps growing constantly.
The open-source LLMs I tried have not risen to the occasion, and haven't passed my "challenge" to guess the initial sayings they themselves generated from their (also own) essays. They don't seem ready to perform well for use-cases that require this kind of precise and correct "synthesised" answers, but I will keep trying new models as they are made available.
However, they are still useful if you know what to expect from them - they are very good at storing and retrieving many loosely connected facts, a clear value-add when you need to brainstorm, for example.
When I gave it the options, the difference was like night and day compared to when I didn't: it picked the right answer every time, flawlessly.
We also looked at how embeddings are stored internally with the pgvector extension, and how "memory-hungry" this is. We should account for this and make some arrangements at the beginning of the project in order to have smooth operation when the scale grows.
Next steps
While writing this post, pgvectorscale got released. It's an interesting new project that makes pgvector more performant and cost-effective. I'd like to compare how fast it is against plain pgvector for the sayings-and-essays use-case I presented above. It should be easy to switch between the two - everything stays the same, we just initialise the database a bit differently in the beginning. I have made a separate branch in the same repo where, instead of pgvector as in the sections above, I'm now trying to start up pgvectorscale:
@Bean
@ServiceConnection
PostgreSQLContainer<?> postgreSQLContainer() {
    return new PostgreSQLContainer<>(DockerImageName
            .parse("timescale/timescaledb-ha:pg16-all")
            .asCompatibleSubstituteFor("postgres"))
            .withUsername(POSTGRES)
            .withPassword(POSTGRES)
            .withDatabaseName(POSTGRES)
            .withInitScript("init-script.sql");
}
In the init script, I've added the following, per the instructions:
create extension if not exists vectorscale CASCADE;
To be continued!
Thanks for reading!
Top comments (2)
This post was incredibly detailed and informative! For someone relatively new to Spring AI and pgvector, it really helps to see the step-by-step breakdown. Could you possibly delve into potential pitfalls or common mistakes when setting up a similar configuration?
Thanks a lot @fred_functional! Happy you enjoyed it. After you set things up and it works, the main thing is to have good monitoring in place to know how well your database is "coping" with the vector-based workload; as I have shown in the article above, it is quite demanding memory-wise. For example, ideally, in PostgreSQL you should have a 99% cache hit ratio. I'm planning to write an article on PostgreSQL monitoring soon which will shed some light on this topic.
A good principle to follow in general is to not let the data out of the database, as much as possible. Check what database features or extensions you can leverage for example for updating the vectors when the underlying data changes.
In this blog I've used free open-source LLMs, but for paid proprietary LLM endpoints, like with Amazon Bedrock, you also have the cost factor to balance so that you don't get a surprise bill. Also, keep in mind they sometimes throttle requests to reduce load on their servers.
It's also good to know that getting an answer from an LLM currently takes a good couple of seconds. You therefore might want to consider precomputing some data "offline", so that when the user request comes in, you don't ask the live LLM endpoint but consult your local precomputed "cache", which is faster. This means you might be paying for more than you strictly need, because users might not "access" everything you precomputed. It's helpful to look at the providers' cost model, as some offer discounts for bulk operations.
Let me know if this helps and if you have other questions, happy to help and good luck 👍