
Mircea Cadariu


Spring AI, Llama 3 and pgvector: bRAGging rights!

In the beginning, Python reigned supreme in terms of tooling for AI development. Recently, however, the Spring community came up with an answer of its own: Spring AI! This means that if you're a Spring developer with working knowledge of concepts such as beans, auto-configurations and starters, you're covered: you can write your AI apps following the standard patterns you're already familiar with.

In this post, I want to share an exploration that started with the goal of taking Spring AI for a little spin and trying out the capabilities of open-source LLMs (large language models). I got curious along the way, and ended up also looking at some pretty low-level details about data storage for AI applications.

To support the exploration, I've developed a simple Retrieval Augmented Generation (RAG) workflow that runs completely locally on the laptop, for free. If you're interested, you can find the code here. Basically, I've used Testcontainers to set up a Postgres database container with the pgvector extension for storing text embeddings, and an open-source LLM to send requests to: Meta's llama3 (8B), served through ollama.

Alright, let's have a look at what I'm building with this setup.

Fun with sayings and essays

The RAG workflow works like this. I first ask the LLM to generate a couple of sayings; for example, this is one of the ones I got back:

A bad excuse is better than none.

Then, I ask it to "elaborate" and generate an essay (of under 200 words) about the saying it provided earlier. Here is one such essay generated by llama3 for the saying above:

The saying suggests that while an inadequate justification may not be satisfactory, it's still preferable to having no explanation at all. This phrase highlights the importance of accountability and acknowledging responsibility for one's actions or decisions.

In essence, a bad excuse implies that some effort has been made to justify or rationalize one's behavior, even if the reasoning is flawed or unconvincing. On the other hand, not offering any excuse whatsoever can come across as evasive or dishonest, leading to mistrust and resentment.

This saying encourages people to take ownership of their mistakes and apologize for any wrongdoing, rather than remaining silent and avoiding accountability. It also acknowledges that everyone makes errors and that a bad excuse is better than none because it shows a willingness to acknowledge and learn from those mistakes.

Ultimately, the saying promotes honesty, responsibility, and personal growth by emphasizing the value of taking ownership of one's actions, no matter how imperfect the explanation may be.

I then take these essays and create embeddings from them, which I store in Postgres using the pgvector extension, in columns of the vector data type. All of this happens with the help of Spring AI abstractions and a minimal amount of custom code.

I will skip the part of this process called "chunking". When you are dealing with very large documents, or want to isolate sections in your data (like in e-mails, where you have a subject, sender, and so on), you might look into doing that.

So far so good. At this point, we have stored the data we need in the next steps.

I will then take each saying and do a similarity search on the embeddings to retrieve the corresponding essay. Next, I will supply the retrieved essays back to the LLM and ask it to guess the original saying from which each essay was generated. Finally, I will check how many it got right.

What do you think: will it manage to correctly guess the saying from just the essay? After all, it generated the essays from those sayings itself in the first place. A human would have no problem doing this.

But let's first have a look at how the program is set up from a technical perspective. We will look at the results and find out how capable the LLM is a bit later.

The LLM and the vector store in Testcontainers

Testcontainers makes it very easy to integrate services that each play a specific role for use-cases like this. All that is required to set up the database and the LLM is the couple of lines below, and you're good to go!

@TestConfiguration(proxyBeanMethods = false)
class RagDemoApplicationConfiguration {

    private static final String POSTGRES = "postgres";

    @Bean
    @ServiceConnection
    PostgreSQLContainer<?> postgreSQLContainer() {
        // Postgres image with the pgvector extension pre-installed
        return new PostgreSQLContainer<>("pgvector/pgvector:pg16")
                .withUsername(POSTGRES)
                .withPassword(POSTGRES)
                .withDatabaseName(POSTGRES)
                .withInitScript("init-script.sql");
    }

    @Bean
    @ServiceConnection
    OllamaContainer ollamaContainer() {
        return new OllamaContainer("ollama/ollama:latest");
    }
}

I've used the @ServiceConnection annotation, which allows me to type less configuration code. I can do this for the ollama container too, and only since recently, thanks to this contribution from Eddú Meléndez.

You might have noticed there's an init script there. It's only a single line, and its purpose is to install a Postgres extension called pg_buffercache, which lets me inspect the contents of the Postgres shared buffers in RAM. I'm interested in having a look at this in order to better understand the operational characteristics of working with vectors. In other words, what are the memory demands?

create extension pg_buffercache;

Now, to fully initialise our LLM container such that it's ready to actually handle our requests for our sayings and essays, we need to pull the models we want to work with, like so:

ollama.execInContainer("ollama", "pull", "llama3");
ollama.execInContainer("ollama", "pull", "nomic-embed-text");

If you rerun the program, you will see that it pulls the models again. Have a look at this repo and consider using baked images that already contain the models, as sketched below.
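
As a sketch of that approach, you can pull the models once and commit the running container to a local image that you reuse on subsequent runs. This assumes the commitToImage method from the Testcontainers Ollama module, and the image name here is made up:

// Pull the models once, then commit the running container to a local image (name is illustrative).
ollama.execInContainer("ollama", "pull", "llama3");
ollama.execInContainer("ollama", "pull", "nomic-embed-text");
ollama.commitToImage("ollama-llama3-nomic");

// On later runs, start from the committed image so nothing needs to be pulled again:
OllamaContainer baked = new OllamaContainer(
        DockerImageName.parse("ollama-llama3-nomic")
                .asCompatibleSubstituteFor("ollama/ollama"));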

You will notice that besides llama3, which I mentioned before and which will take care of generating text, I am also pulling a so-called embedding model: nomic-embed-text. This is what converts text into embeddings so that they can be stored.

The ones I'm using are not the only options. New LLM bindings and embedding models are added all the time in both Spring AI and ollama, so refer to the docs for the up-to-date list, as well as the ollama website.

Configuration properties

Let's have a look at the vector store configuration. Here's how that looks:

@DynamicPropertySource
static void pgVectorProperties(DynamicPropertyRegistry registry) {
  registry.add("spring.ai.vectorstore.pgvector.index-type", () -> "HNSW");
  registry.add("spring.ai.vectorstore.pgvector.distance-type", () -> "COSINE_DISTANCE");     
  registry.add("spring.ai.vectorstore.pgvector.dimensions", () -> 768);
}

The first one is called index-type, which means we are creating an index in our vector store. We don't necessarily always need an index - it's a trade-off. With regular indexing, the idea is that we gain speed (and other things, like uniqueness constraints) at the expense of storage space. When indexing vectors, however, the trade-off also includes relevance. Without an index, the similarity search performs exact kNN (k-nearest neighbours), checking all vectors in the table. With an index, it performs ANN (approximate nearest neighbours), which is faster but might miss some results. Indeed, it's quite the balancing act.

Let's have a look at the other configuration options for indexing, which I extracted from the Spring AI code:

NONE,
IVFFLAT,
HNSW;

In the beginning, there was only one indexing option in pgvector, namely ivfflat. More recently, HNSW (Hierarchical Navigable Small World) was added; it is based on different construction principles, is more performant, and keeps getting better. The general recommendation nowadays is to go for HNSW.

The next configuration option is the distance-type, which is the function used to compare vectors in order to determine similarity. Here are our options:

 EUCLIDEAN_DISTANCE,
 NEGATIVE_INNER_PRODUCT,
 COSINE_DISTANCE;

I'll go with cosine distance, but it's helpful to have a look at their properties, because the choice can make a difference for your use-case.

The last configuration property is called dimensions, which represents the number of components (float values) each embedding consists of. This number has to match the number of dimensions produced by the embedding model. In our example, nomic-embed-text produces 768, but other models produce more, or fewer. If the model returns embeddings with more dimensions than the table is set up for, inserts won't work. Now you might wonder: should you strive for as many dimensions as possible? Apparently not; this blog from Supabase shows that fewer dimensions can perform just as well or better.
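
If you want to guard against such a mismatch, a small sanity check in a test works well. This is just a sketch: it assumes an injected embeddingModel bean and relies on Spring AI's EmbeddingModel exposing a dimensions() method.

// The model's output dimension must match spring.ai.vectorstore.pgvector.dimensions,
// otherwise inserting into the vector(768) column will fail.
int modelDimensions = embeddingModel.dimensions();
assertThat(modelDimensions).isEqualTo(768);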

Under the hood - what's created in Postgres?

Let's explore what Spring AI has created for us in Postgres with this configuration. In a production application, however, you might want to take full control and drive the schema through SQL files managed by migration tools such as Flyway. We didn't do this here for simplicity.

Firstly, we find it created a table called vector_store with the following structure:

postgres=# \d vector_store;
                     Table "public.vector_store"
  Column   |    Type     | Collation | Nullable |      Default       
-----------+-------------+-----------+----------+--------------------
 id        | uuid        |           | not null | uuid_generate_v4()
 content   | text        |           |          | 
 metadata  | json        |           |          | 
 embedding | vector(768) |           |          | 
Indexes:
    "vector_store_pkey" PRIMARY KEY, btree (id)
    "spring_ai_vector_index" hnsw (embedding vector_cosine_ops)

Nothing surprising here; it's in line with the configuration we saw in the Java code earlier. For example, we notice the embedding column of type vector with 768 dimensions. We also notice the index, spring_ai_vector_index, with the vector_cosine_ops operator class, which we expected given the "distance-type" setting earlier. The other index, vector_store_pkey, is created automatically by Postgres; it creates one for every primary key.

The command that Spring AI used to create our index is the following:

CREATE INDEX IF NOT EXISTS %s ON %s USING %s (embedding %s)

This creates an index with the default configuration. It might be good to know that you have a couple of options you can tweak if you'd like to tune the index for potentially better results (depending on your use-case):

  • m - the max number of connections per layer
  • ef_construction - the size of the dynamic candidate list for constructing the graph

These are the bounds within which you can pick values for these settings:

 Setting         | Default | Min | Max
-----------------+---------+-----+------
 m               |      16 |   2 |  100
 ef_construction |      64 |   4 | 1000

In order to understand the internals of this index and what effect changing these options might have, here is a link to the original paper. See also this post by J. Katz, in which he presents results from experimenting with various combinations of these settings.

When you know what values you want to set for these settings you can create the index like so:

CREATE INDEX ON vector_store
USING hnsw (embedding vector_cosine_ops)
WITH (m = 42, ef_construction = 42);

In case you get an error when constructing the index, it's worth checking whether Postgres has enough memory to perform this operation. You can adjust the memory available for index builds through the maintenance_work_mem setting.
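
For example, here's a minimal sketch of bumping it just for the session that builds the index, using a hypothetical JdbcTemplate (running the same statements in psql works just as well):

// Raise the memory available for index builds in this session only, then rebuild the index.
jdbcTemplate.execute("SET maintenance_work_mem = '1GB'");
jdbcTemplate.execute("REINDEX INDEX spring_ai_vector_index");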

Let's now check how our embedding column is actually stored on disk. The following query shows the storage strategy Postgres uses for it:

postgres=# select
             att.attname,
             case att.attstorage
               when 'p' then 'plain'
               when 'm' then 'main'
               when 'e' then 'external'
               when 'x' then 'extended'
             end as attstorage
           from
             pg_attribute att
           join
             pg_class tbl on tbl.oid = att.attrelid
           join
             pg_namespace ns on tbl.relnamespace = ns.oid
           where
             tbl.relname = 'vector_store' and
             ns.nspname = 'public' and
             att.attname = 'embedding';

Result:

-[ RECORD 1 ]---------
attname    | embedding
attstorage | external

Alright, so it uses the external storage type. This means the column is stored in a separate, so-called TOAST table. Postgres does this when column values are so large that it can't fit at least 4 rows in a page. Interestingly, it will not attempt to also compress the values to shrink them even more; if compression were enabled, the result above would have said extended instead of external.

Normally, when you update one or more columns of a row, Postgres will, instead of overwriting, make a copy of the entire row (it's an MVCC database). But if there are any large TOASTed columns, an update copies only the other columns and keeps pointing to the existing TOASTed value; the TOASTed column is copied only when it is itself updated. This makes updates more efficient by minimising the copying around of large values.

Where are these separate tables, though? We haven't created them ourselves; they are managed by Postgres. Let's locate our TOAST table using this query:

postgres=# select
             relname,
             oid
           from
             pg_class,
             (select
                reltoastrelid
              from
                pg_class
              where
                relname = 'vector_store') as vector_store
           where
             oid = vector_store.reltoastrelid or
             oid = (select
                      indexrelid
                    from
                      pg_index
                    where
                      indrelid = vector_store.reltoastrelid);
       relname        |  oid  
----------------------+-------
 pg_toast_16630       | 16634
 pg_toast_16630_index | 16635

So far so good. We now have the TOAST table name and ID. Let's use it to have a look at the structure of the TOAST table; for example, what columns does it have? Note that these tables live in the pg_toast schema, so to get there we have to set the search_path accordingly, like below:

postgres=# set search_path to pg_toast;
SET
postgres=# \d pg_toast_16630;
TOAST table "pg_toast.pg_toast_16630"
   Column   |  Type   
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Owning table: "public.vector_store"
Indexes:
    "pg_toast_16630_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

We can learn a couple of things from this. As expected, the large values from the main table that have to be "TOASTed" are chunked (split up), each chunk is identified by a sequence number, and chunks are always retrieved using an index.

Postgres has a mechanism to avoid "blasting" the entire shared buffer cache when it needs to do large reads, like sequential scans of a large table. In those cases it uses a 32-page ring buffer, so that it doesn't evict other data from the cache. However, this mechanism does not kick in for TOAST tables, so vector-based workloads run without this form of protection.

Okay! We had a very good look at the database part. Let's now "resurface" for a moment and have a look at other topics pertaining to the high level workflow of interacting with the LLM.

Template-based prompts

Initially, I constructed the prompts for the LLM requests in the same class where I was using them. However, I found a different approach in the Spring AI repository itself and adopted it, because it's indeed cleaner: the prompts live in externalised resource files, like so:

@Value("classpath:/generate-essay.st")
protected Resource generateEssay;

@Value("classpath:/generate-saying.st")
protected Resource generateSaying;

@Value("classpath:/guess-saying.st")
protected Resource guessSaying;

This is what one of them looks like inside:

Write a short essay under 200 words explaining the 
meaning of the following saying: {saying}.

As you can see, I have not applied any sophisticated prompt engineering whatsoever, and kept it simple and direct for now.
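
For completeness, here is roughly what a helper that turns one of these resources into a prompt can look like, using Spring AI's PromptTemplate (a sketch; the variable names are mine):

// Load the template from the classpath resource and fill in the {saying} placeholder.
PromptTemplate template = new PromptTemplate(generateEssay);
Prompt prompt = template.create(Map.of("saying", saying));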

Calling the LLM

Alright, the pieces are starting to fit together! The next thing I'd like to show you is how to call the LLM.

chatModel
 .withModel(model)
 .call(createPromptFrom(promptTemplate, promptTemplateValues))
 .getResult()
 .getOutput()
 .getContent();

I am using the so-called Chat Model API, a powerful abstraction over AI models. This design allows us to switch between models with minimal code changes: if you want to work with a different model, you just change the runtime configuration. It's a nice example of the Dependency Inversion Principle, where higher-level modules do not depend on low-level modules; both depend on abstractions.
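
To illustrate the point, here is a hedged sketch of what switching models can look like when only the runtime options change; it assumes the Ollama chat model and its OllamaOptions class, and the model name is just an example:

// Same calling code, different model: only the options change.
OllamaOptions options = OllamaOptions.create().withModel("mistral");
ChatResponse response = chatModel.call(new Prompt("Give me a saying.", options));
String text = response.getResult().getOutput().getContent();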

Storing the embeddings

To store the embeddings, I must say that I found it a pretty complicated procedure:

vectorStore.add(documents);

Just kidding, that's it!

This single command does several things. It first converts the documents (our essays) to embeddings with the help of the embedding model, then runs the following batched insert statement to get the embeddings into our vector_store table:

INSERT INTO vector_store (id, content, metadata, embedding) VALUES (?, ?, ?::jsonb, ?) ON CONFLICT (id) DO UPDATE SET content = ? , metadata = ?::jsonb , embedding = ?

We can see it is actually an upsert: if a row with that ID is already present in the database, the ON CONFLICT part of the query updates its content, metadata and embedding instead of inserting a duplicate.
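
For context, here is a sketch of how the documents passed to vectorStore.add(...) might be put together; the metadata key is made up purely for illustration:

// Wrap each generated essay in a Spring AI Document; metadata is optional.
List<Document> documents = essays.stream()
        .map(essay -> new Document(essay, Map.of("model", model)))
        .toList();

vectorStore.add(documents);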

Similarity searches

To do a similarity search on the stored vectors with Spring AI, it's just a matter of:

vectorStore
.similaritySearch(SearchRequest.query(saying))
.getFirst()
.getContent();

Again, Spring AI does a couple of things for you. It takes the parameter you supply (the saying, in our case), first creates its embedding using the embedding model we talked about before, and then uses that embedding to retrieve the most similar results, from which we pick only the first one.

With this configuration (cosine similarity), the SQL query that it will run for you is the following:

SELECT *, embedding <=> ? AS distance FROM vector_store WHERE embedding <=> ? < ? AND metadata::jsonb @@ <nativeFilterExpression>::jsonpath ORDER BY distance LIMIT ?

It selects all the columns in the table and adds a column with the calculated distance. The results are ordered by this distance column, and you can also specify a similarity threshold and a native filter expression using Postgres' jsonpath functionality.
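
These knobs are exposed on the SearchRequest itself. Here is a small sketch with arbitrary values, assuming the withTopK and withSimilarityThreshold methods of Spring AI's SearchRequest:

// Return the 3 closest essays, but only if they are similar enough (values are arbitrary).
List<Document> results = vectorStore.similaritySearch(
        SearchRequest.query(saying)
                .withTopK(3)
                .withSimilarityThreshold(0.5));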

One thing to note is that if you write the query yourself, without letting Spring AI create it for you, you can customise the search by supplying different values for the ef_search parameter (default: 40, min: 1, max: 1000), like so:

SET hnsw.ef_search = 42;

With it, you can influence the number of candidate neighbours considered during the search. The more that are checked, the better the recall, but at the expense of performance.
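
A rough sketch of what that could look like from Java, assuming a JdbcTemplate and that you have already rendered the query embedding as a pgvector literal such as "[0.12,-0.03,...]"; this is not the query Spring AI generates:

// Increase the candidate list size for this session, trading speed for recall.
jdbcTemplate.execute("SET hnsw.ef_search = 100");

// queryVectorLiteral is assumed to hold the embedding of the saying, formatted as a pgvector literal.
String essay = jdbcTemplate.queryForObject(
        "SELECT content FROM vector_store ORDER BY embedding <=> ?::vector LIMIT 1",
        String.class,
        queryVectorLiteral);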

Now that we know how to perform similarity searches to retrieve semantically close data, we can make a short incursion into how Postgres uses memory (shared buffers) when performing these retrievals.

How much of the shared buffers got filled up?

Let's now increase the number of essays we're working with to 100, and have a look at what's in the Postgres shared buffers after we run the program. We'll use the pg_buffercache extension that I mentioned before, which was installed in the init script.

But first, let's look at the size of the table and the index, just to get some perspective.

postgres=# \dt+ vector_store;
                                       List of relations
 Schema |     Name     | Type  |  Owner   | Persistence | Access method |  Size  | Description 
--------+--------------+-------+----------+-------------+---------------+--------+-------------
 public | vector_store | table | postgres | permanent   | heap          | 584 kB | 
postgres=# \di+ spring_ai_vector_index;
                                                   List of relations
 Schema |          Name          | Type  |  Owner   |    Table     | Persistence | Access method |  Size  | Description 
--------+------------------------+-------+----------+--------------+-------------+---------------+--------+-------------
 public | spring_ai_vector_index | index | postgres | vector_store | permanent   | hnsw          | 408 kB | 
(1 row)

Okay, so the table is 584 kB and the index is 408 kB. The index gets pretty big, close to the same size as the table itself. We don't mind much at such a small scale, but if this proportion holds at larger scale too, we will have to take it more seriously.

To contrast this with how other indexes behave, I checked a table we have at work that amounts to 40 GB. The corresponding B-tree primary key index is 10 GB, while other indexes of the same type on other columns are just 3 GB.

I'm using the following query to get an overview of what's in the shared buffers:

select 
  c.relname, 
  count(*) as buffers
from 
  pg_buffercache b 
inner join 
  pg_class c on b.relfilenode = pg_relation_filenode(c.oid) and
                b.reldatabase in (0, (select 
                                        oid 
                                      from 
                                        pg_database
                                      where 
                                        datname = current_database()
                      )
                  )
group by 
  c.relname
order by 
  2 desc
limit 
   10;
            relname             | buffers 
--------------------------------+---------
 pg_proc                        |      61
 pg_toast_16630                 |      53
 spring_ai_vector_index         |      51
 pg_attribute                   |      35
 pg_proc_proname_args_nsp_index |      30
 pg_depend                      |      23
 pg_operator                    |      19
 pg_statistic                   |      19
 pg_class                       |      18
 vector_store                   |      18
(10 rows)

We see that the index is in there in its entirety. We can deduce this because the index is 408 kB, as we saw before, and dividing that by 8 kB, the size of a Postgres page, gives exactly the 51 buffers we see in the third row above.

We can draw a conclusion from this: working with vectors in Postgres is going to be pretty demanding in terms of memory. For reference, a vector with 1536 dimensions (probably the most common case) occupies about 6 kB (1536 components × 4 bytes per float, plus a small header). One million of them already gets us to roughly 6 GB. If we have other workloads next to the vectors, they might be affected, in the sense that we start seeing cache evictions because there are no free buffers. We might even need to consider moving the vectors into a separate database in order to isolate the workloads, if we notice performance going downhill.

The @ParameterizedTest JUnit annotation

Alright, a last remark about this program: it's set up so you can experiment with other open-source LLMs. The entrypoint method I'm using to run the workflow is a JUnit parameterized test, where the argument for each run is the name of an LLM model distributed with ollama. This is how you set it up to run multiple times with a different LLM for every execution:

@ParameterizedTest
@ValueSource(strings = {"llama3", "llama2", "gemma", "mistral"})
void rag_workflow(String model) {
...
}

Outputs

Finally, it's time to review how well the LLM managed to guess the sayings. With no help other than the essay provided in the prompt, it guessed the saying perfectly a grand total of... once!

 Saying                                                                                             | LLM Guess
-----------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------
 Your most powerful moments are born from the ashes of your greatest fears.                          | What doesn't kill you...
 Every sunrise holds the promise of a new masterpiece.                                               | What lies within is far more important than what lies without.
 Every step forward is a declaration of your willingness to grow.                                    | Any Step Forward
 Your most beautiful moments are waiting just beyond your comfort zone.                              | What lies within...
 Light reveals itself in the darkness it creates.                                                    | The darkness is not the absence of light but the presence of a different kind
 Courage is not the absence of fear, but the willingness to take the next step anyway.               | Be brave.
 Small sparks can ignite entire galaxies.                                                            | Small sparks can ignite entire galaxies.
 Believe in yourself, take the leap and watch the universe conspire to make your dreams come true.   | Take the leap
 Life begins at the edge of what you're willing to let go.                                           | Take the leap.

Some responses are quite amusing, like when it tries to be "mysterious" or more conversational by not completing the sentence and just ending it in three dots ("What doesn't kill you..."), and the ones where it reaches for extreme succinctness ("Take the leap.", "Be brave.").

Let's give it some help now. In the prompt, this time I'll provide all the sayings it initially generated as a list of options to pick from. Will it manage to pick the correct one from the bunch this way?

It turns out that, given the options to pick from, it picked the right one every time. Quite the difference compared to guessing without them!

Conclusion

Spring AI is a well-designed application framework that helps you achieve a lot with little code. You can see it as the "linchpin" that helps you set up and easily evolve your AI use-cases, whether in stand-alone new applications or integrated with your existing Spring ones. It already has many integrations with specialised AI services, and the list keeps growing.

The open-source LLMs I tried have not risen to the occasion, and haven't passed my "challenge" to guess the initial sayings from the essays they themselves generated. They don't seem ready for use-cases that require this kind of precise, correctly "synthesised" answer, but I will keep trying new models as they become available.

However, they are still useful if you know what to expect from them: they are very good at recalling many loosely connected facts, a clear value-add when you need to brainstorm, for example.

When I gave it the options, the difference was like night and day: it picked the right answer every time, flawlessly.

We also looked at how embeddings are stored internally with the pgvector extension, and how "memory-hungry" this is; we should account for it at the beginning of a project in order to keep things running smoothly as the scale grows.

Next steps

While I was writing this post, pgvectorscale got released. It's an interesting new project that makes pgvector more performant and cost-effective. I'd like to compare how it performs against plain pgvector for the sayings-and-essays use-case presented above. It should be easy to switch between the two: everything stays the same, you just initialise the database a bit differently in the beginning.

I have made a separate branch in the same repo where, instead of the pgvector image used in the sections above, I'm starting up pgvectorscale:

@Bean
@ServiceConnection
PostgreSQLContainer<?> postgreSQLContainer() {
   return new PostgreSQLContainer<>(DockerImageName
                .parse("timescale/timescaledb-ha:pg16-all")
                .asCompatibleSubstituteFor("postgres")) 
                .withUsername(POSTGRES)
                .withPassword(POSTGRES)
                .withDatabaseName(POSTGRES)
                .withInitScript("init-script.sql");
}

In the init script, I've added the following, per the instructions:

create extension if not exists vectorscale CASCADE;

To be continued!

Thanks for reading!

Top comments (2)

Fred Functional

This post was incredibly detailed and informative! For someone relatively new to Spring AI and pgvector, it really helps to see the step-by-step breakdown. Could you possibly delve into potential pitfalls or common mistakes when setting up a similar configuration?

Mircea Cadariu • Edited

Thanks a lot @fred_functional! Happy you enjoyed it. After you set things up and it works, the main thing is to have good monitoring in place to know how well your database is "coping" with the vector-based workload; as I've shown in the article above, it is quite demanding memory-wise. For example, ideally you should have about a 99% cache hit ratio in PostgreSQL. I'm planning to write an article on PostgreSQL monitoring soon, which will shed some light on this topic.

A good principle in general is to not let the data out of the database as much as possible. Check what database features or extensions you can leverage, for example for updating the vectors when the underlying data changes.

In this blog I've used free open-source LLMs, but for paid proprietary LLM endpoints, like those behind Amazon Bedrock, you also have the cost factor to balance, so you don't get a surprise bill. Also, keep in mind that they sometimes throttle requests to reduce load on their servers.

It is also good to know that getting an answer from an LLM currently takes a good couple of seconds. You might therefore consider precomputing some data "offline", so that when a user request comes in you don't ask the live LLM endpoint but consult your local precomputed "cache", which is faster. This means you might be paying for more than you strictly need, because users might never access what you precomputed; it's helpful to look at the providers' cost models, as some offer discounts for bulk operations.

Let me know if this helps and if you have other questions, happy to help and good luck 👍