For almost 6 years, I’ve helped Slack app developers build their Slack apps, and I’ve loved answering their technical questions day in and day out.
But with a new chapter on the horizon (and the inevitable departure from my current role), I wanted to create something that could keep the conversation going.
That's why I built an AI service that answers Slack platform-specific questions on my behalf. You can try it out at what-would-kaz-say.fly.dev.
If you are a current or former member of the technical staff at Slack, you might know why I named it this way. I hope Cal will like it, but ... what would he say?
This project is a RAG (Retrieval-Augmented Generation)-based AI service that combines generative AI with document and Q&A data retrieval related to the Slack platform to ensure answers are both accurate and context-aware. It was my first in-depth exploration into building RAG with a substantial dataset, and the journey has been full of learning and joy.
In this article, I will share the details of what I've built and how it works.
Key System Components
Here is the list of the key components of the system:
- Streamlit powers the web interface, giving the app a dynamic, user-friendly feel.
- Chroma is used for semantic search.
- SQLite3 with the BM25 ranking algorithm handles keyword queries.
- OpenAI API (chat.completions, moderations) drives the generation of responses.
- GitHub API provides publicly available quality content for Q&A.
- Docker ensures a consistent environment for the app.
- Fly.io manages deployment and scale adjustments.
- Langfuse monitors and analyzes the LLM performance.
- Sentry monitors and logs errors for continuous improvement.
Every component has been chosen to balance quality, performance, cost efficiency, and ease of development.
I initially considered using cloud services for the datastore, but I found they could cost more than expected for this hobby project, and I didn't need most of their features this time. Additionally, I wanted to quickly build something that just works while learning the fundamentals of RAG engineering with the simplest tools. For these reasons, I decided to build it this way.
Now, let me walk you through the key features.
Keeping the AI Grounded
To reduce the chance of hallucinations as much as possible, the system prompt explicitly instructs the AI to base its answers solely on retrieved documents and the session’s past messages. For example, the system prompt includes:
Your answer must consider only the `# Context` section data and the past messages in the same session. Please be particularly precise when discussing code related to SDKs. You must not respond with any speculations like suggesting any classes and methods that are not mentioned in the `# Context` section.
Then, the user’s question is formatted like this:
user_message = f"""
# Question
{unified_question} You must answer this question in {detected_language}
# Context
This prompt provides {num_of_context_documents} documents as context for this question:
{context}
"""
You should be interested in `unified_question`, `detected_language`, and `context` in the code above (a sketch of the preprocessing steps that produce the first two follows this list). They are:
- `unified_question`: I will explain this in detail in the next section, but it is a single unified question that considers previous questions in the same session.
- `detected_language`: As a non-native English speaker, I wanted to support any language for asking questions; one of the preprocessing steps detects the language and translates the question into English for better search results.
- `context`: A large text dataset that includes all the documents and Q&As found in the hybrid retriever's search (`num_of_context_documents` is the number of documents in this `context`).
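Neither the language-detection step nor the question-unification step is shown in this post, so here is a minimal sketch of how they could be implemented with gpt-4o-mini (the function names and prompt wording below are my own illustration, not the exact code behind the service):

```python
from openai import OpenAI

client = OpenAI()

def detect_language_and_translate(question: str) -> tuple[str, str]:
    # Ask a lean model to detect the language and translate the question into English;
    # searching with English text works better against the mostly English source documents.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Detect the language of the user's question and translate the question into English. "
                    "Respond with exactly two lines: the language name, then the English translation."
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    lines = response.choices[0].message.content.split("\n", 1)
    detected_language = lines[0].strip()
    translated_question = lines[1].strip() if len(lines) > 1 else question
    return detected_language, translated_question

def unify_question(past_questions: list[str], current_question: str) -> str:
    # Merge earlier questions in the same session into a single standalone question.
    if not past_questions:
        return current_question
    history = "\n".join(past_questions)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Rewrite the latest question as a single standalone question, "
                "taking the earlier questions in the same session into account.",
            },
            {
                "role": "user",
                "content": f"Earlier questions:\n{history}\n\nLatest question:\n{current_question}",
            },
        ],
    )
    return response.choices[0].message.content.strip()
```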
This prompting approach has worked well for many basic questions, but I found that the system still lacked some commonsense knowledge about the Slack platform, such as what a `team_id` is (it's the ID of a workspace). So, I've enhanced the system prompt to include this kind of universal Slack knowledge, helping the AI more easily and consistently align with the fundamental concepts of the Slack platform.
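I won't reproduce the exact wording here, but the idea is simply to append a short glossary of fundamental Slack concepts to the system prompt. A hypothetical version of that addition might look like this:

```python
# A hypothetical example of the "universal Slack knowledge" appended to the system prompt;
# the actual wording in the deployed service differs.
slack_fundamentals = """
# Slack Platform Fundamentals
- A team_id (e.g. T0123456789) identifies a workspace.
- Channel IDs start with C (public channels), G (private channels / legacy groups), or D (DMs).
- User IDs start with U or W.
- Bot tokens start with xoxb-, user tokens with xoxp-, and app-level tokens with xapp-.
"""
system_prompt = base_system_prompt + slack_fundamentals
```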
Hybrid Retriever Architecture
To retrieve the most relevant documents for each question, I implemented two search layers and combined them with my own ranking logic:
- Chroma handles semantic search.
- SQLite3 (using the BM25 ranking algorithm) manages traditional keyword searches.
As mentioned earlier, I recognize there might be even better solutions out there, but I went with these two simple solutions to quickly create something functional and to learn with the most straightforward solutions first.
The entire process of building the search index involves these steps:
1. Fetch publicly available documents, Q&A, code examples, and unit tests from GitHub.
2. Optimize some of the data and store everything in a single JSON data file (source_data.json).
3. Insert all the data into Chroma and SQLite3 as part of the Docker image.
I will elaborate on steps 1 and 2 in the next section. In this part, I will focus on the process of building the Docker image.
Building Search Index within Docker Image
Both search systems run locally within the Docker container, ensuring low latency and cost efficiency. The databases are built when the Docker image is created. Here’s the relevant excerpt from the Dockerfile:
# copy the python scripts
COPY *.py /app/
# copy the source JSON data files
COPY source_data.json /app/source_data.json
# run the scripts to set up the databases
RUN python keyword_search.py build
RUN python chroma_search.py build
# delete the JSON file to reduce the image size a little bit
RUN rm source_data.json
As of today, the source data file is approximately 13MB in total, with around 6,000 text entries (the optimization I will mention later reduced this file size by 50%). The keyword search SQLite3 database is 23MB, and the Chroma database is 83MB. Considering these relatively small data sizes, this setup works well for both performance and cost.
I chose `python:3.12-slim-bookworm` as the base image (Python 3.13 doesn't yet work with some of the dependencies, so I chose 3.12 for the time being), and the current image size is about 480MB.
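The `chroma_search.py build` step referenced in the Dockerfile isn't shown in this post. Here is a minimal sketch of what such a script could look like, assuming source_data.json is a mapping from document IDs to text entries (the actual script and data format may differ):

```python
import json
import sys

import chromadb

def build() -> None:
    # Load the consolidated source data produced by the data pipeline.
    with open("source_data.json") as f:
        source_data: dict[str, str] = json.load(f)

    # Persist the collection on disk at image-build time so the query-time code
    # can reuse it without rebuilding anything at container startup.
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="slack_platform_docs")

    ids = list(source_data.keys())
    documents = list(source_data.values())

    # Chroma computes embeddings with its default embedding function
    # unless a custom one is configured on the collection.
    batch_size = 100
    for i in range(0, len(ids), batch_size):
        collection.add(ids=ids[i : i + batch_size], documents=documents[i : i + batch_size])

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "build":
        build()
```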
Hybrid Search & Ranking
I started with only Chroma's semantic search, but I soon realized that for certain questions a simple semantic search wasn’t enough.
To resolve the issue, I integrated a BM25-based keyword search through SQLite3. With this enhancement, I observed that the relevance of the retrieved documents significantly improved. The process is as follows:
- Use an LLM to detect the language of the question (and translate it if necessary).
- Merge previous questions in the same session (if any) into the current question using an LLM (this produces the unified question).
- Run a semantic search with the unified question in Chroma.
- Extract a few sets of keywords from the unified question using an LLM.
- Run a few SQLite3 BM25-based keyword searches on the extracted keyword sets.
- Rank the results by document appearance frequency plus weights to determine the final document set.
Although this process requires three OpenAI API calls, using a lean model (gpt-4o-mini) allows these steps to complete quickly and cost-effectively. Most of the processing time comes from the final OpenAI call that generates the answer from the retrieved documents.
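The ranking step at the end of the list is my own simple logic rather than a library feature. A minimal sketch of the idea, assuming the semantic search and each keyword search return lists of document IDs (the weight value below is illustrative), could be:

```python
from collections import Counter

def rank_documents(
    semantic_results: list[str],
    keyword_results: list[list[str]],
    semantic_weight: float = 1.5,  # illustrative weight favoring semantic hits
    limit: int = 10,
) -> list[str]:
    # Score each document by how often it appears across all searches,
    # giving semantic-search hits a slightly higher weight.
    scores: Counter = Counter()
    for doc_id in semantic_results:
        scores[doc_id] += semantic_weight
    for result_set in keyword_results:
        for doc_id in result_set:
            scores[doc_id] += 1.0
    # Documents that show up in multiple searches float to the top.
    return [doc_id for doc_id, _ in scores.most_common(limit)]
```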
To create the SQLite3 table, you can execute the following DDL:
cursor = conn.cursor()
cursor.execute("CREATE VIRTUAL TABLE IF NOT EXISTS source_content_files USING fts5(id, content, last_updated);")
After that, you can insert rows like this:
cursor.execute(
    "INSERT INTO source_content_files (id, content, last_updated) VALUES (?, ?, ?)",
    (key, content, last_updated),
)
With the above setup, you can now run the following query:
cursor.execute(f"""
    SELECT
      content
    FROM
      source_content_files
    WHERE
      source_content_files MATCH ?
    ORDER BY
      -- lower scores indicate better matches
      bm25(source_content_files) ASC,
      last_updated DESC
    LIMIT
      {limit}
    """,
    [condition],
)
return list(map(lambda row: row[0], cursor.fetchall()))
The `condition` can include a few keywords combined with the OR operator.
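For example, a keyword set extracted from the unified question could be turned into an FTS5 MATCH condition like this (the keyword set and the run_keyword_search wrapper around the query above are hypothetical):

```python
# Hypothetical keyword set extracted by the LLM for a question about Socket Mode reconnects
keywords = ["socket", "mode", "reconnect"]
condition = " OR ".join(keywords)  # => "socket OR mode OR reconnect"
documents = run_keyword_search(condition, limit=10)  # wraps the SELECT ... MATCH query above
```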
In an effort to improve the quality of search results and the final output, I explored some additional approaches, such as rule-based document attachments and naive model distillation. Unfortunately, none of these methods were successful. I will continue to explore further ideas.
As for evaluating the results, I manually assessed the quality by running several queries. This approach worked for this app thanks to the small dataset and my familiarity with the topic. However, to ensure rapid and continuous improvement, I'll need to evaluate the quality against a dedicated dataset with reliable tools like Ragas, and ideally automate the process.
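As a rough idea of what that automated evaluation could look like with Ragas (a sketch assuming a small hand-built evaluation set; the expected column names and metrics vary between Ragas versions, so treat this as illustrative only):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# A tiny hand-built evaluation set: questions, the contexts the retriever returned,
# the generated answers, and reference answers written by hand.
eval_data = {
    "question": ["What does team_id represent in Slack APIs?"],
    "contexts": [["team_id is an identifier for a Slack workspace ..."]],
    "answer": ["team_id identifies the workspace the API request belongs to."],
    "ground_truth": ["team_id is the ID of a Slack workspace."],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```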
Source Document Data Pipeline
I briefly mentioned the optimization of the source data in the previous section. Now, I will share the details.
Let's start by explaining what source data this system collects. The quality and breadth of the documents are critical, so I gathered data from multiple sources:
- GitHub: Issues and their comments, SDK documents (markdown), code examples, and unit tests.
- My own content: Blog posts (markdown) and library code in TypeScript.
- Hand-written notes: Additional documents curated specifically for this app.
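The fetching step isn't covered in detail in this post; a minimal sketch of pulling issues and their comments with the GitHub REST API might look like the following (the repository name and the single-page pagination are simplifications for illustration):

```python
import os

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}", "Accept": "application/vnd.github+json"}

def fetch_issues_with_comments(repo: str = "slackapi/bolt-python") -> list[dict]:
    # Fetch closed issues (only the first page here; real code would paginate)
    issues = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        headers=HEADERS,
        params={"state": "closed", "per_page": 100},
    ).json()
    results = []
    for issue in issues:
        if "pull_request" in issue:
            continue  # the issues API also returns pull requests; skip them
        comments = requests.get(issue["comments_url"], headers=HEADERS).json()
        results.append(
            {
                "title": issue["title"],
                "body": issue["body"] or "",
                "comments": [c["body"] for c in comments],
            }
        )
    return results
```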
A significant effort was spent summarizing lengthy GitHub issues and unit test code. I used the OpenAI API to summarize the text data so that it becomes even more efficient to retrieve. For example, one GitHub issue (#1026) was condensed to capture its essence, highlighting key points about modal submissions, API errors, and the proper usage of `ack()` and `client.views_update()`. The summary retained important error messages and links to official documentation, and included a useful code snippet as well.
Title:
Issue in modal submission (We had some trouble connecting.Try again ?)
Summary:
The GitHub issue discusses procedural issues with triggering modals in a Slack app using Python's Socket Mode. The process involves three stages:
1. A modal is triggered through a slash command with an input field for a GitHub issue ID.
2. Upon submission, the data is parsed, and if valid, a second modal with multiple input fields is updated.
3. The final submission retrieves values to update the issue body.
During the interaction, the following error message has been reported:
Failed to run listener function (error: The request to the Slack API failed. (url: https://www.slack.com/api/views.update)
The server responded with: {'ok': False, 'error': 'not_found'})
Key points raised include:
- There were issues with using `ack()` and `client.views_update()` calls simultaneously. The recommendation is to utilize `ack(response_action="update", view= updated_view)` to handle modal updates properly, as `ack()` will close the current modal.
- Using `ack(response_action="update")` should allow for updating the view without closing it unexpectedly, provided the view structure includes all necessary properties (like the title).
- The Slack API does not permit reopening a modal with `views.open` if one is already open; instead, modals should be updated using `views.update`.
- Proper usage of the `view_id` in the `views.update` API call was emphasized.
A code example that aligns with the discussion was provided for better clarity, ensuring proper modal manipulation using `ack()` and enhancing user experience by preventing premature closure of modals. The example demonstrates an effective approach to update modals asynchronously and correctly. The discussion concludes by suggesting that unresolved issues can be addressed in new issue threads, and the current one may be closed for inactivity.
For more information, refer to the following documentation:
- [Updating View Modals](https://api.slack.com/surfaces/modals#updating_views)
- [Slack API Error Handling](https://api.slack.com/developer-tools)
Here's a snippet reflecting the corrected modal management:
@app.view("submit_ticket")
def handle_ticket_submission(ack, body, client):
    errors = {}
    if len(errors) > 0:
        ack(response_action="errors", errors=errors)
    else:
        ack(response_action="update", view=updated_view)
This ensures that user interactions proceed without unnecessary closures or errors.
For generating this output from lengthy GitHub issue discussion text, I used the following prompt:
You'll be responsible for summarizing a GitHub issue, ensuring all crucial details from the discussion are preserved. You may exclude irrelevant parts such as template phrases. While there's no need to drastically shorten the text, you should refine it to remain clear and helpful for Slack app developers.
Maintain any working code examples in the summary. Include any mentioned error messages or codes. If there are valuable links to official documentation, those should also be part of the summary.
Provide only the summary in your response.
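Wiring that prompt into the OpenAI API is straightforward; a minimal sketch of the summarization step (the function name and model choice here are my own illustration) might look like:

```python
from openai import OpenAI

client = OpenAI()

SUMMARIZATION_PROMPT = "You'll be responsible for summarizing a GitHub issue, ..."  # the prompt quoted above

def summarize_github_issue(issue_title: str, issue_discussion: str) -> str:
    # Condense a long issue thread into a retrieval-friendly document
    # while keeping error messages, code examples, and documentation links.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SUMMARIZATION_PROMPT},
            {"role": "user", "content": f"Title: {issue_title}\n\n{issue_discussion}"},
        ],
    )
    return response.choices[0].message.content
```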
In another instance, I used OpenAI to convert unit test code into a practical example script, which is ready to use without any testing framework requirements. This approach was primarily utilized to enhance data quality, making it much more accessible for both retrievers and generative AI.
Here is an example. The `AuthorizeUrlGenerator` unit tests were converted into the following code snippet with sufficient comments. These comments are very useful both for document search quality and for helping generative AI understand the code better.
# This script demonstrates how to use the AuthorizeUrlGenerator and OpenIDConnectAuthorizeUrlGenerator
# classes from the Slack SDK to generate various OAuth authorization URLs.
from slack_sdk.oauth import AuthorizeUrlGenerator, OpenIDConnectAuthorizeUrlGenerator

# Using AuthorizeUrlGenerator to create OAuth URLs
def generate_authorize_urls():
    # Initialize the AuthorizeUrlGenerator with necessary parameters
    generator = AuthorizeUrlGenerator(
        client_id="111.222",  # The client ID for your Slack app
        scopes=["chat:write", "commands"],  # The permissions your app is requesting
        user_scopes=["search:read"],  # The user-level permissions your app is requesting
    )

    # Generate the default OAuth URL
    url_default = generator.generate("state-value")
    print("Default URL:", url_default)
    # Expected URL:
    # https://slack.com/oauth/v2/authorize?state=state-value&client_id=111.222&scope=chat:write,commands&user_scope=search:read

    # Generate the OAuth URL with a custom authorization URL
    generator_with_base_url = AuthorizeUrlGenerator(
        client_id="111.222",
        scopes=["chat:write", "commands"],
        user_scopes=["search:read"],
        authorization_url="https://www.example.com/authorize",  # Custom base URL
    )
    url_base = generator_with_base_url.generate("state-value")
    print("Base URL:", url_base)
    # Expected URL:
    # https://www.example.com/authorize?state=state-value&client_id=111.222&scope=chat:write,commands&user_scope=search:read

    # Generate the OAuth URL including team information
    url_team = generator.generate(state="state-value", team="T12345")
    print("Team URL:", url_team)
    # Expected URL:
    # https://slack.com/oauth/v2/authorize?state=state-value&client_id=111.222&scope=chat:write,commands&user_scope=search:read&team=T12345

# Using OpenIDConnectAuthorizeUrlGenerator to create OpenID Connect URLs
def generate_openid_urls():
    # Initialize the OpenIDConnectAuthorizeUrlGenerator
    oidc_generator = OpenIDConnectAuthorizeUrlGenerator(
        client_id="111.222",
        redirect_uri="https://www.example.com/oidc/callback",  # Where to redirect after authorization
        scopes=["openid"],  # OpenID scope
    )

    # Generate the OpenID Connect authorization URL
    url_oidc = oidc_generator.generate(state="state-value", nonce="nnn", team="T12345")
    print("OpenID Connect URL:", url_oidc)
    # Expected URL:
    # https://slack.com/openid/connect/authorize?response_type=code&state=state-value&client_id=111.222&scope=openid&redirect_uri=https://www.example.com/oidc/callback&team=T12345&nonce=nnn

# Execute the functions to generate and print the URLs
generate_authorize_urls()
generate_openid_urls()
For generating this output, I came up with the following system prompt:
You are tasked with generating useful code examples from either unit test code or example code found in Slack's official SDK repositories.
If the provided text is unit test code for an SDK, you will create a single script that executes a set of operations tested by the unit test. This involves extracting the method executions/invocations of the test target, along with their required initialization processes, such as class instantiation or setting necessary variables. You should include detailed comments on what each part of the code accomplishes in the output script. Converting the assertions in the test code isn't needed, but if the details of a returned value are informative, you should include them as comments in the code as well.
Provide only the generated script code in your response.
The entire process of fetching public data and optimizing the text is idempotent, so I plan to run it periodically to keep the source data up to date.
Streamlit UI Optimization
When I started working on this project, it was a tiny CLI tool for my own use.
The code was something like this (off topic, but if you're looking for a great solution for daily use, take a look at ShellGPT):
# pip install openai rich prompt_toolkit
# export OPENAI_API_KEY=sk-....
# python cli.py
import os
import logging
import traceback

# OpenAI API client
from openai import OpenAI, Stream
from openai.types.chat import ChatCompletionChunk

from main_prompt_builder import build_new_messages
from configuration import load_openai_api_key, load_openai_model

# To render markdown text data nicely
from rich.console import Console
from rich.markdown import Markdown
from rich.live import Live

# To store command history
from prompt_toolkit import PromptSession
from prompt_toolkit.history import FileHistory

prompt_session_history_filepath = os.path.expanduser("~/.my-openai-cli")
prompt_session = PromptSession(history=FileHistory(prompt_session_history_filepath))

# Initialize OpenAI client and model
openai_client = OpenAI(api_key=load_openai_api_key())
openai_model = load_openai_model()

# stdout
console = Console()

def p(output):
    console.print(output, end="")

def main():
    p(f"\nConnected to OpenAI (model: {openai_model})\n\n")
    messages = build_new_messages()
    while True:
        prompt: str = ""
        try:
            prompt = prompt_session.prompt("> ")
        except (EOFError, KeyboardInterrupt):  # ctrl+D, ctrl+C
            p("\n")
            exit(0)
        if not prompt.strip():
            continue
        try:
            prepare_messages_with_latest_prompt_content(openai_client, messages, prompt)
            p("\n")
            stream: Stream[ChatCompletionChunk] = openai_client.chat.completions.create(
                model=openai_model,
                messages=messages,
                temperature=0.1,
                stream=True,
            )
            reply: str = ""
            with Live(refresh_per_second=6) as live:
                for chunk in stream:
                    item = chunk.choices[0].model_dump()
                    if item.get("finish_reason") is not None:
                        break
                    chunk_text = item.get("delta", {}).get("content")
                    if chunk_text:
                        reply += chunk_text
                        live.update(Markdown(reply))
            p("\n")
            messages.pop()  # remove the lengthy user message with retrieved documents
            messages.append({"role": "user", "content": prompt})
            messages.append({"role": "assistant", "content": reply})
        except KeyboardInterrupt:  # ctrl+C
            p("\n")
            continue
        except EOFError:  # ctrl+D
            p("\n")
            exit(0)
        except Exception as e:
            p(f"\n[bold red] Error: {e}[/bold red]\n")
            print(traceback.format_exc())
            continue

if __name__ == "__main__":
    main()
The app then evolved into a web UI, making it accessible to anyone. Streamlit made it easy to focus on the core logic instead of spending too much time on the interface. Here are a few standout enhancements I added:
Dynamic “Kaz is thinking …” Indicator
To let the user know that the system is working on their question, I initially displayed static text saying "Kaz is thinking..." on the UI until the stream rendering began.
Then, to make the interaction more lively, I replaced the static text with a dynamic indicator. It refreshes every 1.5 seconds while a background thread works on the OpenAI call, so the user can see that the system is actively processing their query.
The actual code is a bit more complex, but the core logic is illustrated below:
# the area to display the assistant's reply
response_placeholder = st.empty()
# the textarea to enter the user's question
current_question = st.text_area("Enter your question:", key="user_input", height=120)
# the button to send right after the textarea
submit_button = st.button("Send", disabled=st.session_state.button_disabled)

if submit_button and current_question.strip():
    # when the "Send" button is clicked
    st.session_state.button_disabled = True
    st.session_state.now_answering = False

    # run this function in a different thread
    def answer_question():
        try:
            assistant_reply = ""
            # make a copy of the messages list to avoid displaying
            # the actual long prompt with retrieved documents on the UI
            messages = st.session_state.messages.copy()
            # this internal function updates the messages
            # with documents returned by the retriever system;
            # it also streamlines the prompt to consider the older user questions in the same session
            update_messages_with_latest_prompt_content(
                client=client,
                collection=collection,
                messages=messages,
                prompt=current_question,
            )
            # update the UI with the streaming response from OpenAI
            for chunk in stream_openai_response(client, openai_model, messages):
                if st.session_state.now_answering is False:
                    st.session_state.now_answering = True
                assistant_reply += chunk
                response_placeholder.markdown(assistant_reply)
            # append the user's input rather than the actual long prompt with retrieved documents
            st.session_state.messages.append({"role": "user", "content": current_question})
            st.session_state.messages.append({"role": "assistant", "content": assistant_reply})
            # delete the input in the textarea to enable the user to immediately enter a new question
            if "user_input" in st.session_state:
                del st.session_state.user_input
            st.session_state.button_disabled = False
            # All the things for this question are done, so rerun the app to refresh the whole UI
            st.rerun()
        except Exception as e:
            print(traceback.format_exc())
            st.error(f"An error occurred: {e}")

    # start a new thread to call OpenAI's API and reflect the streaming response on the UI
    t = threading.Thread(target=answer_question, daemon=True)
    # note that it is crucial to call these functions before simply starting the thread
    add_script_run_ctx(t, get_script_run_ctx())
    t.start()

    # display the lively updated "Kaz is thinking ..." indicator
    count = 2
    while t.is_alive() and st.session_state.now_answering is False and count < 30:
        dots = "." * count
        now_loading = f"**Hold on, Kaz is thinking {dots} 🤔**"
        response_placeholder.markdown(now_loading)
        time.sleep(1.5)
        count += 1
    # If the AI is still thinking after a while, encourage the user to try again
    if count >= 30:
        st.error("It seems Kaz is a bit busy right now ... 😔 Please try again later.")
        st.stop()
    # wait for the thread to finish
    st.session_state.now_answering = False
    t.join()
Note: The call to `add_script_run_ctx(t, get_script_run_ctx())` is required before starting the thread; otherwise, the code running in the thread will fail to access Streamlit's session and other data.
This seems to be a common need, but I couldn't find direct answers on how to implement it properly. So, I thought sharing this might be helpful to someone else. This is a small customization of the Streamlit UI, but I believe it makes the user experience more pleasant.
Simple Abuse Prevention & Cached Responses
While not foolproof, a simple rate-limiting mechanism may help prevent abuse of the system.
Using a local SQLite3 database to track request frequency, the app can temporarily block users who send too many requests in a short time. Here’s a simplified code snippet of the abuse prevention logic:
# If block_again is true, it means the remaining block duration will be extended.
# The flag can be set to false when a user simply visits the website.
# Contrarily, it is true when a user submits a query that incurs a cost to the system.
def detect_frequent_requests(ip_address: str, block_again: bool = True):
    last_accessed, last_blocked = get_last_accessed(ip_address)
    if is_this_user_blocked(last_accessed, last_blocked):
        if block_again is True:
            save_as_blocked(ip_address)
        st.error("👋 Thanks for asking questions! but I'm unable to answer many questions at once. Please try again later.")
        st.stop()
When scaling the service out, a local SQLite3 database inside each container won't work well for this purpose. So, if this service receives far more traffic than I expect, I'll consider switching to a more robust solution. But for now, it's better than nothing.
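The helper functions referenced in the snippet above (get_last_accessed, is_this_user_blocked, and save_as_blocked) aren't shown in this post. A minimal sketch of how they might be backed by a small SQLite3 table, with fixed and purely illustrative thresholds, could be:

```python
import sqlite3
import time
from typing import Optional

BLOCK_DURATION_SECONDS = 60 * 10  # hypothetical: keep a block active for 10 minutes
REQUEST_INTERVAL_SECONDS = 10  # hypothetical: minimum interval between costly requests

conn = sqlite3.connect("rate_limit.db", check_same_thread=False)
conn.execute(
    "CREATE TABLE IF NOT EXISTS access_log (ip TEXT PRIMARY KEY, last_accessed REAL, last_blocked REAL)"
)

def get_last_accessed(ip_address: str) -> tuple[Optional[float], Optional[float]]:
    # Return the previous access/block timestamps, then record this access.
    row = conn.execute(
        "SELECT last_accessed, last_blocked FROM access_log WHERE ip = ?", (ip_address,)
    ).fetchone()
    conn.execute(
        "INSERT INTO access_log (ip, last_accessed) VALUES (?, ?) "
        "ON CONFLICT(ip) DO UPDATE SET last_accessed = excluded.last_accessed",
        (ip_address, time.time()),
    )
    conn.commit()
    return (row[0], row[1]) if row else (None, None)

def is_this_user_blocked(last_accessed: Optional[float], last_blocked: Optional[float]) -> bool:
    now = time.time()
    if last_blocked is not None and now - last_blocked < BLOCK_DURATION_SECONDS:
        return True  # still within an active block
    return last_accessed is not None and now - last_accessed < REQUEST_INTERVAL_SECONDS

def save_as_blocked(ip_address: str) -> None:
    conn.execute(
        "UPDATE access_log SET last_blocked = ? WHERE ip = ?", (time.time(), ip_address)
    )
    conn.commit()
```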
In connection with this, for cost efficiency on the system side, I've added a simple cache layer. The suggested prompts you'll see when visiting the website have cached responses, so the system only mimics the stream rendering (sorry about that! but it's an actual response from the system that I captured in advance). This approach reduces system load and minimizes the unnecessary costs of AI retrieval and CPU time on Fly.io, especially for results that are mostly repetitive.
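Mimicking the stream rendering for a cached response is simple; a sketch of the idea (the chunk size and delay are illustrative) could be:

```python
import time
from typing import Iterator

def stream_cached_response(cached_answer: str, chunk_size: int = 20) -> Iterator[str]:
    # Yield the pre-captured answer in small chunks with a short delay,
    # so the UI renders it just like a live OpenAI streaming response.
    for i in range(0, len(cached_answer), chunk_size):
        yield cached_answer[i : i + chunk_size]
        time.sleep(0.03)
```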
Sign in with Google
Although the authentication isn't fully utilized yet, the idea is to allow only logged-in users to perform resource-intensive generation tasks, which helps manage costs and prevent system abuse.
Here is the code to check whether a user is logged in and to display the login/logout link on the right side of the header. The `is_logged_in()` function is also used when handling the frequency of the user's requests.
# added the following dependencies to the requirements.txt
# streamlit==1.42.0
# Authlib==1.4.1

def is_logged_in() -> bool:
    return st.experimental_user and st.experimental_user.is_logged_in

def logged_in_user_name() -> Optional[str]:
    if st.experimental_user and "name" in st.experimental_user:
        return st.experimental_user.name
    return None

# place the login/logout link on the right side of the header
col_left, col_right = st.columns([6, 1])
col_left.header("🤔 What would Kaz say❓")
if is_logged_in():
    # pass "tertiary" type to make it look like a link
    if col_right.button("Log Out", type="tertiary"):
        st.logout()
else:
    if col_right.button("Log In", type="tertiary"):
        st.login()
The current implementation supports only Google accounts, but you can easily add more authentication providers: https://docs.streamlit.io/develop/concepts/connections/authentication
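For reference, Streamlit's native authentication is configured via .streamlit/secrets.toml; a Google-only setup looks roughly like the following (all values are placeholders, and the exact keys are described in the Streamlit documentation linked above):

```toml
[auth]
redirect_uri = "https://what-would-kaz-say.fly.dev/oauth2callback"
cookie_secret = "a-long-random-secret-string"
client_id = "xxx.apps.googleusercontent.com"
client_secret = "xxx"
server_metadata_url = "https://accounts.google.com/.well-known/openid-configuration"
```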
Moving Forward
Considering I spent just a few days on development, I'm quite happy with what the system can do—but I'm well aware it's far from perfect. Here are some improvements I'm planning:
- I can't wait to see the results with o3-mini or even newer models once they come to my tier.
- I've added all the documents that I could think of, but there may be more documents that could be added to cover broader topics.
- The conversion of GitHub issues and unit tests worked well, but there may be more areas to apply similar techniques.
- I understand that my search and ranking system can be improved; I will continue exploring other available solutions for even better service quality.
- A `@what-would-kaz-say` bot in Slack workspaces should make sense because this is about the Slack platform!
- Providing APIs or webhooks would be a good idea too; other apps/platforms could leverage this service's outputs.
Building this Slack platform expert AI was a great self-learning opportunity for me. I hope this blog article will be helpful to someone else, and that the webapp will assist future Slack app developers in creating amazing integrations.
Please feel free to comment on this post, and thanks for reading!