Akmal Chaudhri for SingleStore

Quick tip: Replicating JSON data from MongoDB to SingleStore Kai and creating OpenAI embeddings

Abstract

It is now straightforward to replicate JSON data from MongoDB to SingleStore Kai using the new Change Data Capture (CDC) solution. This article describes two examples of how to do this. In the first example, we'll replicate a small retail dataset. In the second example, we'll replicate a small book dataset and create OpenAI embeddings for books. Finally, we'll build a simple Streamlit application to provide a GUI for a book recommender system.

The datasets, notebook files, SQL and Streamlit code are available on GitHub.

Introduction

The new CDC solution from SingleStore makes it very easy to replicate data from MongoDB to SingleStore Kai. We'll work through two different examples in this article to see how.

MongoDB Atlas

We'll use MongoDB Atlas in an M0 Sandbox. We'll configure an admin user with atlasAdmin privileges under Database Access. We'll temporarily allow access from anywhere (IP Address 0.0.0.0/0) under Network Access. We'll note down the username, password and host.

SingleStore Kai

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: CDC Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: cdc-demo
  • Size: S-00
  • Settings:
    • SingleStore Kai selected

Once the workspace is available, we'll make a note of our password and host. The host will be available from cdc-demo > Connect > MongoDB Client. We'll need this information later for Streamlit. We'll also temporarily allow access from anywhere by configuring the firewall under CDC Demo Group > Firewall.

1. Retail Sales Example

MongoDB Atlas

We'll create a new database called new_transactions with three collections called custs, items and txs. We'll load data into these collections. The data load can be accomplished in several ways, such as using MongoDB Compass or mongoimport. The datasets can be found on GitHub.
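For reference, the load can also be scripted with pymongo. Below is a minimal sketch: the `read_ndjson` helper and the file names are assumptions (adjust them to however the datasets in the GitHub repo are exported), and the connection lines are commented out so that the Atlas details can be filled in first.

```python
import json

def read_ndjson(path):
    """Parse a newline-delimited JSON export into a list of documents."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Fill in the Atlas values noted earlier, then uncomment to load the data:
# import pymongo
# client = pymongo.MongoClient("mongodb+srv://<username>:<password>@<host>/")
# db = client["new_transactions"]
# for name in ("custs", "items", "txs"):
#     db[name].insert_many(read_ndjson(f"{name}.json"))
```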

SingleStore Kai

From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio > Open SQL Editor to create a new_transactions database and link, as follows:

CREATE DATABASE IF NOT EXISTS new_transactions;
USE new_transactions;

DROP LINK new_transactions.link;

CREATE LINK new_transactions.link AS MONGODB
CONFIG '{"mongodb.hosts": "<primary>:27017, <secondary>:27017, <secondary>:27017",
        "collection.include.list": "new_transactions.*",
        "mongodb.ssl.enabled": "true",
        "mongodb.authsource": "admin",
        "mongodb.members.auto.discover": "false"}'
CREDENTIALS '{"mongodb.user": "<username>",
            "mongodb.password": "<password>"}';

CREATE TABLES AS INFER PIPELINE AS LOAD DATA LINK new_transactions.link '*' FORMAT AVRO;

We'll replace <username> and <password> with the values that we saved earlier from MongoDB Atlas. We'll also replace <primary> and the two <secondary> placeholders with the full hostnames of the replica set members, which can be found in MongoDB Atlas.

Next, we'll start the pipelines to replicate the data, as follows:

START ALL PIPELINES;

Shortly afterwards, we'll check the tables and confirm that the data have been copied across, as follows:

SELECT COUNT(*) FROM custs;
SELECT COUNT(*) FROM items;
SELECT COUNT(*) FROM txs;

Load the Notebook

We'll use the notebook file available on GitHub. From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio. In the top right of the web page is a New Notebook button with a pulldown that offers two options:

  1. New Notebook
  2. Import From File

We'll select the second option, locate the notebook file we downloaded from GitHub and load it into SingleStoreDB Cloud.

Run the Notebook

Figure 1 shows the overall flow of this application.

Figure 1. Retail Sales Workflow (Source: SingleStore).

In Figure 1,

  • Step 1: Copy data from Atlas to SingleStoreDB was previously completed using the SingleStore CDC solution.
  • Step 2: Run Queries against Atlas and SingleStore allows us to run various analytical queries against both systems.

We'll enter the MongoDB Atlas details in the notebook, as follows:

import pymongo

myclientmongodb = pymongo.MongoClient("mongodb+srv://<username>:<password>@<host>/?retryWrites=true&w=majority")

We'll replace <username>, <password> and <host> with the values that we saved earlier from MongoDB Atlas.

We'll now run through the Retail Sales notebook to generate and view the analytics and visualisations.

2. Bookstore Example

MongoDB Atlas

We'll create a new database called bookstore with a collection called books. We'll load data into this collection. The data load can be accomplished in several ways, such as using MongoDB Compass or mongoimport. The dataset can be found on GitHub.

SingleStore Kai

From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio > Open SQL Editor to create a bookstore database and link, as follows:

CREATE DATABASE IF NOT EXISTS bookstore;
USE bookstore;

DROP LINK bookstore.link;

CREATE LINK bookstore.link AS MONGODB
CONFIG '{"mongodb.hosts": "<primary>:27017, <secondary>:27017, <secondary>:27017",
        "collection.include.list": "bookstore.*",
        "mongodb.ssl.enabled": "true",
        "mongodb.authsource": "admin",
        "mongodb.members.auto.discover": "false"}'
CREDENTIALS '{"mongodb.user": "<username>",
            "mongodb.password": "<password>"}';

CREATE TABLES AS INFER PIPELINE AS LOAD DATA LINK bookstore.link '*' FORMAT AVRO;

We'll replace <username> and <password> with the values that we saved earlier from MongoDB Atlas. We'll also replace <primary> and the two <secondary> placeholders with the full hostnames of the replica set members, which can be found in MongoDB Atlas.

Next, we'll start the pipeline to replicate the data, as follows:

START ALL PIPELINES;

Shortly afterwards, we'll check the table and confirm that the data have been copied across, as follows:

SELECT COUNT(*) FROM books;

Load the Notebook

We'll use the notebook file available on GitHub. From the left navigation pane in SingleStoreDB Cloud, we'll select DEVELOP > Data Studio. In the top right of the web page is a New Notebook button with a pulldown that offers two options:

  1. New Notebook
  2. Import From File

We'll select the second option, locate the notebook file we downloaded from GitHub and load it into SingleStoreDB Cloud.

Run the Notebook

Figure 2 shows the overall flow of this application.

Figure 2. Bookstore Workflow (Source: SingleStore).

In Figure 2,

  • The book data were previously loaded using the SingleStore CDC solution.
  • The notebook generates embeddings for the books using OpenAI. The embeddings are stored in a new SingleStore Kai collection.
  • The notebook obtains a query embedding from OpenAI and performs a $dotProduct similarity search when we query the books. SingleStore Kai also supports $euclideanDistance, as discussed in a previous article.
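To make the similarity measure concrete: $dotProduct scores each book by the dot product of its stored embedding and the query embedding. Since OpenAI embeddings are unit-length vectors, this is equivalent to cosine similarity. A toy illustration in plain Python, with 2-d vectors standing in for the real high-dimensional embeddings:

```python
def dot_product(a, b):
    """The score $dotProduct assigns: the sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

book = [0.6, 0.8]    # toy stored book embedding
query = [0.8, 0.6]   # toy query embedding
score = dot_product(book, query)  # 0.48 + 0.48 = 0.96; higher means closer
```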

We'll now run through the Bookstore notebook to generate book recommendations. The query results will be output as JSON data.

Bonus: Book Recommender

For the Bookstore example, we'll build a small Streamlit application that provides a more interactive experience and outputs the results in a better format, rather than as JSON data. The Streamlit code can be found on GitHub.

We'll enter the SingleStore Kai MongoDB Client details in the Streamlit application, as follows:

import pymongo

def init_connection():
    # Connect to the SingleStore Kai MongoDB-compatible endpoint
    return pymongo.MongoClient("mongodb://<username>:<password>@<host>:27017/?authMechanism=PLAIN&tls=true&loadBalanced=true")

We'll replace <username>, <password> and <host> with the values that we saved earlier from SingleStoreDB Cloud.
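Internally, the application can express the recommendation query as an aggregation pipeline. The helper below is a hypothetical sketch, not the actual code from GitHub: the stage layout, the `similarity` field name and the `embedding` field are assumptions. It ranks books by $dotProduct against the query embedding and returns the top matches.

```python
def build_vector_query(query_embedding, limit=5):
    """Build an aggregation pipeline ranking books by $dotProduct similarity."""
    return [
        # Score each document against the query embedding
        {"$addFields": {"similarity": {"$dotProduct": ["$embedding", query_embedding]}}},
        # Highest-scoring books first, trimmed to the requested number
        {"$sort": {"similarity": -1}},
        {"$limit": limit},
    ]

# Usage against the Kai client returned by init_connection()
# (collection name is also an assumption):
# results = client["bookstore"]["books_embeddings"].aggregate(
#     build_vector_query(query_embedding, limit=5))
```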

We'll need to provide an OpenAI API Key in our environment. For example:

export OPENAI_API_KEY="<OpenAI API Key>"

Replace <OpenAI API Key> with your key.

Next, we'll run the Streamlit application, as follows:

streamlit run streamlit_app.py

The application presents an input box on the left side where the query can be typed, and the results are shown on the right side, formatted as a table. An example query and its output are shown in Figure 3.

Figure 3. Book Recommender.

Summary

This article presented two examples of how easily we can use the SingleStore CDC solution to connect to MongoDB Atlas and replicate the data to SingleStore Kai. We also created embeddings from JSON book data and stored these embeddings in SingleStore Kai. Finally, we built a simple application to query the database system to make book recommendations.
