Foteini Savvidou

Image Vector Similarity Search with Azure Computer Vision and PostgreSQL

Recently, vector embeddings and vector databases have become increasingly popular, enabling the development of new intelligent applications such as vector similarity search. Since I had no prior experience with these concepts, I decided to explore them and document my journey for you.

In this article, you will explore the Image Retrieval functionality of Azure Computer Vision 4.0, which is powered by the Florence Foundation model. You will:

  • Explore the concept of vector embeddings and vector similarity search.
  • Understand how the Image Retrieval APIs of Azure Computer Vision work.
  • Use the Image Retrieval APIs in Python to search a collection of images to find those that are most similar to a given query image or text prompt.
  • Use Azure Cosmos DB for PostgreSQL to store and query vector data.

Before starting to build your image vector similarity system, follow these steps:

  • Sign up for either an Azure free account or an Azure for Students account. If you already have an active subscription, you can use it.

  • Create a Cognitive Services resource in the Azure portal.

    The Image Retrieval APIs are available in the following regions: East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, West US. To run the Image Retrieval demo in the Vision Studio, your resource must belong to the East US region.

  • Install Python 3.x, Visual Studio Code, Jupyter Notebook and Jupyter Extension for Visual Studio Code.

Do you want to start coding right away? The source code, image files, and their vector embeddings can all be found on my GitHub repository.

What are vector embeddings?

Let's start by understanding vector embeddings. In simple terms, vector embeddings are numerical representations of data, such as images, text, videos, and audio. These vectors are high-dimensional dense vectors, with each dimension containing information about the original content. By translating data into dense vectors, we can use them to perform tasks like clustering, recommendation, or image search. Additionally, these dense representations allow computers to understand the semantic similarity between two objects. That is, if we represent objects like images, videos, documents, or audio as vector embeddings, we can quantify their semantic similarity by their proximity in a vector space. For example, the embedding vector of an image of a cat will be more similar to the embedding vector of another cat or the word "meow" than that of a picture of a dog or the word "woof".

Graph of similar vector embeddings.

Generating dense vectors

There are lots of ways to generate vector embeddings of data. To vectorize your data in Azure, you can use the Azure OpenAI Embeddings models for text documents or the Azure Cognitive Services for Vision Image Retrieval APIs for images and text. The latter is a multi-modal embedding model that enables you to vectorize image and text data and build image search systems.

If you're interested in learning more about vector embeddings and how they're created, check out the post "Vector Similarity Search: From Basic to Production" on the MLOps community website!

What is vector similarity search?

Vector embeddings capture the semantic similarity of objects, which makes semantic similarity search systems possible. An image vector similarity search retrieves images based on the content of the image itself, instead of relying only on manually assigned keywords, tags, or other metadata, as keyword-based search systems do. This approach often returns more relevant results than traditional search techniques, which depend heavily on the user's ability to find the best search terms.

In a vector search system, the vector embedding of a user's query is compared to a set of pre-stored vector embeddings to find a list of vectors that are the most similar to the query vector. Since embeddings that are numerically close are also semantically similar, we can measure the semantic similarity by using a distance metric, such as cosine or Euclidean distance.

In this post, we will be utilizing the cosine similarity metric to measure the similarity between two vectors. This metric is equal to the cosine of the angle between the two vectors, or, equivalently, the dot product of the vectors divided by the product of their magnitudes. The cosine similarity ranges from -1 to 1, with a value close to 1 indicating that the vectors are very similar. To help illustrate this, here is a Python function that calculates the cosine similarity between two vectors.

import math

def get_cosine_similarity(vector1, vector2):
    """
    Get the cosine similarity between two vectors
    """
    dot_product = 0
    length = min(len(vector1), len(vector2))

    for i in range(length):
        dot_product += vector1[i] * vector2[i]

    magnitude1 = math.sqrt(sum(x * x for x in vector1))
    magnitude2 = math.sqrt(sum(x * x for x in vector2))

    return dot_product / (magnitude1 * magnitude2)
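
To make the range of the metric concrete, here is a small sketch that applies the same formula to three pairs of toy 2-dimensional vectors. The values are made up for illustration and are not real embeddings, and the function name is only a shortened stand-in for the one above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```

Note that the magnitude of the vectors does not matter: [1, 2] and [2, 4] point in the same direction, so their similarity is (up to floating-point rounding) 1.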

Do image retrieval using Azure Computer Vision

Azure Computer Vision 4.0's Image Retrieval APIs, powered by the Florence Foundation model, allow for the vectorization of images and text queries. This vectorization converts images and text into coordinates in a 1024-dimensional vector space, enabling users to search a collection of images using text and/or images without the need for metadata, such as image tags, labels, or captions.

Azure Computer Vision provides two Image Retrieval APIs for vectorizing image and text queries: the Vectorize Image API and the Vectorize Text API. The diagram below shows a typical image retrieval process using these APIs.

A typical image retrieval system using the Image Retrieval APIs of Azure Computer Vision.
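
As a rough sketch of what a call to these two APIs might look like at the REST level, the snippet below only builds the HTTP requests and does not send them. The URL path, the `2023-02-01-preview` API version, and the request bodies reflect my understanding of the preview API at the time of writing and should be checked against the current documentation. The endpoint, key, image URL, and the `build_vectorize_request` helper are hypothetical placeholders.

```python
import json
import urllib.request

# Assumption: preview API version of the Image Retrieval APIs; verify before use.
API_VERSION = "2023-02-01-preview"


def build_vectorize_request(endpoint, key, operation, payload):
    """Build (but do not send) a POST request for an Image Retrieval operation.

    operation is expected to be "vectorizeImage" or "vectorizeText".
    """
    url = (
        f"{endpoint.rstrip('/')}/computervision/retrieval:{operation}"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


# Placeholder endpoint and key (hypothetical values)
endpoint = "https://my-resource.cognitiveservices.azure.com/"
key = "<your-key>"

text_request = build_vectorize_request(
    endpoint, key, "vectorizeText", {"text": "a seahorse"}
)
image_request = build_vectorize_request(
    endpoint, key, "vectorizeImage", {"url": "https://example.com/seahorse.jpg"}
)
print(text_request.full_url)
```

Sending either request (for example with `urllib.request.urlopen`) should return a JSON document whose `vector` field holds the embedding. In the rest of this post, these calls are hidden behind the `text_embedding` and `image_embedding` helpers from azurecv.py, whose implementation may differ from this sketch.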

Image Retrieval in Python

Navigate to my GitHub repository and download the source code of this article. Open the image-embedding.ipynb Jupyter Notebook in Visual Studio Code and explore the code.

Using the Vectorize Image API, this code generates vector embeddings for 200 images and exports them into a JSON file. After that, two image vector similarity search processes are highlighted: image-to-image search and text-to-image search.

  • In the image-to-image search process, the reference image is converted into a vector embedding using the Vectorize Image API and the cosine distance is used to measure the similarity between the query vector and the vector embeddings of our image collection. The top matched images are retrieved and displayed alongside the reference image.

  • In the text-to-image search process, the text query is converted into a vector embedding using the Vectorize Text API and the most similar images are retrieved and displayed based on the cosine similarity between the query vector and their vectors.
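
Both search processes boil down to the same ranking step once the query has been vectorized. The following is a minimal in-memory sketch of that step, not the notebook's actual code: the file names and 3-dimensional vectors are made up (real embeddings have 1024 dimensions), and `top_matches` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, collection, top_n=3):
    """Return the top_n (file, score) pairs, best match first."""
    scored = [
        (file, cosine_similarity(query_vector, vector))
        for file, vector in collection.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 3-dimensional "embeddings" keyed by image path
collection = {
    "images/cat.jpg": [0.9, 0.1, 0.0],
    "images/dog.jpg": [0.1, 0.9, 0.0],
    "images/seahorse.jpg": [0.0, 0.1, 0.9],
}

# Stands in for the embedding of a query image or a text prompt
query = [0.1, 0.0, 1.0]

results = top_matches(query, collection, top_n=2)
for file, score in results:
    print(f"{file}: {score:.3f}")
```

This brute-force comparison against every stored vector is fine for a few hundred images. For larger collections, the comparison is usually delegated to a vector database, as described in the next section.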

This example is inspired by Serge Retkowsky’s Azure Computer Vision in a day workshop.

What is a vector database?

Vector databases are a specialized type of database designed for handling data as high-dimensional vectors, or embeddings. They differ from traditional databases in that they are optimized to store and query vectors with a large number of dimensions, which can range from tens to thousands, depending on the complexity of the data and the transformation function applied. The diagram below illustrates the workflow of a basic vector search system.

Workflow of a vector search system.

Let's analyze this further.

  1. First, we use an embedding model to generate vector embeddings for our raw data, such as text, images, videos, or audio.
  2. These vectors are stored in the vector database along with a reference to the original data and/or other metadata (optional).
  3. When a query is issued by an application, we use the same embedding model to generate embeddings for the query. This query can be of the same data type as our dataset (e.g., an image to search for similar images) or of a different data type (e.g., text to search for similar images).
  4. We then use the query vector to search for similar vector embeddings in the database. To quantify how close two vectors are in the high-dimensional vector space, we use a similarity measure, such as cosine similarity.
  5. The similarity search will output a list of vectors that are most similar to the query vector. The raw data associated with each vector can then be accessed.

As of July 3, 2023, the following Azure services are available for storing and querying vector data.

  1. Azure Database for PostgreSQL and Azure Cosmos DB for PostgreSQL
  2. Azure Cosmos DB for MongoDB vCore
  3. Azure Cache for Redis Enterprise
  4. Azure Data Explorer
  5. Azure Cognitive Search – Vector Search (private preview)

Vector search with Azure Cosmos DB for PostgreSQL

For this post, I wanted to try out Azure Cosmos DB for PostgreSQL in which vector similarity search is enabled by the pgvector extension. In the sections below, I will show you how to enable the extension, create a table to store vector data, and query the vectors.

Let's get started!

Create an Azure Cosmos DB for PostgreSQL cluster

  1. Sign in to the Azure portal and select + Create a resource.
  2. Search for Azure Cosmos DB. On the Azure Cosmos DB screen, select Create.
  3. On the Which API best suits your workload? screen, select Create on the Azure Cosmos DB for PostgreSQL tile.
  4. On the Create an Azure Cosmos DB for PostgreSQL cluster form fill out the following values:

    • Subscription: Select your subscription.
    • Resource group: Select an existing resource group or create a new one.
    • Cluster name: Enter a name for your Azure Cosmos DB for PostgreSQL cluster.
    • Location: Choose one of the available regions.
    • Scale: You can leave Scale as its default value or select the optimal number of nodes as well as compute, memory, and storage configuration.
    • PostgreSQL version: Choose a PostgreSQL version such as 15.
    • Database name: You can leave database name at its default value citus.
    • Administrator account: The admin username must be citus. Select a password that the citus role will use to connect to the database.

    Screenshot of the Basic tab of the Create an Azure Cosmos DB for PostgreSQL cluster form.

  5. On the Networking tab, select Allow public access from Azure services and resources within Azure to this cluster and create your preferred firewall rule.

    Screenshot of the Networking tab of the Create an Azure Cosmos DB for PostgreSQL cluster form.

  6. Navigate to the Review + Create tab and then select Create to create the cluster.

  7. Once the deployment is complete, select Go to resource.

  8. Save the Coordinator name, Database name, and Admin username found on the Overview tab of your cluster, as you will need them in a subsequent step.

Enable the pgvector extension

The pgvector extension adds vector similarity search capabilities to your PostgreSQL database. Before you can use the extension, you have to create it in your database. We will do this from the PSQL Shell.

  1. Select the Quick Start tab on the left pane of your Azure resource and then select the PostgreSQL Shell tile.
  2. Enter your password to connect to your database.
  3. The pgvector extension can be enabled using the following command:

    SELECT CREATE_EXTENSION('vector');
    

Create a table to store vector data

On the PSQL Shell, run the following command to create a table to store vector data.

CREATE TABLE imagevectors(
    file TEXT PRIMARY KEY,
    embedding VECTOR(1024)
    );

The table has two columns: one for the image file paths of type TEXT and one for the corresponding vector embeddings of type VECTOR(1024).
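
Because the column is declared as VECTOR(1024), pgvector will reject vectors with any other number of dimensions, so it can help to check an embedding before sending it to the database. The sketch below formats a Python list as the bracketed text form that pgvector accepts for vector values. `to_pgvector_literal` is a hypothetical helper that is not part of this post's source code, and the example uses 3 dimensions only to keep the output short.

```python
def to_pgvector_literal(embedding, dims=1024):
    """Format a list of numbers as a pgvector text literal such as '[0.1,0.2]'."""
    if len(embedding) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(embedding)}")
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Toy example with 3 dimensions instead of 1024
literal = to_pgvector_literal([0.25, -1, 3.5], dims=3)
print(literal)  # [0.25,-1.0,3.5]
```

Calling `str()` on a Python list of floats, as the code later in this post does, produces an equivalent bracketed representation (with spaces after the commas).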

Create the Python application

Create a new .ipynb file and open it in Visual Studio Code. Create a .env file and save the following variables:

Variable name        Variable value
CV_KEY               Azure Cognitive Services key
CV_ENDPOINT          Azure Cognitive Services endpoint
POSTGRES_HOST        PostgreSQL cluster coordinator name
POSTGRES_DB_NAME     PostgreSQL cluster database name
POSTGRES_USER        PostgreSQL cluster admin username
POSTGRES_PASSWORD    PostgreSQL cluster admin password

The source code, image files, and their vector embeddings can all be found on my GitHub repository.

  1. Import the following libraries and 3 functions from the azurecv.py file.

    import os
    import glob
    import json
    import psycopg2
    from psycopg2 import pool
    from dotenv import load_dotenv
    import pandas as pd
    import csv
    from io import StringIO
    import math
    from azurecv import text_embedding, image_embedding, display_image_grid
    
  2. Insert a new code cell and add the following code, which will load the image file paths and the vector embeddings generated in a previous section.

    # images
    images_folder = "images"
    image_files = glob.glob(images_folder + "/*")
    
    # embeddings
    output_folder = "output"
    emb_json = os.path.join(output_folder, "embeddings.json")
    with open(emb_json) as f:
        image_embeddings = json.load(f)
    
    print(f"Total number of images: {len(image_files)}")
    print(f"Number of imported vector embeddings: {len(image_embeddings)}")
    
  3. In a new code cell, create a Pandas Dataframe with two columns: one for the image file paths and one for the corresponding vector embeddings (converted to a string representation).

    df_files = pd.DataFrame(image_files, columns=['file'])
    df_embeddings = pd.DataFrame([str(emb) for emb in image_embeddings], columns=['embedding'])
    df = pd.concat([df_files, df_embeddings], axis=1)
    df.head(5)
    
  4. Then, you will connect to the Azure Cosmos DB for PostgreSQL cluster. The following code forms a connection string using the environment variables for your Azure Cosmos DB for PostgreSQL cluster and creates a connection pool to your Postgres database. After that, a cursor object is created, which can be used to execute SQL queries with the execute() method.

    # Load environment variables
    load_dotenv()
    host = os.getenv("POSTGRES_HOST")
    dbname = os.getenv("POSTGRES_DB_NAME")
    user = os.getenv("POSTGRES_USER")
    password = os.getenv("POSTGRES_PASSWORD")
    sslmode = "require"
    table_name = "imagevectors"
    
    # Build a connection string from the variables
    conn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(host, user, dbname, password, sslmode)
    
    postgreSQL_pool = psycopg2.pool.SimpleConnectionPool(1, 20, conn_string)
    if (postgreSQL_pool):
        print("Connection pool created successfully")
    
    # Get a connection from the connection pool
    conn = postgreSQL_pool.getconn()
    cursor = conn.cursor()
    
  5. Let's add some data to our table. Inserting data into a PostgreSQL table can be done in several ways. I chose not to use the pandas.DataFrame.to_sql() method for inserting data into our PostgreSQL table because it is relatively slow. Instead, I use the COPY FROM STDIN command to add the data.

    This method is inspired by the post A Fast Method to Bulk Insert a Pandas DataFrame into Postgres by Ellis Michael Valentiner.

    The following code creates a temporary table with the same columns as the imagevectors table, and the data from the dataframe is copied into it. The data from the temporary table is then inserted into the imagevectors table, taking into account any potential conflicts that may arise due to duplicate keys (which could happen if the code is run multiple times).

    sio = StringIO()
    writer = csv.writer(sio)
    writer.writerows(df.values)
    sio.seek(0)
    
    cursor.execute("CREATE TEMPORARY TABLE tmp (file TEXT PRIMARY KEY, embedding VECTOR(1024)) ON COMMIT DROP;")
    
    cursor.copy_expert("COPY tmp FROM STDIN CSV", sio)
    
    cursor.execute(f"""INSERT INTO {table_name} (file, embedding)
                    SELECT * FROM tmp
                    ON conflict (file) DO NOTHING;""")
    
    conn.commit()
    
  6. To view the first 10 rows inserted into the table, run the following code:

    # Fetch the first 10 rows from the table
    cursor.execute(f"SELECT * FROM {table_name} LIMIT 10;")
    rows = cursor.fetchall()
    
    # Print the fetched rows
    for row in rows:
        print(f"Data row = ({row[0]}, {row[1]})")
    
  7. After populating the table with vector data, you can use this image collection to search for images that are most similar to a reference image or a text prompt. The workflow is summarized as follows:

    • Use the Vectorize Image API or the Vectorize Text API to generate vector embeddings of an image or text, respectively.
    • To calculate similarity and retrieve images, use SELECT statements and the built-in vector operators of the PostgreSQL database.
    • Display the retrieved images using the display_image_grid() function.

    Vector similarity search workflow.

  8. Let’s understand how a simple SELECT statement works. Consider the following query:

    SELECT * FROM imagevectors ORDER BY embedding <=> '[0.003, …, 0.034]' LIMIT 5
    

    This query computes the cosine distance (<=>) between the given vector ([0.003, …, 0.034]) and the vectors stored in the imagevectors table, sorts the results by the calculated distance, and returns the five most similar images (LIMIT 5).

    The PostgreSQL pgvector extension provides 3 operators that can be used to calculate similarity:

    Operator    Description
    <->         Euclidean distance
    <#>         Negative inner product
    <=>         Cosine distance

  9. An example of a text-to-image search process can be found below. For an example of an image-to-image search, please refer to the Jupyter Notebook in my GitHub repository.

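    # Load the Azure Computer Vision credentials saved in the .env file
    key = os.getenv("CV_KEY")
    endpoint = os.getenv("CV_ENDPOINT")
    
    # Generate the embedding of the text prompt
    txt = "a seahorse"
    txt_emb = text_embedding(txt, endpoint, key)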
    
    # Vector search
    topn = 9
    cursor.execute(f"SELECT * FROM {table_name} ORDER BY embedding <=> %s LIMIT {topn}", (str(txt_emb),))
    
    # Display the results
    rows = cursor.fetchall()
    for row in rows:
        print(row)
    
    # Display the similar images
    images = [row[0] for row in rows]
    captions = [f"Top {i+1}: {os.path.basename(images[i])}" for i in range(len(images))]
    ncols = 3
    nrows = math.ceil(len(images)/ncols)
    display_image_grid(images, captions, 'Search results for: "' + txt + '"', nrows, ncols)
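
To get a feel for what the three pgvector operators from step 8 compute, the following sketch reproduces the three quantities in plain Python for two toy 2-dimensional vectors. The function names are my own and the values are made up. The database computes these distances itself, so this code is only for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity behind the <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """Negative inner product, the quantity behind the <#> operator."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the quantity behind <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (illustrative values only)
a = [3.0, 4.0]
b = [4.0, 3.0]

print(l2_distance(a, b))             # roughly 1.414
print(negative_inner_product(a, b))  # -24.0
print(cosine_distance(a, b))         # roughly 0.04
```

For all three operators, a smaller value means a closer match, which is why the queries above sort in ascending order and keep the first rows.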
    

Summary and next steps

In this post, we explored the concepts of “embeddings”, “vector search”, and “vector database” and created a simple image vector similarity search system with the Azure Computer Vision Image Retrieval APIs and Azure Cosmos DB for PostgreSQL. If you'd like to dive deeper into this topic, here are some helpful resources.

You can also check out my previous blog posts about Azure Computer Vision 4.0 (Florence model):


👋 Hi, I am Foteini Savvidou!
An Electrical and Computer Engineering student and Microsoft AI MVP (Most Valuable Professional) from Greece.

🌈 LinkedIn | Blog | GitHub
