DEV Community

Cover image for Quick tip: Cosine Similarity revisited in SingleStore
Akmal Chaudhri for SingleStore

Posted on • Edited on

Quick tip: Cosine Similarity revisited in SingleStore

Abstract

In a previous article, we saw how to represent Cosine Similarity in SingleStore by combining the DOT_PRODUCT and SQRT functions. The SingleStore documentation also provides a way to implement a COSINE_SIMILARITY function. In this article, we'll see how.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iris Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iris-demo
  • Size: S-00

Create a Database and Table

In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db, as follows:

CREATE DATABASE IF NOT EXISTS iris_db;
Enter fullscreen mode Exit fullscreen mode

We'll also create the iris table using the new VECTOR data type, as follows:

USE iris_db;

CREATE TABLE IF NOT EXISTS iris (
    vector VECTOR(4),
    species VARCHAR(20)
);
Enter fullscreen mode Exit fullscreen mode

The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length, sepal_width, petal_length and petal_width. We can store these four column values together, as follows:

INSERT INTO iris VALUES
('[5.1,3.5,1.4,0.2]','Iris-setosa'),
('[4.9,3,1.4,0.2]','Iris-setosa'),
('[4.7,3.2,1.3,0.2]','Iris-setosa'),
('[4.6,3.1,1.5,0.2]','Iris-setosa'),
('[5,3.6,1.4,0.2]','Iris-setosa'),
...
('[6.7,3,5.2,2.3]','Iris-virginica'),
('[6.3,2.5,5,1.9]','Iris-virginica'),
('[6.5,3,5.2,2]','Iris-virginica'),
('[6.2,3.4,5.4,2.3]','Iris-virginica'),
('[5.9,3,5.1,1.8]','Iris-virginica');
Enter fullscreen mode Exit fullscreen mode

Only the first five and last five rows are shown above. The complete INSERT code listing is available in a GitHub Gist.

Create a Cosine Similarity Function

From the product documentation, we'll now define two functions that we can use:

DELIMITER //
CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
    squares VECTOR(4) = VECTOR_MUL(v, v);
    length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
    RETURN SCALAR_VECTOR_MUL(1/length, v);
END //
DELIMITER ;
Enter fullscreen mode Exit fullscreen mode

and

DELIMITER //
CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
    RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END //
DELIMITER ;
Enter fullscreen mode Exit fullscreen mode

In the SingleStore notebook environment, the code would be as follows:

CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
    squares VECTOR(4) = VECTOR_MUL(v, v);
    length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
    RETURN SCALAR_VECTOR_MUL(1/length, v);
END;
Enter fullscreen mode Exit fullscreen mode

and

CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
    RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END;
Enter fullscreen mode Exit fullscreen mode

In other words, remove the references to DELIMITER and //.

Query 1

First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length, sepal_width, petal_length and petal_width. We'll use the values [5.9,3,5.1,1.8] from the last row of the iris table, shown above.

SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.9,3,5.1,1.8]') DESC
LIMIT 1;
Enter fullscreen mode Exit fullscreen mode

The result should be:

+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+
Enter fullscreen mode Exit fullscreen mode

Query 2

Now, let's use some fictitious data values [5.2,3.6,1.5,0.3] to make a prediction.

SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.2,3.6,1.5,0.3]') DESC
LIMIT 1;
Enter fullscreen mode Exit fullscreen mode

The result should be:

+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+
Enter fullscreen mode Exit fullscreen mode

Cleanup:

DROP TABLE IF EXISTS iris;
DROP DATABASE IF EXISTS iris_db;
Enter fullscreen mode Exit fullscreen mode

Summary

SingleStore provides direct support for the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions. We can easily implement our own COSINE_SIMILARITY function as shown in this example.

Top comments (0)