SONE

Posted on Jan 13

GraphDB for CMDB

#graphdb #postgressql #python #cmdb

1. Purpose

Comparing search speed between GraphDB and RDB in some scenarios.

2. Overview

I compared search speed of GraphDB and RDB for data which expresses the spine-leaf architecture.
I found that dataset which has a lot of nodes and depth above a certain level is suitable for GraphDB.

3. Introduction

Recently I became interest in GraphDB.
In paticular, I'm interested in applying GraphDB to managing configuration management database(CMDB) of IT systems.
In this post, I confirmed the gap between GraphDB and RDB from the perspective of performance.

4. Assumption

Environment

I selected Neo4j as GraphDB, and PostgreSQL as RDB.
I constructed the test environment with docker containers.
The definition of containers is as follows.

version: '3'
services:
  postgres:
    image: postgres:15
    ports:
      - 5433:5432
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
  neo4j:
    image: neo4j:5.26.0
    ports:
      - 7474:7474
      - 7687:7687
  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080

Target Architecture

I prepared three scenarios for comparison.
These scenarios are based on the architecture which consists of spine-leaf network architecture and virtualization environments.
Diagrams of the architectures are as follows.

Scenario1

Scenario1 is so simple.
The total number of nodes is 19, and the depth is 4.

Scenario2

Scenario2 is more complex than scenario1.
The number of servers is six times as much as scenario1.
And also, connections between leaf switches and servers are full mesh.
Each server has two virtual machines, and each service is running on three virtual machines which belong to each different leaf switches.
The total number of nodes is 273, and the depth is 4.

Scenario3

Scenario3 is deeper than scenario2.
Each virtual machine has a pod, and each service is running on three pods which belong to each different leaf pairs.
The total number of nodes is 417, and the depth is 5.

Data Schema Definition

Neo4j(GraphDB)

I defined each node which has just its node name.
Relations between nodes are defined as edges, and I defined two types of edge.
One is has_parent which is the edge from the child node to its parent node, another is has_child which is the edge from the parent node to its child node.
In the case of scenario1, I defined data about spine and leaf swiches as follows.

CREATE (ssw1: SpineSwitch {name: "ssw1"})
CREATE (ssw2: SpineSwitch {name: "ssw2"})
CREATE (ssw3: SpineSwitch {name: "ssw3"})
CREATE (lsw1: LeafSwitch {name: "lsw1"})
CREATE (lsw2: LeafSwitch {name: "lsw2"})
CREATE (lsw3: LeafSwitch {name: "lsw3"})
CREATE (lsw4: LeafSwitch {name: "lsw4"})
CREATE (ssw1)-[:has_child]->(lsw1)
CREATE (ssw1)-[:has_child]->(lsw2)
CREATE (ssw1)-[:has_child]->(lsw3)
CREATE (ssw1)-[:has_child]->(lsw4)
CREATE (ssw2)-[:has_child]->(lsw1)
CREATE (ssw2)-[:has_child]->(lsw2)
CREATE (ssw2)-[:has_child]->(lsw3)
CREATE (ssw2)-[:has_child]->(lsw4)
CREATE (ssw3)-[:has_child]->(lsw1)
CREATE (ssw3)-[:has_child]->(lsw2)
CREATE (ssw3)-[:has_child]->(lsw3)
CREATE (ssw3)-[:has_child]->(lsw4)
CREATE (lsw1)-[:has_child]->(sv1)
CREATE (lsw2)-[:has_child]->(sv2)
CREATE (lsw3)-[:has_child]->(sv3)
CREATE (lsw4)-[:has_child]->(sv4);

PostgreSQL(RDB)

I defined two tables.
One is nodes table which store each node data, another is relationships table which store relation data between nodes.

CREATE TABLE nodes (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    type VARCHAR(50) NOT NULL
);

CREATE TABLE relationships (
    id SERIAL PRIMARY KEY,
    parent_id INT NOT NULL,
    child_id INT NOT NULL,
    relationship_type VARCHAR(50) NOT NULL,
    FOREIGN KEY (parent_id) REFERENCES nodes (id),
    FOREIGN KEY (child_id) REFERENCES nodes (id)
);

Search Query

I used the query for searching paths from service1 to spine switches both of which exist in the end of architecture in order to compare search speed.
I measured each of search speed by using following scripts.

Script for Neo4j(GraphDB)

from neo4j import GraphDatabase
import time

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))

def query_execution_time():
    with driver.session() as session:
        start_time = time.time()

        result = session.run("""
            MATCH path = (srv:Service {name: "srv1"})-[:has_parent*]->(ssw:SpineSwitch)
            RETURN path, nodes(path) AS allNodes;
        """)

        end_time = time.time()

        # display results
        for ind, record in enumerate(result):
            nodes = record["allNodes"]
            print(f"Path nodes: {ind+1}")
            for node in nodes:
                print(f"Node: {node['name']}, Type: {list(node.labels)}")
    # Search speed
    print(f"Execution time: {end_time - start_time:.10f} seconds")

query_execution_time()
driver.close()

Script for PostgreSQL(RDB)

import psycopg2
import time

# Connect to PostgreSQL
connection = psycopg2.connect(
    host="localhost",
    port=5433,
    dbname="postgres",
    user="postgres",
    password="postgres"
)

# search for path from srv1 -> spine(Scenario1)
def get_path_from_srv_to_spine_scenario1(srv):
    paths = []
    start_time = time.time()

    with connection.cursor() as cursor:
        # search for vm
        cursor.execute("""
            SELECT p.name
            FROM nodes p
            JOIN relationships r ON p.id = r.parent_id
            JOIN nodes c ON r.child_id = c.id
            WHERE c.name = %s AND p.type = 'VirtualMachine';
        """, (srv,))
        vms = cursor.fetchall()

        print("Virtual Machines connected to srv1:")
        for vm in vms:
            print(f"  {vm[0]}")

        # search for server
        for vm in vms:
            cursor.execute("""
                SELECT p.name
                FROM nodes p
                JOIN relationships r ON p.id = r.parent_id
                JOIN nodes c ON r.child_id = c.id
                WHERE c.name = %s AND p.type = 'Server';
            """, (vm[0],))
            servers = cursor.fetchall()

            print(f"\nServers connected to {vm[0]}:")
            for server in servers:
                print(f"  {server[0]}")

            # search for leaf
            for server in servers:
                cursor.execute("""
                    SELECT p.name
                    FROM nodes p
                    JOIN relationships r ON p.id = r.parent_id
                    JOIN nodes c ON r.child_id = c.id
                    WHERE c.name = %s AND p.type = 'LeafSwitch';
                """, (server[0],))
                leaf_switches = cursor.fetchall()

                print(f"\nLeafSwitches connected to {server[0]}:")
                for lsw in leaf_switches:
                    print(f"  {lsw[0]}")

                # search for spine
                for lsw in leaf_switches:
                    cursor.execute("""
                        SELECT p.name
                        FROM nodes p
                        JOIN relationships r ON p.id = r.parent_id
                        JOIN nodes c ON r.child_id = c.id
                        WHERE c.name = %s AND p.type = 'SpineSwitch';
                    """, (lsw[0],))
                    spine_switches = cursor.fetchall()
                    print(f"\nSpineSwitches connected to {lsw[0]}:")
                    for ssw in spine_switches:
                        print(f"  {ssw[0]}")
                        paths.append(f"{srv}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
    end_time = time.time()
    print(paths)
    print(f"Execution time: {end_time - start_time:.10f} seconds")

# search for path from srv1 -> spine(Scenario2)
def get_path_from_srv_to_spine_scenario2(srv):
    paths = []
    start_time = time.time()

    with connection.cursor() as cursor:
        # search for vm
        cursor.execute("""
            SELECT p.name
            FROM nodes p
            JOIN relationships r ON p.id = r.parent_id
            JOIN nodes c ON r.child_id = c.id
            WHERE c.name = %s AND p.type = 'VirtualMachine';
        """, (srv,))
        vms = cursor.fetchall()

        print("Virtual Machines connected to srv1:")
        for vm in vms:
            print(f"  {vm[0]}")

        # search for server
        for vm in vms:
            cursor.execute("""
                SELECT p.name
                FROM nodes p
                JOIN relationships r ON p.id = r.parent_id
                JOIN nodes c ON r.child_id = c.id
                WHERE c.name = %s AND p.type = 'Server';
            """, (vm[0],))
            servers = cursor.fetchall()

            print(f"\nServers connected to {vm[0]}:")
            for server in servers:
                print(f"  {server[0]}")

            # search for leaf
            for server in servers:
                cursor.execute("""
                    SELECT p.name
                    FROM nodes p
                    JOIN relationships r ON p.id = r.parent_id
                    JOIN nodes c ON r.child_id = c.id
                    WHERE c.name = %s AND p.type = 'LeafSwitch';
                """, (server[0],))
                leaf_switches = cursor.fetchall()

                print(f"\nLeafSwitches connected to {server[0]}:")
                for lsw in leaf_switches:
                    print(f"  {lsw[0]}")

                # search for spine
                for lsw in leaf_switches:
                    cursor.execute("""
                        SELECT p.name
                        FROM nodes p
                        JOIN relationships r ON p.id = r.parent_id
                        JOIN nodes c ON r.child_id = c.id
                        WHERE c.name = %s AND p.type = 'SpineSwitch';
                    """, (lsw[0],))
                    spine_switches = cursor.fetchall()
                    print(f"\nSpineSwitches connected to {lsw[0]}:")
                    for ssw in spine_switches:
                        print(f"  {ssw[0]}")
                        paths.append(f"{srv}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
    end_time = time.time()
    print(paths)
    print(f"Execution time: {end_time - start_time:.10f} seconds")

# search for path from srv1 -> spine(Scenario3)
def get_path_from_srv_to_spine_scenario3(srv):
    paths = []
    start_time = time.time()

    with connection.cursor() as cursor:
        # search for pod
        cursor.execute("""
            SELECT p.name
            FROM nodes p
            JOIN relationships r ON p.id = r.parent_id
            JOIN nodes c ON r.child_id = c.id
            WHERE c.name = %s AND p.type = 'Pod';
        """, (srv,))
        pods = cursor.fetchall()

        print("Pods connected to srv1:")
        for pod in pods:
            print(f"  {pod[0]}")
            # search for vm
            cursor.execute("""
                SELECT p.name
                FROM nodes p
                JOIN relationships r ON p.id = r.parent_id
                JOIN nodes c ON r.child_id = c.id
                WHERE c.name = %s AND p.type = 'VirtualMachine';
            """, (pod[0],))
            vms = cursor.fetchall()

            print("Virtual Machines connected to srv1:")
            for vm in vms:
                print(f"  {vm[0]}")

            # search for server
            for vm in vms:
                cursor.execute("""
                    SELECT p.name
                    FROM nodes p
                    JOIN relationships r ON p.id = r.parent_id
                    JOIN nodes c ON r.child_id = c.id
                    WHERE c.name = %s AND p.type = 'Server';
                """, (vm[0],))
                servers = cursor.fetchall()

                print(f"\nServers connected to {vm[0]}:")
                for server in servers:
                    print(f"  {server[0]}")

                # search for leaf
                for server in servers:
                    cursor.execute("""
                        SELECT p.name
                        FROM nodes p
                        JOIN relationships r ON p.id = r.parent_id
                        JOIN nodes c ON r.child_id = c.id
                        WHERE c.name = %s AND p.type = 'LeafSwitch';
                    """, (server[0],))
                    leaf_switches = cursor.fetchall()

                    print(f"\nLeafSwitches connected to {server[0]}:")
                    for lsw in leaf_switches:
                        print(f"  {lsw[0]}")

                    # search for spine
                    for lsw in leaf_switches:
                        cursor.execute("""
                            SELECT p.name
                            FROM nodes p
                            JOIN relationships r ON p.id = r.parent_id
                            JOIN nodes c ON r.child_id = c.id
                            WHERE c.name = %s AND p.type = 'SpineSwitch';
                        """, (lsw[0],))
                        spine_switches = cursor.fetchall()
                        print(f"\nSpineSwitches connected to {lsw[0]}:")
                        for ssw in spine_switches:
                            print(f"  {ssw[0]}")
                            paths.append(f"{srv}->{pod[0]}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
    end_time = time.time()
    print(paths)
    print(f"Execution time: {end_time - start_time:.10f} seconds")

get_path_from_srv_to_spine_scenario1("srv1")
# get_path_from_srv_to_spine_scenario2("srv1")
# get_path_from_srv_to_spine_scenario3("srv1")

connection.close()

5. Result

Search Speed Comparison in each scenario

6. Consideration

I found GraphDB suitable for expressing entity which has dozens of nodes and depth above a certain level after receiving the result.
On the contrary, the gap between GraphDB and RDB is insignificant in the case of the small dataset even if its concept maches Graph expression well.

And also, I found the gap of query plainness too big to ignore.
The query of Cypher which is query used in Neo4j is very simple, but SQL is so complex.
Therefore, apart from the gap of performance, this may become a big reason we select GraphDB in order to express information which maches the format of Graph well.

DEV Community