1. Purpose
Comparing search speed between GraphDB and RDB in some scenarios.
2. Overview
I compared search speed of GraphDB and RDB for data which expresses the spine-leaf architecture.
I found that dataset which has a lot of nodes and depth above a certain level is suitable for GraphDB.
3. Introduction
Recently I became interest in GraphDB.
In paticular, I'm interested in applying GraphDB to managing configuration management database(CMDB) of IT systems.
In this post, I confirmed the gap between GraphDB and RDB from the perspective of performance.
4. Assumption
Environment
I selected Neo4j as GraphDB, and PostgreSQL as RDB.
I constructed the test environment with docker containers.
The definition of containers is as follows.
version: '3'
services:
postgres:
image: postgres:15
ports:
- 5433:5432
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
neo4j:
image: neo4j:5.26.0
ports:
- 7474:7474
- 7687:7687
adminer:
image: adminer
restart: always
ports:
- 8080:8080
Target Architecture
I prepared three scenarios for comparison.
These scenarios are based on the architecture which consists of spine-leaf network architecture and virtualization environments.
Diagrams of the architectures are as follows.
Scenario1
Scenario1 is so simple.
The total number of nodes is 19, and the depth is 4.
Scenario2
Scenario2 is more complex than scenario1.
The number of servers is six times as much as scenario1.
And also, connections between leaf switches and servers are full mesh.
Each server has two virtual machines, and each service is running on three virtual machines which belong to each different leaf switches.
The total number of nodes is 273, and the depth is 4.
Scenario3
Scenario3 is deeper than scenario2.
Each virtual machine has a pod, and each service is running on three pods which belong to each different leaf pairs.
The total number of nodes is 417, and the depth is 5.
Data Schema Definition
Neo4j(GraphDB)
I defined each node which has just its node name.
Relations between nodes are defined as edges, and I defined two types of edge.
One is has_parent which is the edge from the child node to its parent node, another is has_child which is the edge from the parent node to its child node.
In the case of scenario1, I defined data about spine and leaf swiches as follows.
CREATE (ssw1: SpineSwitch {name: "ssw1"})
CREATE (ssw2: SpineSwitch {name: "ssw2"})
CREATE (ssw3: SpineSwitch {name: "ssw3"})
CREATE (lsw1: LeafSwitch {name: "lsw1"})
CREATE (lsw2: LeafSwitch {name: "lsw2"})
CREATE (lsw3: LeafSwitch {name: "lsw3"})
CREATE (lsw4: LeafSwitch {name: "lsw4"})
CREATE (ssw1)-[:has_child]->(lsw1)
CREATE (ssw1)-[:has_child]->(lsw2)
CREATE (ssw1)-[:has_child]->(lsw3)
CREATE (ssw1)-[:has_child]->(lsw4)
CREATE (ssw2)-[:has_child]->(lsw1)
CREATE (ssw2)-[:has_child]->(lsw2)
CREATE (ssw2)-[:has_child]->(lsw3)
CREATE (ssw2)-[:has_child]->(lsw4)
CREATE (ssw3)-[:has_child]->(lsw1)
CREATE (ssw3)-[:has_child]->(lsw2)
CREATE (ssw3)-[:has_child]->(lsw3)
CREATE (ssw3)-[:has_child]->(lsw4)
CREATE (lsw1)-[:has_child]->(sv1)
CREATE (lsw2)-[:has_child]->(sv2)
CREATE (lsw3)-[:has_child]->(sv3)
CREATE (lsw4)-[:has_child]->(sv4);
PostgreSQL(RDB)
I defined two tables.
One is nodes table which store each node data, another is relationships table which store relation data between nodes.
CREATE TABLE nodes (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
type VARCHAR(50) NOT NULL
);
CREATE TABLE relationships (
id SERIAL PRIMARY KEY,
parent_id INT NOT NULL,
child_id INT NOT NULL,
relationship_type VARCHAR(50) NOT NULL,
FOREIGN KEY (parent_id) REFERENCES nodes (id),
FOREIGN KEY (child_id) REFERENCES nodes (id)
);
Search Query
I used the query for searching paths from service1 to spine switches both of which exist in the end of architecture in order to compare search speed.
I measured each of search speed by using following scripts.
Script for Neo4j(GraphDB)
from neo4j import GraphDatabase
import time
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
def query_execution_time():
with driver.session() as session:
start_time = time.time()
result = session.run("""
MATCH path = (srv:Service {name: "srv1"})-[:has_parent*]->(ssw:SpineSwitch)
RETURN path, nodes(path) AS allNodes;
""")
end_time = time.time()
# display results
for ind, record in enumerate(result):
nodes = record["allNodes"]
print(f"Path nodes: {ind+1}")
for node in nodes:
print(f"Node: {node['name']}, Type: {list(node.labels)}")
# Search speed
print(f"Execution time: {end_time - start_time:.10f} seconds")
query_execution_time()
driver.close()
Script for PostgreSQL(RDB)
import psycopg2
import time
# Connect to PostgreSQL
connection = psycopg2.connect(
host="localhost",
port=5433,
dbname="postgres",
user="postgres",
password="postgres"
)
# search for path from srv1 -> spine(Scenario1)
def get_path_from_srv_to_spine_scenario1(srv):
paths = []
start_time = time.time()
with connection.cursor() as cursor:
# search for vm
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'VirtualMachine';
""", (srv,))
vms = cursor.fetchall()
print("Virtual Machines connected to srv1:")
for vm in vms:
print(f" {vm[0]}")
# search for server
for vm in vms:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'Server';
""", (vm[0],))
servers = cursor.fetchall()
print(f"\nServers connected to {vm[0]}:")
for server in servers:
print(f" {server[0]}")
# search for leaf
for server in servers:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'LeafSwitch';
""", (server[0],))
leaf_switches = cursor.fetchall()
print(f"\nLeafSwitches connected to {server[0]}:")
for lsw in leaf_switches:
print(f" {lsw[0]}")
# search for spine
for lsw in leaf_switches:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'SpineSwitch';
""", (lsw[0],))
spine_switches = cursor.fetchall()
print(f"\nSpineSwitches connected to {lsw[0]}:")
for ssw in spine_switches:
print(f" {ssw[0]}")
paths.append(f"{srv}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
end_time = time.time()
print(paths)
print(f"Execution time: {end_time - start_time:.10f} seconds")
# search for path from srv1 -> spine(Scenario2)
def get_path_from_srv_to_spine_scenario2(srv):
paths = []
start_time = time.time()
with connection.cursor() as cursor:
# search for vm
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'VirtualMachine';
""", (srv,))
vms = cursor.fetchall()
print("Virtual Machines connected to srv1:")
for vm in vms:
print(f" {vm[0]}")
# search for server
for vm in vms:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'Server';
""", (vm[0],))
servers = cursor.fetchall()
print(f"\nServers connected to {vm[0]}:")
for server in servers:
print(f" {server[0]}")
# search for leaf
for server in servers:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'LeafSwitch';
""", (server[0],))
leaf_switches = cursor.fetchall()
print(f"\nLeafSwitches connected to {server[0]}:")
for lsw in leaf_switches:
print(f" {lsw[0]}")
# search for spine
for lsw in leaf_switches:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'SpineSwitch';
""", (lsw[0],))
spine_switches = cursor.fetchall()
print(f"\nSpineSwitches connected to {lsw[0]}:")
for ssw in spine_switches:
print(f" {ssw[0]}")
paths.append(f"{srv}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
end_time = time.time()
print(paths)
print(f"Execution time: {end_time - start_time:.10f} seconds")
# search for path from srv1 -> spine(Scenario3)
def get_path_from_srv_to_spine_scenario3(srv):
paths = []
start_time = time.time()
with connection.cursor() as cursor:
# search for pod
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'Pod';
""", (srv,))
pods = cursor.fetchall()
print("Pods connected to srv1:")
for pod in pods:
print(f" {pod[0]}")
# search for vm
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'VirtualMachine';
""", (pod[0],))
vms = cursor.fetchall()
print("Virtual Machines connected to srv1:")
for vm in vms:
print(f" {vm[0]}")
# search for server
for vm in vms:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'Server';
""", (vm[0],))
servers = cursor.fetchall()
print(f"\nServers connected to {vm[0]}:")
for server in servers:
print(f" {server[0]}")
# search for leaf
for server in servers:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'LeafSwitch';
""", (server[0],))
leaf_switches = cursor.fetchall()
print(f"\nLeafSwitches connected to {server[0]}:")
for lsw in leaf_switches:
print(f" {lsw[0]}")
# search for spine
for lsw in leaf_switches:
cursor.execute("""
SELECT p.name
FROM nodes p
JOIN relationships r ON p.id = r.parent_id
JOIN nodes c ON r.child_id = c.id
WHERE c.name = %s AND p.type = 'SpineSwitch';
""", (lsw[0],))
spine_switches = cursor.fetchall()
print(f"\nSpineSwitches connected to {lsw[0]}:")
for ssw in spine_switches:
print(f" {ssw[0]}")
paths.append(f"{srv}->{pod[0]}->{vm[0]}->{server[0]}->{lsw[0]}->{ssw[0]}")
end_time = time.time()
print(paths)
print(f"Execution time: {end_time - start_time:.10f} seconds")
get_path_from_srv_to_spine_scenario1("srv1")
# get_path_from_srv_to_spine_scenario2("srv1")
# get_path_from_srv_to_spine_scenario3("srv1")
connection.close()
5. Result
Search Speed Comparison in each scenario
6. Consideration
I found GraphDB suitable for expressing entity which has dozens of nodes and depth above a certain level after receiving the result.
On the contrary, the gap between GraphDB and RDB is insignificant in the case of the small dataset even if its concept maches Graph expression well.
And also, I found the gap of query plainness too big to ignore.
The query of Cypher which is query used in Neo4j is very simple, but SQL is so complex.
Therefore, apart from the gap of performance, this may become a big reason we select GraphDB in order to express information which maches the format of Graph well.
Top comments (0)