Data Tech Bridge

Amazon Neptune - Cheat Sheet

Overview

Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Neptune supports the Property Graph model, queried with Apache TinkerPop Gremlin or openCypher, and the W3C RDF model, queried with SPARQL.

Core Concepts

  1. Graph Database: A database optimized for storing and querying highly connected data, representing relationships as edges between nodes
  2. Property Graph Model: A graph model where nodes (vertices) and relationships (edges) can have properties
  3. RDF (Resource Description Framework): A standard model for data interchange on the Web using subject-predicate-object expressions
  4. Gremlin: A graph traversal language for Property Graph model
  5. SPARQL: A query language for RDF data
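
To make the Property Graph model above concrete, vertices and edges with properties can be sketched as plain Python dictionaries. This is a toy in-memory illustration of the concepts, not how Neptune stores or queries data:

```python
# Toy in-memory property graph: vertices and edges both carry properties.
vertices = {
    "v1": {"label": "person", "properties": {"name": "Alice", "age": 30}},
    "v2": {"label": "person", "properties": {"name": "Bob", "age": 35}},
}

edges = [
    # Edges connect two vertices and can have their own properties.
    {"id": "e1", "label": "knows", "from": "v1", "to": "v2",
     "properties": {"since": 2019}},
]

def neighbors(vertex_id, edge_label):
    """Return ids of vertices reachable from vertex_id over edges with edge_label."""
    return [e["to"] for e in edges
            if e["from"] == vertex_id and e["label"] == edge_label]

print(neighbors("v1", "knows"))  # ['v2']
```

A Gremlin traversal such as `g.V('v1').out('knows')` expresses the same lookup declaratively against the server-side graph.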

Neptune Architecture and Components

```
Neptune
├── Cluster
│   ├── Primary Instance
│   └── Read Replicas
├── Storage
│   ├── Shared Storage Volume
│   └── 6 Copies across 3 AZs
├── Query Languages
│   ├── Gremlin
│   ├── SPARQL
│   └── openCypher
├── Graph Models
│   ├── Property Graph
│   └── RDF
└── Features
    ├── ACID Transactions
    ├── High Availability
    ├── Point-in-time Recovery
    └── Fast Queries
```

Key Features and Specifications

| Feature | Description |
| --- | --- |
| Storage | Up to 64 TB per cluster |
| Instances | Up to 15 read replicas |
| Durability | 6 copies of data across 3 AZs |
| Availability | 99.99% SLA |
| Security | VPC isolation, IAM authentication, HTTPS, KMS encryption |
| Backup | Automated backups with point-in-time recovery |
| Monitoring | CloudWatch, CloudTrail integration |
| Scaling | Vertical scaling (instance size) and horizontal scaling (read replicas) |
| Bulk Loading | Neptune Bulk Loader for fast data ingestion |
| Query Languages | Gremlin, SPARQL, openCypher |
| Graph Models | Property Graph, RDF |

Performance Considerations

  1. Instance Types: Neptune offers memory-optimized instance families (such as R5 and R6g) suited to graph database workloads
  2. Read Scaling: Add up to 15 read replicas to scale read capacity
  3. Query Optimization: Use query hints and proper indexing for better performance
  4. Bulk Loading: Use Neptune Bulk Loader for efficient data ingestion
  5. Connection Pooling: Implement connection pooling to reduce connection overhead
  6. Caching: Use application-level caching for frequently accessed data
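
Point 6 above (application-level caching) can be sketched with Python's `functools.lru_cache`. The function and its return value here are placeholders standing in for a real Neptune traversal:

```python
from functools import lru_cache

call_count = 0  # tracks how often the "database" is actually hit

@lru_cache(maxsize=1024)
def get_follower_count(user_id: str) -> int:
    """Placeholder for an expensive Neptune traversal; results cached per user_id."""
    global call_count
    call_count += 1
    # In a real application this would execute a Gremlin/SPARQL query.
    return len(user_id) * 10  # dummy result for illustration

get_follower_count("alice")
get_follower_count("alice")  # served from cache; no second "query"
print(call_count)  # 1
```

In production, prefer a shared cache (e.g., ElastiCache) with an explicit TTL so cached graph results do not go stale indefinitely.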

Service Limits

  1. Maximum Storage: 64 TB per cluster
  2. Maximum Instances: 1 primary + 15 read replicas
  3. Maximum Connections: Depends on instance type (typically thousands)
  4. Maximum Query Timeout: 120 seconds (configurable)
  5. Bulk Load File Size: Maximum 150 GB per file
  6. Concurrent Bulk Load Jobs: 1 active job per cluster at a time
  7. Maximum Property Size: 55 MB for a single property value

Data Ingestion

  1. Neptune Bulk Loader: Fastest way to load data from S3
  2. Bulk Loader Format: CSV format for Property Graph, N-Quads/N-Triples for RDF
  3. Bulk Load Rate: Up to millions of vertices/edges per minute
  4. Streaming Data: Use AWS Lambda or custom applications with Neptune APIs
  5. Replayability: Neptune maintains a transaction log for point-in-time recovery
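
The Bulk Loader is driven by an HTTP `POST` to the cluster's `/loader` endpoint. The sketch below only builds the JSON request body; the endpoint, bucket path, and role ARN are placeholders you would replace with your own values:

```python
import json

# Placeholder endpoint; a real one looks like
# https://<cluster>.cluster-xxxx.<region>.neptune.amazonaws.com:8182
NEPTUNE_ENDPOINT = "https://my-neptune-cluster.example.com:8182"

def build_loader_request(s3_source: str, fmt: str, iam_role_arn: str) -> dict:
    """Build the JSON body for a POST to the Neptune /loader endpoint."""
    return {
        "source": s3_source,          # S3 URI of the file(s) or prefix to load
        "format": fmt,                # e.g. "csv" for Property Graph, "ntriples" for RDF
        "iamRoleArn": iam_role_arn,   # role Neptune assumes to read from S3
        "region": "us-east-1",
        "failOnError": "FALSE",       # keep loading past bad records
        "parallelism": "MEDIUM",
    }

payload = build_loader_request(
    "s3://my-bucket/graph-data/", "csv",
    "arn:aws:iam::123456789012:role/NeptuneLoadFromS3")
print(json.dumps(payload, indent=2))
# The actual request would be: POST {NEPTUNE_ENDPOINT}/loader with this JSON body.
```

The response includes a load id you can poll via `GET /loader/<id>` to track progress.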

Example Bulk Load Calculation

  1. For a dataset with 100 million vertices and 500 million edges:
    • Using r5.8xlarge instance
    • Approximate load time: 2-4 hours
    • Throughput: ~2.5-3 million edges per minute
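
The estimate above can be sanity-checked with simple arithmetic, using the quoted edge throughput band (rough figures, not guaranteed rates):

```python
edges = 500_000_000

# Rough throughput band quoted above, in edges per minute.
low_rate, high_rate = 2_500_000, 3_000_000

fastest_hours = edges / high_rate / 60  # best case
slowest_hours = edges / low_rate / 60   # worst case
print(f"{fastest_hours:.1f}-{slowest_hours:.1f} hours")  # 2.8-3.3 hours
```

Loading the 100 million vertices as well pushes the total toward the upper end of the 2-4 hour estimate.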

Comparison of Graph Models

| Feature | Property Graph (Gremlin/openCypher) | RDF (SPARQL) |
| --- | --- | --- |
| Data Model | Vertices, Edges, Properties | Triples (Subject-Predicate-Object) |
| Schema | Schema-optional | Schema-optional with ontology support |
| Query Language | Gremlin, openCypher | SPARQL |
| Use Cases | Recommendation engines, fraud detection | Knowledge graphs, semantic web |
| Indexing | Automatic indexing | Automatic indexing |
| Standards | Apache TinkerPop | W3C Standard |
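
To make the comparison concrete, the same fact ("Alice knows Bob") takes a different shape in each model. This is a conceptual sketch, independent of Neptune's wire formats:

```python
# Property Graph: a labeled edge between two vertices, which can itself
# carry properties (here, when the relationship started).
pg_edge = {"from": "alice", "label": "knows", "to": "bob",
           "properties": {"since": 2019}}

# RDF: a bare subject-predicate-object triple. Attaching detail to the
# relationship requires additional triples (e.g., reification) rather
# than properties on the edge itself.
rdf_triple = ("<http://example.org/alice>",
              "<http://example.org/knows>",
              "<http://example.org/bob>")

print(pg_edge["label"], rdf_triple[1])
```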

Open Source Compatibility

  1. Apache TinkerPop: Neptune is compatible with Apache TinkerPop Gremlin but with some differences:

    • Neptune doesn't support all Gremlin steps
    • Neptune adds optimizations for distributed query execution
    • Neptune implements server-side sessions differently
  2. SPARQL: Neptune supports SPARQL 1.1 Query Language with some extensions and limitations:

    • Full support for SPARQL 1.1 Query Language
    • Limited support for SPARQL 1.1 Update
    • No support for federated queries
  3. openCypher: Neptune supports openCypher with some limitations compared to Neo4j:

    • Subset of openCypher functionality
    • Different transaction semantics

High Availability and Disaster Recovery

  1. Multi-AZ Deployment: Automatic failover to a replica in a different AZ
  2. Read Replicas: Up to 15 read replicas for scaling read operations
  3. Automated Backups: Continuous automated backups with retention configurable up to 35 days
  4. Manual Snapshots: User-initiated snapshots with indefinite retention
  5. Point-in-Time Recovery: Restore to any second in the backup retention period

Monitoring with CloudWatch

| Metric | Description | Recommended Alarm |
| --- | --- | --- |
| CPUUtilization | CPU utilization percentage | >70% for sustained periods |
| MainRequestQueuePendingRequests | Number of pending requests | >500 requests |
| GremlinRequestsPerSec | Rate of Gremlin requests | Depends on application |
| GremlinHttp4xx | HTTP 4xx errors for Gremlin | >0 errors |
| BufferCacheHitRatio | Buffer cache hit ratio | <90% |
| TotalRequestsPerSec | Total requests per second | Depends on application |
| VolumeBytesUsed | Storage volume bytes used | >80% of maximum |
| StatsNumStatementsScanned | Number of statements scanned | Monitor for query optimization |
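
The CPU alarm in the table above could be created with boto3's `put_metric_alarm`. The sketch below only assembles the parameters; the instance identifier and SNS topic ARN are placeholders, and the actual API call is shown commented out since it requires AWS credentials:

```python
# Parameters for a sustained-CPU alarm on a Neptune instance.
alarm_kwargs = {
    "AlarmName": "neptune-cpu-high",
    "Namespace": "AWS/Neptune",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": "my-neptune-instance"}],
    "Statistic": "Average",
    "Period": 300,                 # 5-minute datapoints
    "EvaluationPeriods": 3,        # "sustained" = 3 consecutive periods
    "Threshold": 70.0,             # matches the >70% guidance above
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:alerts"],
}

# With credentials configured, the call would be:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
print(alarm_kwargs["AlarmName"])
```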

Implementing Throttling and Overcoming Rate Limits

  1. Connection Pooling: Implement connection pooling in your application
  2. Exponential Backoff: Use exponential backoff for retries
  3. Request Distribution: Distribute requests across read replicas
  4. Batch Operations: Use batch operations where possible
  5. Query Optimization: Optimize queries to reduce resource consumption
  6. Instance Scaling: Scale up instance size for higher throughput
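
Point 2 above (exponential backoff) can be sketched as a small retry wrapper with capped delays and full jitter. The `flaky` operation below is a stand-in for a Neptune request that gets throttled:

```python
import random
import time

def with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Delay doubles per attempt (capped), with random jitter so many
            # clients don't retry in lockstep and re-throttle the cluster.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)

# Demo: an operation that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(with_backoff(flaky))  # ok
```

In a real client you would catch only retryable errors (throttling, timeouts), not every exception.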

Throughput and Latency Characteristics

  1. Read Throughput: Scales linearly with read replicas
  2. Write Throughput: Limited by primary instance capacity
  3. Query Latency: Typically milliseconds for simple queries, seconds for complex queries
  4. Bulk Load Throughput: Up to millions of vertices/edges per minute
  5. Network Latency: Reduced by placing application in same VPC/region

Security and Compliance

  1. VPC Isolation: Neptune runs within a VPC
  2. IAM Authentication: IAM database authentication
  3. Encryption: Encryption at rest using KMS and in-transit using SSL/TLS
  4. Audit Logging: Integration with CloudTrail
  5. Compliance: SOC, PCI DSS, HIPAA eligible, ISO, and more

Integration with AWS Services

  1. Amazon S3: Source for bulk loading data
  2. AWS Lambda: For serverless graph processing
  3. Amazon SageMaker: For machine learning on graph data
  4. AWS Glue: For ETL jobs to prepare data for Neptune
  5. Amazon CloudWatch: For monitoring and alerting
  6. AWS IAM: For authentication and authorization

Best Practices

  1. Query Design: Design efficient queries that limit traversal depth
  2. Data Modeling: Optimize data model for your access patterns
  3. Instance Sizing: Choose appropriate instance size based on workload
  4. Monitoring: Set up CloudWatch alarms for key metrics
  5. Backup Strategy: Implement regular backups and test recovery
  6. Cost Optimization: Use appropriate instance types and number of replicas
