DEV Community

V for Volisoft

Posted on • Originally published at volisoft.org on

NoSQL Architect vs AWS expert

Introduction

Let's compare two AWS DynamoDB database schemas: one designed by an AWS expert and another one generated by automated NoSQL Architect tool. The human expert’s schema incorporated best practices and years of experience in the field. On the other hand, our tool utilizes mathematical modeling techniques to optimize the schema based on the specific characteristics of the data.

The primary focus of this comparison is to analyze cost savings, particularly in terms of read queries and storage costs.

Case study

For the experiment we take an example from AWS blog. The author discusses the concept of a single-table design.The idea is to store all application data in a single table. This may seem counterintuitive to those familiar with relational databases. However, Amazon uses this approach for its internal designs. This is also the approach we use in our NoSQL Architect tool to generate database schemas.

To summarize, the blog post walks through converting a relational model into a single AWS DynamoDB table.

single-table-1.png
Relational model of the Alleycat application from the AWS Blog.

Setup

We start by listing the access patterns from the article:

  1. Get the results for each race by racer ID.
  2. Get a list of races by class ID.
  3. Get the best performance by racer for a class ID.
  4. Get the list of top scores by race ID.
  5. Get the second-by-second performance by racer for all races.

The database entities with their attributes:

  • classes: class-id, class-name
  • races: race-id, class-id
  • race-results: race-id, racer-id, second
  • racers: racer-id, racer-name

Last attribute appears later in the article but is not listed in the relational model. We include it here as a part of the data description.

The effectiveness of any database design depends on how the data is distributed and accessed. Key factors include query frequency, the average number of records returned per query, and the size of the data set.

The AWS blog post does not provide details about data distribution or query behavior. Therefore, we made reasonable assumptions about query frequencis and record counts, summarized below. We assume a dataset size of 1 million records.

Query # # of records returned Frequency PK SK Return
1 7 1000 “racer-id” “race-id”,“second”
2 30 1000 “class-id” “second” “race-id”
3 20000 100 “class-id” “racer-id”,“race-id”
4 7 1000 “race-id” “second” “racer-id”
5 1 1000 “racer-id” “race-id” “second”

Cost criteria

To compare performance, we focused on the cost of reads and data storage. We used the following scoring method:

  • Unique Record Queries: For queries that return a single record (e.g., query #5), the schema must uniquely identify the record. If the schema allows for this, the query scores 1. Otherwise, the score reflects the actual number of records returned.
  • Projected Attributes: For queries using indexes, we account for any missing attributes that require additional requests to the main table. Each extra request adds to the cost.
  • Storage: Storage costs include both the main table and index storage. We assign one unit of cost for each record stored in the main table or index.

The AWS blogpost schema

Below are the cost estimates for the schema suggested in the AWS blogpost.

Data attribute Schema
:race-id (“gsi1_:pk” “gsi1_:f” “main1_:f” “lsi1_:sk”)
:class-id (“gsi1_:pk”)
:racer-id (“gsi1_:f” “main1_:pk” “main1_:sk”)
:racer-name (“main1_:f”)
:second (“main1_:f” “gsi1_:sk”)
:class-name (“gsi1_:f”)

Column prefixes indicate table name, e.g. `main1` refers to the main table, `gsi1` refers to the GSI1 index and so on. Suffix denotes the column type, e.g. `pk` (partition key), `sk` (sort key) or `f` (unindexed field).

Query Cost
query::race-id,second->(“racer-id”) 7
query::racer-id,race-id->(“second”) 2
query::class-id,second->(“race-id”) 30
query::racer-id->(“race-id” “second”) 5
query::class-id->(“racer-id” “race-id”) 20000
storage 4000000
total 6044000

Costs of indivudual queries are listed in the Costs table. Total cost accounts for frequency of each query execution and storage costs,

Table Records #
:gsi 1 2000000
:lsi 1 1000000
:main 1 1000000

Optimized schema

From the cost breakdown, it is clear that the most expensive query is class-id->(“racer-id” “race-id”). This is due to the small number of unique class-id values (50) and the large number of records returned. Based on these insights, NoSQL Architect restructured the indexes, reducing the costs.

Data attribute Schema
:race-id (“gsi1_:f” “main1_:sk” “gsi2_:sk” “gsi2_:pk”)
:class-id (“gsi1_:pk”)
:racer-id (“gsi1_:f” “main1_:pk” “gsi2_:f”)
:second (“gsi1_:sk” “gsi2_:f” “gsi2_:sk”)
Query Cost
query::race-id,second->(“racer-id”) 7
query::racer-id,race-id->(“second”) 1
query::class-id,second->(“race-id”) 30
query::racer-id->(“race-id” “second”) 5
query::class-id->(“racer-id” “race-id”) 20000
storage 3000000
total 5043000

Note, the optimized index structure also reduced the amount of records in storage from 4 million to 3 million records total, or 25% reduction in storage costs.

Table Records #
:gsi 1 1000000
:gsi 2 1000000
:main 1 1000000

This optimized schema results in a 25% reduction in storage costs and an overall 16.5% reduction in total costs compared to the original schema!

Summary

The optimized schema generated by NoSQL Architect reduced costs by 16.5% compared to the schema created by an AWS expert. These savings were achieved by taking into account the unique characteristics of the data, such as query frequencies and result sizes.

Beyond cost, NoSQL Architect also offers significant time savings, generating the optimized schema in under a minute, whereas manual optimization and testing could take weeks or even months to achieve similar results.

Top comments (0)