We all love metrics. We all need numbers. And different stakeholders need different numbers: numbers that will drive key decisions inside your organization and for your customers. Becoming a data-driven organization requires having reliable data in the first place.
Whether daily, weekly or monthly, there is always a new report to be generated. At Funambol, most of these reports are generated by running a proprietary Java application that queries the production MySQL databases. Once created, they are encrypted and delivered to the different internal and external stakeholders.
If it ain't broke, don't fix it?
But how can you generate accurate reports, optimize resources and improve automation without querying, and thus impacting, the live deployment? Let us first look at the issues we faced while running the reports on the production database:
- the execution time of these reports might vary between a few minutes and many hours, according to the size of the deployment, and can be affected by the existing load on the production environment. Should we size the database for the peak load, handling traffic and generating reports at the same time? Should we dedicate a read replica to reporting activity only? Your cloud provider will be happy, but you would be paying for CPUs and storage you are effectively giving away.
- the long-running queries on the database make it very hard to achieve elasticity: either you scale your database or you keep the query running, but you cannot have both, and a failure in the report can be costly.
- reporting and serving users are different scenarios, and they require very different configurations and indexes at the database layer. A compromise covering both implies suboptimal database performance. Running reporting on an isolated deployment helps the production database too, as you can get rid of indexes you no longer need there, saving storage and a few write IOPS.
- the results might not be accurate, as data can change during the execution: it is a live system. If the number of users is 1000, you cannot report 991 active and 10 idle users, yet that is what you get when someone signs up while the report is running. The difference might not be significant, but what about the confidence level of your stakeholders?
What is a throw-away database?
We are all familiar with the "Pets vs. Cattle" analogy in software development. Here I define a throw-away database, our cattle database, as one that we keep running only for the minimum time needed to take care of our reporting.
Let us see how it helps us decouple KPI generation from user traffic while automating the process as well. Funambol deploys its infrastructure on AWS, so in the rest of this post I will draw an AWS architecture diagram and use AWS names and examples. But most of the steps can be achieved using the infrastructure and services provided by any of the major cloud providers.
Let us assume you have a web application that connects to an Amazon RDS Multi-AZ deployment. How can we decouple the report generation? We can first draw a diagram with the new components on the right side.
A throw-away workflow for our KPIs
Let us describe the main steps we have to put in place in our new workflow, starting the process from a management server. Throughout, we will use snippets that rely on the AWS CLI, focusing only on the key steps, but you can achieve the same using the AWS SDK in the language you like most.
- snapshotting the database: the first step is to freeze the data at the time we want the report to represent. We achieve that by creating a storage volume snapshot of the production DB instance. For example, if we want to generate the KPIs for a system at midnight, that is the time when we trigger the creation of the snapshot:
aws rds create-db-snapshot --db-snapshot-identifier "<my-kpi-snapshot>" --db-instance-identifier "<my-production-db>"
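Snapshot creation is asynchronous, so before moving on we can block until it completes; a minimal sketch using the AWS CLI waiter for the snapshot created above:

aws rds wait db-snapshot-available --db-snapshot-identifier "<my-kpi-snapshot>"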
- creating the cattle database: once the snapshot is available, we restore it into a new RDS instance:
aws rds restore-db-instance-from-db-snapshot --db-snapshot-identifier "<my-kpi-snapshot>" --db-instance-identifier "<my-kpi-db>" (...)
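Restoring can take several minutes as well, so we can again wait for the new instance to become available before configuring it:

aws rds wait db-instance-available --db-instance-identifier "<my-kpi-db>"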
- tuning the instance for reporting: we can now apply a different parameter group, used for the KPIs only, that tunes the database for the specific queries we are going to run. We can introduce new indexes as well, and optimize the instance class and type.
aws rds modify-db-instance --db-instance-identifier "<my-kpi-db>" --db-parameter-group-name "<my-kpi-parameter-group>" --vpc-security-group-ids "<my-kpi-security-group>" --backup-retention-period 0 --apply-immediately
(...)
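Since this instance serves reporting only, the reporting-specific indexes can be created right after the restore. A hypothetical sketch, assuming a MySQL client reachable from the management server (schema, table and column names are illustrative):

# add an index used only by the KPI queries
mysql -h "<my-kpi-db-endpoint>" -u "<user>" -p "<schema>" -e "CREATE INDEX idx_users_last_login ON users (last_login);"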
- we can now trigger our process to generate the KPIs on our new temporary environment.
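The reporting tool only needs the endpoint of the temporary instance, which we can look up at runtime instead of hard-coding it; a sketch where the final invocation is illustrative, not our actual tool:

KPI_HOST=$(aws rds describe-db-instances --db-instance-identifier "<my-kpi-db>" --query "DBInstances[0].Endpoint.Address" --output text)
java -jar kpi-report.jar --db-host "$KPI_HOST"   # hypothetical reporting command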
- once the results are available and delivered to the stakeholders, we can **turn off the database** and stop paying for it. We will still have the snapshot we used, and we can easily reprocess the data if we hit any issue.
aws rds delete-db-instance --db-instance-identifier "<my-kpi-db>" --skip-final-snapshot
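The snapshot carries a much smaller storage cost than the instance, but once the reports have been verified and the reprocessing window has passed, it can optionally be cleaned up too:

aws rds delete-db-snapshot --db-snapshot-identifier "<my-kpi-snapshot>"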
Monitoring
As any impact on the delivery of the reports could severely affect key stakeholders, we need a process in place to monitor the workflow and react promptly if it fails. We can also push the results to CloudWatch Logs and even rely on trend analysis and forecasting, implementing our own algorithms or using CloudWatch Anomaly Detection.
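For example, publishing a custom metric per run makes it easy to alarm on missing or slow reports; a minimal sketch, where the namespace and metric name are assumptions, not an existing convention:

aws cloudwatch put-metric-data --namespace "KPI/Reports" --metric-name "ReportDurationSeconds" --value "$DURATION" --unit Seconds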
Conclusions
Let us see the benefits of adding a throw-away database to our deployment:
- elasticity and cost optimization: the production database no longer needs to handle the sporadic peaks of report generation, and indexes, instance sizes and configuration can be optimized on both the production and the KPI databases to better match their specific workloads. The extra temporary cost of the second (single-AZ) RDS instance was significantly less than the average cost saving on the production databases, thanks to the scaling down that was possible on many of them.
- reliable and on-time results: all the reports and KPIs are now generated on a database that represents the frozen status at a specific time: no more data changing during the execution of the tools. The delivery time of the reports is more predictable, as it is not affected by the variable load of the production database. Failures in generating the reports are spotted quickly and can easily and promptly be rectified, as we still have the frozen status of the database (the snapshot).
- automation: a common automated process is used across all the live deployments of Funambol to generate the different reports and KPIs.
- decoupling: we can optimize database configuration and indexes according to each decoupled use case. And as the new solution is independent of the production database, it is easier to plan further improvements: as the size of the largest deployments increases, we could process specific data and move and consolidate it into a common warehousing solution.
Thanks for making it this far! I am always looking for feedback to make it better, so please feel free to reach out to me via LinkedIn or email.
Credits
This post originally appeared on the Funambol Tech Blog.