As a scaling software company, we experience unpredictable spikes in platform usage that can come from anywhere in the world. This volatility complicates instance scaling and other core dev infrastructure. If our instances don’t scale properly, our users experience a less performant product. If our servers are too large, we waste money.
Hooking into Datadog gave our company newfound and much-needed insights into our AWS infrastructure. Moreover, the alerting functionality built into Datadog’s event monitoring unlocked the ability for us to monitor, investigate, and ultimately resolve issues with our infrastructure. However, although Datadog alerted us to issues, we still had to manually manage our AWS infra. Our goal was to build a platform that could remediate infra challenges in an automated way by leveraging Datadog alerts.
...
Examples of Datadog Remediation with WayScript
1) AWS EC2 Instance Management.
I alluded to this use case in the intro paragraph. With Datadog alerting and WayScript, users can set up alerts for high or low usage levels on a particular instance. To do this, users set up a Datadog trigger on WayScript.
Once your Datadog trigger is set up, you can run Shell scripts, Python, JavaScript, Java, or SQL queries directly from WayScript. In this case, we use the boto3 Python library to automate a variety of EC2 remediation functions (full example tutorial and ReadMe).
For example, if we get an alert that our instance is under high traffic load, we can automatically run a script that starts another EC2 instance in our system with this type of code:
import boto3
from botocore.exceptions import ClientError

def turn_instance_on(instance_id):
    # build_client() and check_instance_state() are helpers defined in the full example
    ec2 = build_client()
    current_state = check_instance_state(instance_id)
    if current_state == 'not running':
        try:
            response = ec2.start_instances(InstanceIds=[instance_id], DryRun=False)
            # start_instances reports state changes under 'StartingInstances'
            new_state = response['StartingInstances'][0]['CurrentState']['Name']
            return 'Success'
        except ClientError as error:
            return error
    else:
        return 'Instance Already Running'
The same type of logic can be used to turn instances off when traffic is low.
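As a rough illustration of that shutdown path, here is a minimal sketch. It is not the code from our tutorial: the function name turn_instance_off is made up for this post, and the function takes the EC2 client as an argument (for example boto3.client('ec2')) instead of calling the build_client helper. It relies only on the describe_instances and stop_instances calls from boto3.

```python
def turn_instance_off(ec2, instance_id):
    """Stop instance_id if it is running.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client('ec2').
    The function name and return strings are illustrative only.
    """
    # Look up the current state of the instance
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    state = reservations[0]["Instances"][0]["State"]["Name"]

    if state != "running":
        return f"Instance is {state}; nothing to stop"

    # stop_instances reports state changes under 'StoppingInstances'
    response = ec2.stop_instances(InstanceIds=[instance_id])
    return response["StoppingInstances"][0]["CurrentState"]["Name"]
```

Passing the client in keeps the function easy to try against a stub before pointing it at a real account.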
...
2) Roll back to a previous deployment with CircleCI after an admin confirmation via text message.
When building a remediation tool, we found ourselves needing automated tasks mixed with ‘Human-in-the-loop’ interactions. Specifically, we wanted a program that would roll our production server back to the previous version based on a Datadog alert. Once the alert hit our backend, we also wanted to request approval by text message from our lead backend development team. Once a team member approves by replying to the text, the rollback kicks off via CircleCI.
So how does this work? First, we set up an event alert on Datadog that is linked to our Rollbar incident reporting for deployments. If this incident is marked as a bad_deploy, our trigger fires. Next, a Python script interprets the event and determines if a rollback is necessary:
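event = variables['Event']
title = event.get('Title')
try:
    if 'bad_deploy' in title:
        status = 'bad'
    else:
        status = 'good'
except TypeError:
    # A missing title is None, which does not support `in`; treat it as a good deploy
    status = 'good'
# Always store the result so later steps can read it
variables['status'] = status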
If a rollback is necessary, we use the Twilio API to send a text message to our backend dev team. If and when a dev responds with ‘approve’, CircleCI is set to roll back to the previous working version of our production system.
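To make the approval step more concrete, here is a minimal sketch of the gate between the text reply and CircleCI. It is an illustration under assumptions, not our production code: the function names are invented, the request is only built and never sent, and the rollback pipeline parameter is hypothetical (it would have to be declared in the project's CircleCI config). The URL follows CircleCI's v2 "trigger a new pipeline" endpoint.

```python
import json
import urllib.request

APPROVAL_WORD = "approve"  # the reply our team texts back


def is_approved(reply_text):
    """Return True only when the SMS reply is exactly the approval word."""
    return reply_text.strip().lower() == APPROVAL_WORD


def build_rollback_request(project_slug, token, branch="main"):
    """Build (but do not send) a CircleCI v2 'trigger pipeline' request.

    The `rollback` pipeline parameter is hypothetical; CircleCI only
    accepts parameters that the project's config declares.
    """
    url = f"https://circleci.com/api/v2/project/{project_slug}/pipeline"
    body = {"branch": branch, "parameters": {"rollback": True}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Circle-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def handle_reply(reply_text, project_slug, token):
    """Return a ready-to-send request if approved, otherwise None."""
    if is_approved(reply_text):
        return build_rollback_request(project_slug, token)
    return None
```

Requiring an exact match on the approval word means that an ambiguous reply such as "approve?" does not trigger a rollback.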
...
3) Terminate a deadlocked DB query and log the issues.
As we continue to scale, we have experienced unanticipated database issues such as deadlocks from long-running queries. This type of event can significantly degrade performance for our user base. Therefore, we wanted to build a process that logs the deadlocked process and then kills the query in an automated way (we determined this is better than user-wide degradation).
To do this, we set up Datadog alerts for high CPU usage (degraded status) or high memory usage on our database. When this alert hits WayScript, it kicks off a couple of processes.
First, we use Python and SQL to grab all currently running queries on our db (RDS on AWS). The first process builds a pandas DataFrame of the running queries' information, stores it in a file, and then emails the file to our backend dev team. The second process looks for queries that have exceeded an expected time threshold. These queries are passed to a third process, which kills them based on their RDS ID.
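The second and third processes can be sketched roughly as follows. This is a simplified illustration with several assumptions: the database is MySQL on RDS (so queries are killed with the mysql.rds_kill_query procedure), each row is a dict shaped like a row of information_schema.processlist, and the 300-second threshold and function names are made up for the example.

```python
THRESHOLD_SECONDS = 300  # hypothetical cutoff for a "long-running" query


def find_long_running(rows, threshold=THRESHOLD_SECONDS):
    """Return rows whose elapsed time exceeds the threshold.

    Each row is a dict with 'id', 'time' (seconds) and 'info' (SQL text),
    mirroring columns of MySQL's information_schema.processlist.
    Rows with no SQL text (idle connections) are skipped.
    """
    return [row for row in rows if row["info"] and row["time"] > threshold]


def build_kill_statements(rows):
    """Build one kill statement per row, assuming MySQL on RDS."""
    # int() guards against anything other than a numeric process ID
    return [f"CALL mysql.rds_kill_query({int(row['id'])})" for row in rows]
```

Keeping the selection and the kill step separate makes it possible to log or email the offending rows before anything is terminated.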
Example of Pulling the Running Queries:
import boto3
from botocore.exceptions import ClientError

def build_client():
    # execute_statement belongs to the RDS Data API, so the client is 'rds-data'
    rds = boto3.client(
        'rds-data',
        region_name='us-east-2',
        aws_access_key_id=context['key_id'],  # stored in .secrets
        aws_secret_access_key=context['key_secret']  # stored in .secrets
    )
    return rds

rds = build_client()
# The 'string' values are placeholders for your own schema, secret ARN, SQL, and transaction ID
response = rds.execute_statement(
    continueAfterTimeout=False,
    database='database-1',
    includeResultMetadata=False,
    resourceArn='aws:rds:us-east-<DB_ID>',
    schema='string',
    secretArn='string',
    sql='string',
    transactionId='string'
)
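The response from execute_statement carries its rows under 'records', where every field is a small dict with one typed key (for example 'stringValue' or 'longValue') or an 'isNull' flag. Below is a minimal sketch of flattening that into plain dicts before building the DataFrame. The helper names are ours, and because the call above sets includeResultMetadata=False, the column names are assumed to be supplied by the caller.

```python
def field_value(field):
    """Unwrap one Data API field, e.g. {'longValue': 7} becomes 7."""
    if field.get("isNull"):
        return None
    # Each field dict carries a single typed key such as 'stringValue'
    return next(iter(field.values()))


def records_to_rows(response, columns):
    """Turn response['records'] into a list of dicts keyed by column name.

    `columns` is supplied by the caller and must match the SELECT order.
    """
    return [
        dict(zip(columns, (field_value(f) for f in record)))
        for record in response.get("records", [])
    ]
```

The resulting list of dicts can be passed straight to pandas.DataFrame to produce the file that gets emailed to the team.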
...
WayScript lets users trigger processes from Datadog monitors and events to automatically address infrastructure needs. WayScript is a virtual development environment that runs scripts in Shell, Python, or JavaScript and connects with external services like EC2, CircleCI, and SQL Server.