It can be a learning experience, or at least entertaining, to hear about other people's adventures managing distributed systems at scale. We recently experienced a HashiCorp Vault outage, and I thought it would be fun to share the story along with a few lessons learned and tips. Looking back, it's quite interesting how different things can align so perfectly to produce such an obscure issue, as you'll see.
Environment Overview
My company uses the full suite of HashiCorp tools, and quite honestly we're really big fans. Their tools have allowed us to transition to modern infrastructure practices in on-prem data center and cloud environments. This has led to improved application deployments and an overall better hosting experience. At the heart of our environment we use Nomad for container and non-containerized application orchestration, Vault for secrets storage and dynamic PKI, and Consul for service discovery and for Vault's storage backend. The total environment is fairly large with about 8400 active jobs consisting of over 60,000 containers and IIS app pools running on the Nomad clusters. Consul clusters have about 10,000 services in the catalog and about 40,000 unique services being health checked.
Issue
One evening a few weeks back, we began upgrading the Nomad agent on several pools of Nomad clients (node classes) in one of our data centers. Soon after, we noticed Vault handle time metrics (`vault.core.handle_request`) beginning to elevate. This has been a somewhat recurring pattern in this specific environment for the past year, especially during maintenance activities that generate extra load.
Latencies continued to worsen as time passed, however. Soon after, we realized that Vault's mean handle time had gone from milliseconds to minutes(!), which was something we definitely hadn't seen before.
Nomad's logging rate began increasing, full of entries like these snipped errors:
failed to create an alloc vault token: Post "/v1/auth/token/create/nomad-cluster": context deadline exceeded
failed to create an alloc vault token: Post "/v1/auth/token/create/nomad-cluster": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
At this point, whatever was occurring wasn't super impactful yet, since all jobs have multiple tasks running. We decided to engage HashiCorp Support (we're enterprise customers) and kept digging and collecting logs.
The challenge was identifying which system was at fault: Consul or Vault. Today in our environment the two are coupled very closely, since we're using the Consul storage backend for Vault. This means Vault's persisted data sits (encrypted) on top of Consul's KV store. We've planned to move away from this architecture so that Vault has its own dedicated cluster (or, now, raft integrated storage), but it hasn't happened yet.
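For context on how tight that coupling is: every Vault write becomes a Consul raft write. A quick way to see it from a Consul server is to list Vault's keys in the KV store; this assumes the default `vault/` path prefix and an ACL token that can read it.

```bash
# Vault-on-Consul: Vault's encrypted data lives under a KV prefix (default "vault/"),
# so slowness in Consul's raft/KV path shows up directly as Vault request latency.
consul kv get -keys vault/
# expect prefixes such as vault/core/, vault/logical/, vault/sys/
```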
We continued poring over VM-level metrics, trying to identify a cause while ruling out symptoms and red herrings. There was essentially zero IO wait, CPU usage for Vault was 30% or less, and Consul was 60% or less. We focused on disk performance, since the limiting factor with Consul is generally disk write performance from appending entries to the raft log. Disk IOPS on the Consul cluster were fairly low, even lower than in our other data centers.
Nothing too wild was being reported for disk latencies either.
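For what it's worth, the disk investigation was mostly quick spot checks of the usual suspects, something along these lines (shown as an illustration rather than our exact runbook):

```bash
# per-device IOPS, await (latency), and utilization, sampled every second for 5 samples
iostat -dx 1 5

# overall iowait, run queue, and blocked processes
vmstat 1 5
```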
By now, we were even more stumped. HashiCorp Support had Development involved, and their biggest worry was the Consul commit and Vault handle times being so elevated. Our concern was that there was no way for the growing thundering herd of Nomad token renewals and creates to ever get serviced, which just compounded the issue even more.
At this point, we suspected an internal issue with Consul or Vault even more. Something was bottlenecking operations somewhere, and it no longer seemed like hardware or the VMs. Support was still concerned about the time spent writing to the raft log, specifically `consul.raft.fsm.apply` and `consul.raft.commitTime`.
The next theory was that things were hanging because of the sheer number of Vault revocations (tokens and PKI certs) resulting from the earlier Nomad maintenance. Support brought up a new feature in Vault that puts back pressure on the expiration manager. In Vault 1.6.x, this can be enabled with an environment variable (`VAULT_16_REVOKE_PERMITPOOL=200`), and 1.7.x includes this feature by default.
After learning about this new revocation back-pressure mechanism, we elected to try an emergency Vault upgrade from 1.5.4 to 1.6.3. We also turned on the new environment variable in the systemd unit file for Vault. Once the dust settled, we saw a slight improvement, but still... no great recovery yet.
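If you're wondering what setting that environment variable looks like in practice, here's a rough sketch using a systemd drop-in. The override path and restart flow are assumptions for illustration; your unit layout and unseal workflow may differ.

```bash
# add the revocation back-pressure env var via a systemd drop-in (illustrative path)
sudo mkdir -p /etc/systemd/system/vault.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/vault.service.d/revoke-permitpool.conf
[Service]
Environment=VAULT_16_REVOKE_PERMITPOOL=200
EOF

# reload systemd and restart Vault; remember a restarted Vault node must be unsealed
# again unless you're using auto-unseal
sudo systemctl daemon-reload
sudo systemctl restart vault
```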
Discovery
Periodically throughout the event we collected `consul debug` and `vault debug` archives, which are really convenient ways to package up logs and profiling dumps. Support reviewed those dumps multiple times, trying to identify what was causing the slowness.
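If you haven't used these before, both CLIs can capture an archive directly from a server; the durations, intervals, and output paths below are just examples.

```bash
# capture Consul logs, metrics, and pprof data for 2 minutes, sampling every 30 seconds
consul debug -duration=2m -interval=30s -output=/var/tmp/consul-debug

# same idea for Vault (requires a Vault token with sufficient permissions)
vault debug -duration=2m -interval=30s -output=/var/tmp/vault-debug
```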
After Development reviewed the pprofs further, someone brought up a potential cause: metrics. Specifically, there seemed to be a high amount of time being spent in a Datadog metrics library that's built into Consul.
Wait... what? We weren't using Datadog for monitoring, so what was going on?
To get real-time metrics from our Nomad, Consul, and Vault clusters, we leverage Prometheus to scrape a Telegraf agent on every VM. That's been a convenient way to collect VM-level metrics (CPU, memory, disk/IO, etc.) plus any other service metrics on the VM. Having Consul spew statsd-style metrics to the local Telegraf agent has been working great.
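One nice property of that pipeline is that you can eyeball exactly what Prometheus will scrape right on the VM. This assumes Telegraf's prometheus_client output is listening on its default port 9273; adjust for your setup.

```bash
# peek at the Consul metrics Telegraf is exposing for Prometheus to scrape
curl -s http://localhost:9273/metrics | grep -i 'consul_raft' | head
```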
Consul Development began piecing things together and found a known issue in the Datadog metrics library that was causing blocking in the Consul process. It was an unpublished issue and was patched in Consul 1.7.6 by bumping the go-metrics/Datadog version. In our Consul configuration, we had the telemetry stanza configured with `dogstatsd_addr` pointing to the localhost Telegraf port. We couldn't remember exactly why we were using the dogstatsd output versus straight statsd (or even Prometheus). There was a vague memory of wanting to add additional tags to each metric, but that clearly wasn't needed since Telegraf supports `[global_tags]` in its configuration.
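If the difference between the two outputs isn't familiar, it's just the wire format: dogstatsd is statsd with tags appended to each line. A rough illustration, using a made-up value and the tag from our config (nc flags vary a bit between netcat variants):

```bash
# plain statsd: name:value|type
printf 'consul.raft.commitTime:7|ms\n' | nc -u -w1 localhost 8125

# dogstatsd: the same line with |#key:value tags appended -- the tagging was the only
# reason we'd wanted it, and Telegraf's [global_tags] covers that anyway
printf 'consul.raft.commitTime:7|ms|#env_name:production\n' | nc -u -w1 localhost 8125
```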
Resolution
To see if this could truly be causing our issues, we updated the telemetry stanza on the Consul servers one at a time to include just statsd output.
Previous telemetry configuration:
"telemetry": {
"disable_hostname": true,
"dogstatsd_addr": "localhost:8125",
"dogstatsd_tags": [
"env_name:production"
]
},
Updated telemetry configuration:
"telemetry": {
  "disable_hostname": true,
  "statsd_address": "localhost:8125"
},
The Consul leader was restarted last, and once that completed, we had near-instant recovery! Thank goodness! The `consul.raft.commitTime` mean dropped from 120-200ms down to 7-9ms.
Looking through the Consul code, `github.com/armon/go-metrics/datadog` imports the official `github.com/DataDog/datadog-go`. I found a PR created back in October 2019(!) that fixed a blocking issue in instrumented applications (like Consul). This fix ultimately ended up in Consul 1.6.7, released in July 2020.
Lessons Learned
I wanted to include a few lessons learned on various topics that hopefully can save you some time and frustrations in the future.
Important Metrics
Here are a couple of interesting metrics to keep an eye on (a quick way to spot-check them follows the list):
- `consul.raft.fsm.apply` - This metric tipped HashiCorp off that something was backing up or bottlenecking when writing raft logs. Before the fix, we were in the 20-40 range across all servers, but it fell to 1-2 after the fix. This is how many logs were committed since the last interval.
- `consul.raft.commitTime` - HashiCorp says this shouldn't be more than a few milliseconds. In our environments and with our storage solution, I think 10-25ms is more realistic. Consider throwing alerts, because something is wrong above, say, 30 or 50ms.
- `consul.catalog.register` - This tells you how long it takes to register a service into the catalog and should be very similar to the Consul commit time. A goal for mean time could be 10-20ms.
- `vault.core.handle_request` - Mean time to handle requests; in a healthy cluster this looks to be 0.5-10ms.
- `vault.runtime.num_goroutines` - When Consul was slow/blocking, the Vault goroutines continued climbing. It looks like on a healthy cluster these should generally stay under 2,000.
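A low-tech way to spot-check these without going through your metrics pipeline is the agents' own metrics endpoints (Consul's /v1/agent/metrics, and Vault has a similar /v1/sys/metrics). The jq filter here is just an example:

```bash
# pull the latest in-memory telemetry from the local Consul agent and pick out the raft timers
curl -s http://localhost:8500/v1/agent/metrics | \
  jq '.Samples[] | select(.Name | test("raft.commitTime|raft.fsm.apply")) | {Name, Count, Mean, Max}'
```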
Quick Response
When troubleshooting issues like this, centralized metrics and logs are invaluable. It's often necessary to bundle up debug archives or logs to get into Support's hands. Here's a quick Ansible playbook to gather Consul logs from one or multiple servers at once; the logs end up in a single archive.
- name: get consul logs
  hosts: all
  vars:
    date: "{{ ansible_date_time.date }}"
  tasks:
    - name: remote - grep consul logs on hosts
      shell: "egrep 'consul:' /var/log/messages > /var/tmp/consul.log.{{ ansible_hostname }}"
    - name: remote - change log file permissions
      file:
        path: "/var/tmp/consul.log.{{ ansible_hostname }}"
        mode: 0775
    - name: remote - pull consul logs locally
      fetch:
        src: "/var/tmp/consul.log.{{ ansible_hostname }}"
        dest: "/var/tmp/consul.log.{{ ansible_hostname }}"
        flat: yes
      become: no # fetch needs become=no if pulling large files, otherwise MemoryError
    - name: local - create tarball
      archive:
        path:
          - /var/tmp/consul.log.*
        dest: /var/tmp/consul_logs-{{ date }}.tgz
      delegate_to: 127.0.0.1
      run_once: true
      become: no
    - name: local - remove leftover local files
      file:
        path: "/var/tmp/consul.log.{{ ansible_hostname }}"
        state: absent
      delegate_to: 127.0.0.1
      become: no
    - name: remote - remove leftover remote files
      file:
        path: "/var/tmp/consul.log.{{ ansible_hostname }}"
        state: absent
    - debug:
        msg: "Logs are ready at /var/tmp/consul_logs-{{ date }}.tgz"
Improve Your Profiling Skillset
If hardware has been ruled out, it may be time to transition to profiling and look at the application internals. It's important to be able to answer, "What is my application spending its time doing?" In Go-based apps, `pprof` is the tool of choice, but it can be confusing if you don't use it often. I'm sure HashiCorp Development was using pprof along with metrics to understand where the slowness was coming from, and I wish that's something we had done ourselves sooner.
Here's the Consul flame graph before the change. The purple highlights the regex `metrics|datadog|statsd`. Note how wide (how much time is spent in) the go-metrics and datadog stacks are!
Here's the Consul profile after the remediation. Note how much less purple there is!
The above pprof profiles were viewed by running the following command:
go tool pprof -http=":8000" consul profile.prof
In the above command, `consul` is a binary of the exact same Consul version, sitting in my current directory, and `profile.prof` came from the `consul debug` output.
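A couple of other views are worth knowing about; the file names here assume the layout of a `consul debug` capture, so yours may differ slightly.

```bash
# text-mode view: cumulative CPU time per function, no browser needed
go tool pprof -top -cum consul profile.prof

# goroutine dump from the same capture -- handy for spotting where things are blocked
go tool pprof -http=":8001" consul goroutine.prof
```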
Unsolved Issues
The last bit of advice is not to live with unsolved issues. Looking back, we should have better understood those Consul commit times and really learned what was considered high and what normal should look like. Phantom latency can come back to haunt you, like it did for us! Always try to understand your metrics and logs better, clean up recurring errors, mock up tests if needed, and prove out theories. You and your team will be much better off in the long run.