Chaos Engineering is all about breaking things with a purpose: building resilient systems. Many chaos engineering tools offer predefined faults, but where do you start if you want to build custom chaos experiments to suit your infrastructure? And how do you design chaos experiments that are both effective and safe? Let’s break it down.
1. Define Clear Objectives
Answer this question: What are you testing?
Chaos experiments should have a specific goal. Before creating a custom experiment, ask:
- What failure scenario are we simulating? (CPU spikes, network latency, disk failures, etc.)
- What is the expected behavior of the system? (Auto-recovery, failover, degraded performance)
- What metrics define success or failure? (Response time, error rate, downtime)
For example: "I want to test how my application’s microservices handle sudden node failures and verify if traffic automatically shifts to healthy nodes."
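Success criteria like these can be encoded directly into the experiment itself. The snippet below is a rough sketch of a LitmusChaos httpProbe that continuously checks an application endpoint while the fault runs; the endpoint URL is a placeholder, and the exact probe field formats vary between Litmus versions:

# Hypothetical probe section of a ChaosEngine experiment spec
probe:
  - name: check-frontend-availability
    type: httpProbe
    mode: Continuous                 # evaluate throughout the chaos duration
    httpProbe/inputs:
      url: http://frontend.demo.svc.cluster.local:8080/healthz   # placeholder endpoint
      method:
        get:
          criteria: ==               # probe fails if the response code deviates
          responseCode: "200"
    runProperties:
      probeTimeout: 5s               # duration formats differ across Litmus releases
      interval: 2s
      retry: 1

If the probe fails while nodes are down, the experiment verdict captures that traffic did not shift to healthy nodes as expected.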
2. Choose the Right Disruption
Answer this question: For your application/system, what faults are relevant? What resources should be disrupted?
Every fault has a different consequence (or blast radius). Start with low-risk faults (such as network latency, CPU/memory spikes, or pod restarts) before moving on to more destructive tests (such as node failures, killing database instances, or packet loss in critical services).
Begin by executing chaos experiments in a staging environment, analyze the results, and gradually expand the impact to other environments (QA, pre-production, production).
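A gentle first experiment is Litmus’s pod-delete fault scoped to a single deployment. The manifest below is a minimal sketch; the namespace, app label, and service account names are placeholders, and it assumes the pod-delete ChaosExperiment and the Litmus operator are already installed:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                  # placeholder name
  namespace: staging                 # start outside production
spec:
  appinfo:
    appns: staging
    applabel: app=nginx              # placeholder label selecting the target workload
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa # placeholder service account with experiment permissions
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"            # keep the chaos window short for a first run
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"         # graceful pod deletion keeps the blast radius small

Keeping the duration short and the deletion graceful makes it easy to observe recovery before scaling the experiment up.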
3. Ensure RBAC Is in Place
Chaos experiments are disruptive if left uncontrolled. Use RBAC policies to prevent unauthorized users from running high-risk experiments, so that only authorized engineers have permission to execute them.
The manifest below defines a Role named "chaos-engineer" that allows the create, get, and list operations on the specified LitmusChaos resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-engineer
rules:
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments"]
    verbs: ["create", "get", "list"]
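A Role grants nothing on its own; it must be bound to the engineers who are allowed to run experiments. A minimal RoleBinding sketch, where the user name and namespace are placeholders:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer-binding
  namespace: staging                 # placeholder; must be the namespace the Role lives in
subjects:
  - kind: User
    name: jane@example.com           # placeholder user (could also be a Group or ServiceAccount)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-engineer
  apiGroup: rbac.authorization.k8s.io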
4. Monitor System Behavior
When a chaos experiment is running, it is important to observe what happens during execution. Set up monitoring/observability tooling to track CPU, memory, and response time, and to detect unusual errors (a sample alert rule is sketched after the list below). This way, you can answer questions such as:
- Did auto-scaling trigger correctly?
- Did users experience downtime?
- Were alerts sent to engineers?
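One way to answer these questions automatically is to alert on key service-level metrics while chaos is running. The sketch below assumes the Prometheus Operator is installed and that the application exports a request counter named http_requests_total (both are assumptions; adjust to your own metrics):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-experiment-alerts
spec:
  groups:
    - name: chaos.rules
      rules:
        - alert: HighErrorRateDuringChaos
          # fires if more than 5% of requests fail over a two-minute window
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[2m]))
              / sum(rate(http_requests_total[2m])) > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above 5% while a chaos experiment is running"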
5. Have a Recovery Plan in Place
Sometimes chaos experiments don’t go as planned and cause more disruption than intended. To handle such cases, set up rollback and recovery mechanisms beforehand, for example Kubernetes self-healing or restarting affected pods after the experiment completes.
The manifest below uses a Kyverno mutation policy to set restartPolicy: Always on pods at admission, so containers that crash during chaos are restarted automatically rather than lingering in a failed state and affecting other resources.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restart-failed-pods
spec:
  validationFailureAction: Enforce
  rules:
    - name: restart-on-failure
      match:
        resources:
          kinds: ["Pod"]
      mutate:
        patchStrategicMerge:
          spec:
            restartPolicy: Always
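Beyond self-healing, keep a documented abort path handy. In Litmus, a running experiment can be halted by setting the ChaosEngine’s engineState to stop; one way to do that is with a small merge patch (the engine name and namespace below are placeholders):

# stop-chaos.yaml -- merge patch that aborts a running ChaosEngine
# apply with: kubectl patch chaosengine nginx-chaos -n staging --type merge --patch-file stop-chaos.yaml
spec:
  engineState: stop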
Conclusion
- Define clear objectives → Know what failure you’re testing.
- Start small, then scale → Begin with low-risk chaos.
- Use RBAC → Restrict who can run experiments, and keep high-risk faults gated.
- Monitor system behavior → Observe metrics and alerts while the experiment runs.
- Always have a recovery plan → Set up rollback/recovery in case something doesn’t go as expected.
Want to create custom experiments but don’t know where to start? Join the Litmus Slack channel and check out the official documentation and blogs.