Chaos Engineering is all about breaking things with a purpose: building resilient systems. Many chaos engineering tools offer predefined faults, but where do you start if you want to build custom chaos experiments to suit your infrastructure? And how do you design chaos experiments that are both effective and safe? Let’s break it down.
1. Define Clear Objectives
Answer this question: What are you testing?
Chaos experiments should have a specific goal. Before creating a custom experiment, ask:
- What failure scenario are we simulating? (CPU spikes, network latency, disk failures, etc.)
- What is the expected behavior of the system? (Auto-recovery, failover, degraded performance)
- What metrics define success or failure? (Response time, error rate, downtime)
For example: "I want to test how my application’s microservices handle sudden node failures and verify if traffic automatically shifts to healthy nodes."
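Success criteria like these can be encoded directly into the experiment itself. The snippet below is a rough sketch of a LitmusChaos httpProbe that continuously checks an application endpoint while the fault runs; the endpoint URL is a placeholder, and the exact probe field formats vary between Litmus versions:

# Hypothetical probe section of a ChaosEngine experiment spec
probe:
  - name: check-frontend-availability
    type: httpProbe
    mode: Continuous                 # evaluate throughout the chaos duration
    httpProbe/inputs:
      url: http://frontend.demo.svc.cluster.local:8080/healthz   # placeholder endpoint
      method:
        get:
          criteria: ==               # probe fails if the response code deviates
          responseCode: "200"
    runProperties:
      probeTimeout: 5s               # duration formats differ across Litmus releases
      interval: 2s
      retry: 1

If the probe fails while nodes are down, the experiment verdict captures that traffic did not shift to healthy nodes as expected.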
2. Choose the Right Disruption
Answer this question: For your application/system, what faults are relevant? What resources should be disrupted?
Every fault has a different consequence (or blast radius). Start with low-risk faults (such as network latency, CPU/memory spikes, or pod restarts) before moving on to more destructive tests (such as node failures, killing database instances, or packet loss in critical services).
Begin by executing chaos experiments in a staging environment, analyze the results, and gradually expand the impact to other environments (QA, pre-production, production).
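A gentle first experiment is Litmus’s pod-delete fault scoped to a single deployment. The manifest below is a minimal sketch; the namespace, app label, and service account names are placeholders, and it assumes the pod-delete ChaosExperiment and the Litmus operator are already installed:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos                  # placeholder name
  namespace: staging                 # start outside production
spec:
  appinfo:
    appns: staging
    applabel: app=nginx              # placeholder label selecting the target workload
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa # placeholder service account with experiment permissions
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"            # keep the chaos window short for a first run
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"         # graceful pod deletion keeps the blast radius small

Keeping the duration short and the deletion graceful makes it easy to observe recovery before scaling the experiment up.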
3. Ensure RBAC Is in Place
Chaos experiments are disruptive if left uncontrolled. Use RBAC policies to prevent unauthorized users from running high-risk experiments, so that only authorized engineers have permission to execute them.
The manifest below defines a Role named "chaos-engineer" that allows the create, get, and list operations on the specified LitmusChaos resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-engineer
rules:
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments"]
    verbs: ["create", "get", "list"]
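A Role grants nothing on its own; it must be bound to the engineers who are allowed to run experiments. A minimal RoleBinding sketch, where the user name and namespace are placeholders:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer-binding
  namespace: staging                 # placeholder; must be the namespace the Role lives in
subjects:
  - kind: User
    name: jane@example.com           # placeholder user (could also be a Group or ServiceAccount)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: chaos-engineer
  apiGroup: rbac.authorization.k8s.io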
4. Monitor System Behavior
When a chaos experiment is running, it is important to observe what happens during execution. Set up monitoring/observability tooling to track CPU, memory, and response time, and to detect unusual errors (a sample alert rule is sketched after the list below). This way, you can answer questions such as:
- Did auto-scaling trigger correctly?
- Did users experience downtime?
- Were alerts sent to engineers?
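One way to answer these questions automatically is to alert on key service-level metrics while chaos is running. The sketch below assumes the Prometheus Operator is installed and that the application exports a request counter named http_requests_total (both are assumptions; adjust to your own metrics):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: chaos-experiment-alerts
spec:
  groups:
    - name: chaos.rules
      rules:
        - alert: HighErrorRateDuringChaos
          # fires if more than 5% of requests fail over a two-minute window
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[2m]))
              / sum(rate(http_requests_total[2m])) > 0.05
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above 5% while a chaos experiment is running"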
5. Have a Recovery Plan in Place
Sometimes chaos experiments don’t go as planned and cause more disruption than intended. To handle such cases, set up rollback and recovery mechanisms beforehand, for example Kubernetes self-healing or restarting affected pods after the experiment completes.
The manifest below uses a Kyverno mutation policy to set restartPolicy: Always on pods at admission, so containers that crash during chaos are restarted automatically rather than lingering in a failed state and affecting other resources.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restart-failed-pods
spec:
  validationFailureAction: Enforce
  rules:
    - name: restart-on-failure
      match:
        resources:
          kinds: ["Pod"]
      mutate:
        patchStrategicMerge:
          spec:
            restartPolicy: Always
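Beyond self-healing, keep a documented abort path handy. In Litmus, a running experiment can be halted by setting the ChaosEngine’s engineState to stop; one way to do that is with a small merge patch (the engine name and namespace below are placeholders):

# stop-chaos.yaml -- merge patch that aborts a running ChaosEngine
# apply with: kubectl patch chaosengine nginx-chaos -n staging --type merge --patch-file stop-chaos.yaml
spec:
  engineState: stop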
Conclusion
- Define clear objectives → Know what failure you’re testing.
- Start small, then scale → Begin with low-risk chaos.
- Use RBAC → Restrict who can run experiments, and keep high-risk faults gated.
- Monitor system behavior → Observe metrics and alerts while the experiment runs.
- Always have a recovery plan → Set up rollback/recovery in case something doesn’t go as expected.
Want to create custom experiments but don’t know where to start? Join the Litmus Slack channel and check out the official documentation and blogs.