KUSEH SIMON WEWOLIAMO

Posted on Feb 18

Avoiding Playbook Failures: Techniques to Handle Errors in Ansible Like an Expert

#ansible #devops #cloudcomputing

Article Outline

1. Introduction
2. Techniques for Handling Errors in Ansible
3. Summary of Error Handling in Ansible
4. Some Common Errors in Ansible Playbooks
5. Conclusion
6. References

1. Introduction: Why Error Handling Matters in Ansible

Ansible is an open source IT automation tool for configuration management, software provisioning and application deployment, and it is widely used among DevOps, Cloud and IT professionals. Ansible's main strengths are its simplicity, agentless architecture and ease of use compared to other tools in that domain. Despite these strengths, unexpected failures can and will occur due to unreachable hosts, failed commands, misconfiguration and similar problems. If these failures are not handled properly, they can disrupt entire deployments and cause downtime.

Writing robust playbooks with proper error handling is essential to ensure reliability and resilience in your Ansible automation workflows.
In this article I will take you through common techniques that can help you handle errors properly in your playbooks.

2. Techniques for Handling Errors in Ansible

There are several techniques for handling errors in Ansible, including ignore_errors, ignore_unreachable, force_handlers, failed_when, block and rescue, and any_errors_fatal. In this section we dive into some of these techniques and look at how and when to use each one.

Ignoring Failed Commands: When Failure is Acceptable

By default, when a task fails on a host, Ansible stops executing the remaining tasks on that host, while the play continues on the other hosts. If the task fails on every host, the playbook run ends.
The "ignore_errors" directive lets Ansible continue executing subsequent tasks despite the failure of a task. It only works when the task is able to run and returns a "failed" result; it does not cover undefined variables, connection failures, syntax errors or other execution problems. This can be useful in scenarios where certain failures are expected or do not impact the overall automation process.
You can use ignore_errors for non-critical tasks, testing and debugging, checks for optional dependencies and best-effort actions.
Below is an example of how to use ignore_errors.

    - name: Restart Apache (ignore failure)
      ansible.builtin.service:
        name: apache2
        state: restarted
      ignore_errors: yes  # task-level keyword, not a module argument
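
Ignoring a failure does not have to mean losing track of it. The sketch below assumes the same apache2 service as above, registers the result in a variable (apache_restart is a name chosen here purely for illustration) and prints a message only when the restart actually failed.

```yaml
- name: Try to restart Apache, but keep going if it fails
  ansible.builtin.service:
    name: apache2
    state: restarted
  register: apache_restart   # hypothetical variable name
  ignore_errors: true

- name: Report the ignored failure so it still shows up in the run output
  ansible.builtin.debug:
    msg: "Apache restart failed: {{ apache_restart.msg | default('no message returned') }}"
  when: apache_restart is failed
```

This keeps the play moving while leaving a visible trace of what went wrong.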

Handling Unreachable Hosts

Another type of failure that can occur in playbooks is an unreachable host. This happens when Ansible cannot establish a connection with the host, for example because of network issues, incorrect SSH credentials, or a server being down. By default, if a host becomes unreachable, Ansible stops executing the current task and the rest of the play on that host. You can use "ignore_unreachable" to handle a task failure caused by a host being 'UNREACHABLE'. With this directive, Ansible ignores the error for that task and continues to execute future tasks against the unreachable host.

Usage of ignore_unreachable in a task

  - name: This executes, fails, and the failure is ignored
    ansible.builtin.command: /bin/true
    ignore_unreachable: true

  - name: This executes, fails, and ends the play for this host
    ansible.builtin.command: /bin/true

Usage of ignore_unreachable in a playbook

  - hosts: all
    ignore_unreachable: true
    tasks:
      - name: This executes, fails, and the failure is ignored
        ansible.builtin.command: /bin/true

      - name: This executes, fails, and ends the play for this host
        ansible.builtin.command: /bin/true
        ignore_unreachable: false  # overrides the play-level setting for this task
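
If you want to see which hosts were skipped over, one possible approach is to register the result of a task that uses ignore_unreachable and test it afterwards. This is only a minimal sketch; the variable name probe is an assumption made for this example, and the debug task runs on the control node, so it does not need a working connection to the host.

```yaml
- name: Probe the host without ending the play if it cannot be reached
  ansible.builtin.ping:
  register: probe            # hypothetical variable name
  ignore_unreachable: true

- name: Note which hosts could not be reached
  ansible.builtin.debug:
    msg: "{{ inventory_hostname }} was unreachable during the probe"
  when: probe is unreachable
```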

Handlers and Failure: Triggering the Right Actions

In Ansible, handlers are a special type of task that only runs when notified by another task. They usually run at the end of each play and are commonly used to restart a service only after a task has actually changed something, such as its configuration. If a task notifies a handler but a later task fails in that same play, by default the handler will not run, leaving the host in an unexpected state. To override this default behavior you can use the "force_handlers" keyword in a play, the --force-handlers command-line option, or force_handlers = True in the ansible.cfg file, and Ansible will run all notified handlers even after a failure.

Usage of force_handlers in a playbook

- name: Demonstrate force_handlers
  hosts: webservers
  force_handlers: yes  # Ensures handlers run even if a task fails
  tasks:
    - name: Deploy a web application
      ansible.builtin.command: /usr/local/bin/deploy_app.sh
      notify: Stop Web Service  # Handler will be triggered

    - name: Simulate a failure
      ansible.builtin.command: /usr/local/bin/failing_task.sh
      ignore_errors: no  # This will fail the playbook

  handlers:
    - name: Stop Web Service
      ansible.builtin.service:
        name: nginx
        state: stopped

Defining Failure and Change: Controlling When a Task Fails

There are situations where you will want to explicitly trigger an error when certain conditions are met.
In Ansible, a command task is considered failed when it returns a non-zero exit code, but in some situations you may want to override this default behavior.
Ansible allows you to do so by defining custom failure conditions for a task with the "failed_when" directive.

The "failed_when" directive is commonly used for handling commands that return non-standard exit codes, failing tasks conditionally based on some logic,
and preventing false failures in debugging and logging. As with all conditionals in Ansible, you can combine multiple "failed_when" conditions with "and" and "or".

Handling Commands with Non-Standard Exit Codes

- name: Run a script that exits with code 2 on success
  ansible.builtin.command: /usr/local/bin/custom_script.sh
  register: script_output
  failed_when: script_output.rc != 0 and script_output.rc != 2

Handling Conditional task failures based on some logic

- name: Check available disk space
  ansible.builtin.shell: df -h / | awk 'NR==2 {print $5}' | sed 's/%//'
  register: disk_usage
  failed_when: disk_usage.stdout | int > 90
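
Combining conditions with "and" and "or" can be sketched as follows. A YAML list under failed_when is treated as an implicit "and", while "or" is written inline. The file paths and the healthcheck.sh script are hypothetical and only serve as placeholders.

```yaml
- name: Compare two files (diff exits 0 when identical, 1 when they differ)
  ansible.builtin.command: diff /etc/app/config.yml /etc/app/config.yml.new
  register: diff_result
  changed_when: false
  # Both list items must be true (implicit "and"): fail only on rc 2 or higher
  failed_when:
    - diff_result.rc != 0
    - diff_result.rc != 1

- name: Run a health check that may print ERROR while still exiting with 0
  ansible.builtin.command: /usr/local/bin/healthcheck.sh
  register: health
  changed_when: false
  # Either condition is enough to mark the task as failed ("or")
  failed_when: health.rc != 0 or 'ERROR' in health.stdout
```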

Aborting a Play on All Hosts: When to Stop Everything

When handling errors in Ansible playbooks, there may be situations where you want to abort the play on all hosts, or once a certain share of hosts has failed,
rather than letting it continue after a failure on a single host.
The "any_errors_fatal" and "max_fail_percentage" keywords let you stop playbook execution on all hosts when any host fails, or when the percentage of failed hosts exceeds a threshold, respectively.
In critical deployments, you may want to stop everything immediately to prevent further issues; in this case you need to use "any_errors_fatal"
in your playbook.

Stopping Execution on All Hosts if a Critical Task Fails

- name: Configure firewall rules across servers
  hosts: all
  any_errors_fatal: true
  tasks:
    - name: Apply firewall rules
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        destination_port: 22
        jump: ACCEPT

In the code snippet above, if one server fails to apply the rules, the playbook stops execution for all servers to prevent security inconsistencies.
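
For max_fail_percentage, a minimal sketch could look like the following. It assumes a hypothetical webservers group and a hypothetical package called myapp. With serial: 10 and max_fail_percentage: 30, Ansible aborts the rest of the play when more than 3 of the 10 hosts in a batch fail; the percentage has to be exceeded, not just equaled.

```yaml
- name: Rolling update that aborts when too many hosts fail
  hosts: webservers          # hypothetical inventory group
  become: yes
  serial: 10                 # work through the hosts in batches of 10
  max_fail_percentage: 30    # abort once more than 30% of a batch has failed
  tasks:
    - name: Update the application package
      ansible.builtin.apt:
        name: myapp          # hypothetical package name
        state: latest
```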

Controlling Errors with Blocks, Rescue, and Always

Ansible blocks are used to group related tasks logically, and all tasks in a "block" inherit directives applied at the block level.
Blocks also offer a way to handle errors that is similar to exception handling in common programming languages. They are used together
with "rescue" and "always" to handle errors. The rescue section contains a list of tasks to run whenever a task in the block fails. It will
only run when a task returns a "failed" state. Bad task definitions and unreachable hosts will not trigger the rescue section.
The always section runs every time, no matter what the status of the tasks in the block is.
When block, rescue and always are used together, they offer a structured way to handle complex task failures.

Installing Nginx with Apache2 as a fallback

- name: Install web server with error handling
  hosts: all
  become: yes
  tasks:
    - name: Attempt to install Nginx with fallback to Apache
      block:
        - name: Install Nginx
          ansible.builtin.apt:
            name: nginx
            state: present
            update_cache: yes
          register: nginx_install_status

      rescue:
        - name: Log failure and install Apache2 instead
          ansible.builtin.debug:
            msg: "Failed to install Nginx, installing Apache2 instead."

        - name: Install Apache2
          ansible.builtin.apt:
            name: apache2
            state: present

      always:
        - name: Clean up temporary files
          ansible.builtin.file:
            path: /tmp/install_logs
            state: absent
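
Inside a rescue section, Ansible also exposes the variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. The sketch below shows one way they might be used for reporting; the risky_step.sh script is hypothetical.

```yaml
- name: Report what failed inside a block
  hosts: all
  tasks:
    - name: Run a risky step and report any failure
      block:
        - name: Run a command that may fail
          ansible.builtin.command: /usr/local/bin/risky_step.sh
      rescue:
        - name: Show the failed task and its message
          ansible.builtin.debug:
            msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"
```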

3. Summary of Error Handling in Ansible

| Scenario | Error Handling Method | Best Practice |
|---|---|---|
| Task fails but should not stop the playbook | ignore_errors: yes | Use when failure is non-critical, but log output for debugging. |
| Host is unreachable but the play should continue | ignore_unreachable: yes | Useful in large environments with occasional host failures. |
| Define custom failure conditions | failed_when | Use when command output doesn't align with default error codes. |
| Play should stop if any critical task fails | any_errors_fatal: true | Use for high-risk changes like database updates. |
| Catch task failure and attempt alternative action | block, rescue, always | Ensures graceful error handling with a fallback mechanism. |
| Report a change only when something really changed | changed_when | Avoids triggering handlers when no real changes occur. |
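
Since changed_when appears in the summary without an example, here is a minimal sketch of how it might be used. The migrate.sh script, its "migrated" output marker and the "Restart app" handler are all assumptions made for illustration; the handler would need to be defined in the same play.

```yaml
- name: Read the kernel version (a read-only command never changes anything)
  ansible.builtin.command: uname -r
  register: kernel_version
  changed_when: false

- name: Run a migration script and report a change only when it did some work
  ansible.builtin.command: /usr/local/bin/migrate.sh   # hypothetical script
  register: migrate
  # Assumes the script prints "migrated" only when it modified something
  changed_when: "'migrated' in migrate.stdout"
  notify: Restart app                                   # hypothetical handler
```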

4. Some Common Errors in Ansible Playbooks

Syntax Errors: These occur when there is erroneous YAML formatting or incorrect Ansible syntax.
Module Errors: Caused by calling Ansible modules with incorrect parameters or by incompatible module versions.
Connection Errors: Problems with SSH connections to remote hosts.
Resource Unavailability: A necessary resource is either missing or unavailable.
Task Failures: Tasks may fail because they do not meet particular conditions or are executed in an inappropriate execution environment.
Variable Errors: Variables that are undefined or wrongly declared, causing the process to fail.
Dependency Errors: Errors caused by missing dependencies between tasks, roles, or playbooks.
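
Variable errors in particular can often be caught early. One possible approach, sketched below, is to check required variables at the start of a play with the ansible.builtin.assert module; app_version is a hypothetical variable name used only for this example.

```yaml
- name: Fail early with a clear message if a required variable is missing
  ansible.builtin.assert:
    that:
      - app_version is defined   # hypothetical required variable
    fail_msg: "app_version must be set, for example with -e app_version=1.2.3"
```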

5. Conclusion: Making Your Ansible Playbooks More Resilient

Overall, effective error handling in Ansible is essential for building resilient and fault-tolerant automation workflows.
By using directives like ignore_errors for non-critical failures, failed_when for custom conditions, and block/rescue/always for structured recovery, you can ensure that your playbooks handle errors without disrupting the entire deployment.
That's all for now. Until we meet again in my next article, bye for now.

6. References: More Reading

https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
https://medium.com/@vinoji2005/day-13-error-handling-in-playbooks-ensuring-robust-ansible-automation-%EF%B8%8F-2abb62cd52f9
https://spacelift.io/blog/ansible-handlers

DEV Community