Keywords: Ansible | Task Retry | Automated Operations
Abstract: This article provides an in-depth exploration of Ansible's task retry mechanism, focusing on practical scenarios where database connection operations may fail after restart. It details how to use the retries, delay, and until parameters to build intelligent retry logic, comparing different implementation approaches to avoid playbook interruption on initial failure while ensuring proper failure triggering after multiple unsuccessful attempts. Through concrete code examples, the article demonstrates the integration of register variables with conditional checks, offering practical solutions for fault tolerance in automated operations.
Core Principles of Ansible Task Retry Mechanism
In automated operations, task execution may fail due to external dependency states, such as needing to wait for a database service to fully start after restart before performing subsequent operations. Ansible provides built-in retry mechanisms to handle such transient failures, but proper configuration is essential for expected behavior.
Problem Scenario Analysis
The specific issue involves database connection operations failing immediately after a restart because the service isn't yet fully available. Setting only retries: 3 and delay: 5, with no until condition, doesn't work as intended: the initial failure terminates the entire playbook. While ignore_errors: yes can suppress the error, the playbook would then report success even if every attempt fails, which doesn't meet practical requirements.
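To make the drawback of the ignore_errors approach concrete, here is a minimal sketch of it. The task name is illustrative, and the mysql command is only an assumed stand-in for whatever connection check the playbook performs.

```yaml
# Anti-pattern sketch: the error is suppressed, so the play carries on
# and ends without a failure even if the database never became reachable.
- name: Check database connection (failure is masked)
  command: /usr/bin/mysql -h localhost -e "SELECT 1"
  ignore_errors: yes
```

Any later task that depends on the database would then run against a service that may still be down.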
Solution: Retry Mechanism with Until Condition
The correct implementation combines the until parameter with a register variable. Here's a complete example:
    - command: /usr/bin/false
      retries: 3
      delay: 3
      register: result
      until: result.rc == 0
In this example:
- retries: 3 specifies up to 3 retries (4 total attempts including the initial execution)
- delay: 3 indicates a 3-second wait before each retry
- register: result saves the task execution result to the result variable
- until: result.rc == 0 defines the stopping condition: retry until return code (rc) equals 0
The execution displays:
    TASK [command] ******************************************************************************************
    FAILED - RETRYING: command (3 retries left).
    FAILED - RETRYING: command (2 retries left).
    FAILED - RETRYING: command (1 retries left).
    fatal: [localhost]: FAILED! => {"attempts": 3, "changed": true, "cmd": ["/usr/bin/false"], "delta": "0:00:00.003883", "end": "2017-05-23 21:39:51.669623", "failed": true, "rc": 1, "start": "2017-05-23 21:39:51.665740", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Key Parameter Details
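Retries and Until Collaboration: Ansible's retry mechanism is driven by the until parameter. retries only sets the maximum number of retries, while until defines the condition that stops retrying. In the Ansible versions this example was run on, a task without until does not retry even when retries is set; the official documentation indicates that ansible-core 2.16 and later instead retry until the task succeeds when until is omitted.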
Register Variable Role: By saving task execution results via register, these results can be referenced in until conditions for evaluation. Common evaluation approaches include:
- result.rc == 0: Based on return code
- result is not failed: Based on task status
- "success" in result.stdout: Based on output content
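The three condition styles listed above could look like this in a playbook. This is a sketch only: the task names, the retry counts, and the check_service.sh script are hypothetical, and the "success" marker is simply the example string from the list.

```yaml
# Condition based on the return code
- name: Retry until the return code is 0
  command: /usr/bin/mysql -h localhost -e "SELECT 1"
  register: result
  retries: 5
  delay: 3
  until: result.rc == 0

# Condition based on the task status
- name: Retry until the task is no longer reported as failed
  command: /usr/bin/mysql -h localhost -e "SELECT 1"
  register: result
  retries: 5
  delay: 3
  until: result is not failed

# Condition based on output content
- name: Retry until the output contains an expected marker
  command: /usr/local/bin/check_service.sh   # hypothetical script
  register: result
  retries: 5
  delay: 3
  # The whole expression is quoted so that YAML reads it as a single string.
  until: "'success' in result.stdout"
```

Note the quoting in the last task: an until value that begins with a quoted word, such as "success" in result.stdout, is not valid YAML unless the entire expression is wrapped in quotes.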
Practical Application Example
For the specific scenario of database connection after restart, implement as follows:
    - name: Wait for database to be ready
      command: /usr/bin/mysql -h localhost -e "SELECT 1"
      retries: 10
      delay: 5
      register: db_check
      until: db_check.rc == 0
This task attempts to connect to the database with up to 10 retries, 5 seconds apart. Retrying stops only when the connection succeeds (return code 0). If all attempts fail, the task is marked as failed and playbook execution terminates.
Comparison with Alternative Approaches
Compared to simply using ignore_errors: yes, this method offers:
- Precise Failure Control: Failure is marked only when maximum retries are reached without meeting the condition
- Clear Execution Logging: Retry process is explicitly displayed in output for debugging
- Avoids False Success Reporting: Doesn't mask actual failures by ignoring errors
The until: result is not failed approach provides another valid conditional check, directly evaluating task failure status without concern for specific return codes. This can be more concise in some cases but requires attention to Ansible version compatibility: older playbooks often wrote the same check as the filter result | failed, a form that was deprecated in Ansible 2.5.
Best Practice Recommendations
1. Set Retry Parameters Appropriately: Configure delay based on typical service startup times and retries based on tolerance levels
2. Use Explicit Evaluation Conditions: Prefer specific conditions (like return code checks) over generalized status evaluations
3. Combine with Other Waiting Mechanisms: For slow-starting services, consider combining with the wait_for module to first check port availability
4. Log Retry Details: Results saved via register include an attempts field to track actual retry counts
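Recommendations 3 and 4 can be combined in one short sequence, sketched below. Port 3306 (the default MySQL port) and the wait_for timing values are assumptions for illustration, not values taken from the original scenario.

```yaml
# Step 1: wait until the port accepts TCP connections.
- name: Wait for the database port to open
  wait_for:
    host: localhost
    port: 3306      # assumption: default MySQL port
    delay: 2        # seconds to wait before the first check
    timeout: 60     # give up after this many seconds

# Step 2: an open port does not guarantee the server answers queries,
# so retry a real query until it succeeds.
- name: Wait for database to be ready
  command: /usr/bin/mysql -h localhost -e "SELECT 1"
  register: db_check
  retries: 10
  delay: 5
  until: db_check.rc == 0

# Step 3: the registered result carries an attempts field.
- name: Report how many attempts were needed
  debug:
    msg: "Database ready after {{ db_check.attempts }} attempt(s)"
```

Checking the port first keeps the retry loop short in the common case, while the query-level check still catches a server that is listening but not yet serving requests.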
Conclusion
Ansible's retry mechanism, through the combination of retries, delay, and until parameters, provides flexible task fault tolerance. Proper use of these parameters can handle common issues like dependency service startup sequences and temporary network failures while maintaining robust automation workflows. The key is understanding the central role of the until condition in driving retries and how to access task execution results via register variables for conditional evaluation.