Keywords: Apache Airflow | Parameter Passing | CLI Trigger | DAG Configuration | Workflow Orchestration
Abstract: This article provides a comprehensive exploration of various methods for passing parameters when manually triggering DAGs via CLI in Apache Airflow. It begins by introducing the core mechanism of using the --conf option to pass JSON configuration parameters, including how to access these parameters in DAG files through dag_run.conf. Through complete code examples, it demonstrates practical applications of parameters in PythonOperator and BashOperator. The article also compares the differences between --conf and --tp parameters, explaining why --conf is the recommended solution for production environments. Finally, it offers best practice recommendations and frequently asked questions to help users efficiently manage parameterized DAG execution in real-world scenarios.
Overview of Parameter Passing Mechanisms in Apache Airflow
Apache Airflow, as a powerful workflow orchestration platform, provides flexible DAG (Directed Acyclic Graph) definition and execution capabilities. In real production environments, there is often a need to pass dynamic parameters when manually triggering DAGs to meet requirements such as data reprocessing and test validation. This article delves deeply into the parameter passing mechanisms when manually triggering DAGs via CLI.
Core Parameter Passing Method: The --conf Option
Airflow's trigger_dag command supports the --conf option, allowing users to pass configuration parameters in JSON format. This is currently the most recommended approach for production environments for the following reasons:
- Broad Support: Natively supported by managed services like Google Cloud Composer
- Architectural Stability: Continuously maintained as a core Airflow feature
- High Flexibility: Supports complex JSON data structures
CLI Command Example
The syntax for passing parameters via the command line (shown here with the Airflow 1.x command name) is as follows:
airflow trigger_dag 'example_dag' -r 'manual_run_001' --conf '{"start_time": "2024-01-01T01:30:00", "end_time": "2024-01-02T01:30:00", "data_source": "backup"}'
In this example, we pass three parameters: start_time, end_time, and data_source. These parameters will be encapsulated in the conf attribute of the DagRun object.
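Note that Airflow 2.x renamed the command to airflow dags trigger (and -r to --run-id), while the --conf option behaves the same. A small sketch that validates the JSON payload locally before handing it to Airflow (the DAG id and run id are the example values from above):

```shell
# Build the conf payload once so it can be validated and reused
CONF='{"start_time": "2024-01-01T01:30:00", "end_time": "2024-01-02T01:30:00", "data_source": "backup"}'

# Fail fast on malformed JSON instead of letting Airflow reject it later
echo "$CONF" | python3 -m json.tool > /dev/null && echo "conf is valid JSON"

# Airflow 1.x syntax:
#   airflow trigger_dag 'example_dag' -r 'manual_run_001' --conf "$CONF"
# Airflow 2.x syntax:
#   airflow dags trigger 'example_dag' --run-id 'manual_run_001' --conf "$CONF"
```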
Parameter Access in DAG Files
In DAG definition files, parameters can be accessed in multiple ways:
Accessing Parameters in PythonOperator
from airflow import DAG
# Airflow 2.x path: from airflow.operators.python import PythonOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def process_data(**kwargs):
    """Python function for processing data"""
    dag_run = kwargs['dag_run']
    # Access passed parameters
    start_time = dag_run.conf.get('start_time')
    end_time = dag_run.conf.get('end_time')
    data_source = dag_run.conf.get('data_source', 'default')
    print(f"Processing time range: {start_time} to {end_time}")
    print(f"Data source: {data_source}")
    # Actual data processing logic
    # ...

# DAG definition
default_args = {
    'owner': 'data_team',
    'start_date': datetime(2024, 1, 1),
}

dag = DAG(
    'data_processing_dag',
    default_args=default_args,
    schedule_interval='30 1 * * *',  # Execute daily at 01:30
    catchup=False
)

process_task = PythonOperator(
    task_id='process_data_task',
    python_callable=process_data,
    provide_context=True,  # Required in Airflow 1.x; in 2.x the context is passed automatically
    dag=dag
)
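Because the callable only touches kwargs['dag_run'].conf, its parameter handling can be exercised outside Airflow with a minimal stand-in object. The stub below is an illustration for local testing, not an Airflow API:

```python
from types import SimpleNamespace

def process_data(**kwargs):
    """Same parameter-access logic as the DAG callable above."""
    dag_run = kwargs['dag_run']
    start_time = dag_run.conf.get('start_time')
    end_time = dag_run.conf.get('end_time')
    data_source = dag_run.conf.get('data_source', 'default')
    return start_time, end_time, data_source

# A stand-in for the real DagRun, carrying only the conf dict
fake_run = SimpleNamespace(conf={'start_time': '2024-01-01T01:30:00',
                                 'end_time': '2024-01-02T01:30:00'})
print(process_data(dag_run=fake_run))
# → ('2024-01-01T01:30:00', '2024-01-02T01:30:00', 'default')
```

This makes the default-value behavior (data_source falling back to 'default') easy to verify without deploying the DAG.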
Using Parameters in Template Fields
Airflow supports direct parameter references in template fields, which is particularly useful for scenarios requiring command-line parameters such as BashOperator and SSHOperator:
# Airflow 2.x path: from airflow.operators.bash import BashOperator
from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    task_id='data_extraction',
    bash_command='''
    echo "Starting data extraction"
    python extract_data.py \
        --start "{{ dag_run.conf["start_time"] }}" \
        --end "{{ dag_run.conf["end_time"] }}" \
        --source "{{ dag_run.conf.get("data_source", "primary") }}"
    ''',
    dag=dag
)
Handling Parameter Default Values
In practical applications, it's important to gracefully handle missing parameters:
from datetime import datetime, timedelta

def process_with_defaults(**kwargs):
    dag_run = kwargs['dag_run']
    # Use get() to supply default values for missing keys
    start_time = dag_run.conf.get(
        'start_time',
        (datetime.utcnow() - timedelta(days=1)).isoformat()
    )
    end_time = dag_run.conf.get('end_time', datetime.utcnow().isoformat())
    # Or use an explicit membership check; prefer a local variable over
    # mutating dag_run.conf, since such mutations are not persisted
    if 'data_source' not in dag_run.conf:
        data_source = 'default_source'
    else:
        data_source = dag_run.conf['data_source']
    # Processing logic...
Comparison with --tp Parameter
Although the older airflow test command supports a -tp (--task_params) option for injecting parameters, this approach has limitations:
- Only suitable for testing individual tasks, not complete DAG execution
- Lacks production environment support
- Parameter passing mechanism is not flexible enough
Therefore, for manual triggering scenarios in production environments, using the --conf option is strongly recommended.
Best Practice Recommendations
- Parameter Validation: Add parameter validation logic in DAGs to ensure the legality of passed parameters
- Error Handling: Add appropriate exception handling for parameter access
- Documentation: Clearly document supported parameters and their formats in DAG definitions
- Security: Avoid passing sensitive information in parameters; use Airflow's Variables or Connections to store confidential data
- Version Compatibility: Be aware of differences in parameter passing support across different Airflow versions
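The validation and error-handling recommendations above can be sketched as a small helper that runs at the start of the callable. The required-key list and error message are illustrative assumptions, not an Airflow convention:

```python
def validate_conf(conf, required_keys=('start_time', 'end_time')):
    """Check that the trigger conf contains the required keys.

    Raises ValueError with a descriptive message, failing the task
    early instead of letting a missing key surface mid-pipeline.
    """
    conf = conf or {}  # conf can be None/empty when no --conf was passed
    missing = [k for k in required_keys if k not in conf]
    if missing:
        raise ValueError(f"Missing required conf keys: {', '.join(missing)}")
    return conf

# Valid conf passes through unchanged
validate_conf({'start_time': '2024-01-01', 'end_time': '2024-01-02'})

# A missing key raises immediately with a clear message
try:
    validate_conf({'start_time': '2024-01-01'})
except ValueError as exc:
    print(exc)  # → Missing required conf keys: end_time
```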
Frequently Asked Questions
Q: What data types can be used in parameters?
A: Parameters passed via --conf support all data types supported by JSON, including strings, numbers, booleans, arrays, and objects.
Q: How to pass parameters containing special characters?
A: Escape special characters inside the JSON string (for example, a literal double quote becomes \"), and mind the surrounding shell quoting: wrapping the whole JSON payload in single quotes avoids most shell-expansion issues.
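Rather than escaping by hand, the payload can be built with Python's json.dumps, which handles quoting automatically. A minimal sketch with illustrative parameter values:

```python
import json

params = {
    'query': 'SELECT * FROM logs WHERE msg = "error"',  # embedded double quotes
    'path': 'C:\\data\\input',                          # backslashes
}

# json.dumps produces a correctly escaped string, ready for --conf
# (wrap it in single quotes when passing it on the shell command line)
conf = json.dumps(params)
print(conf)

# Round-trip check: the escaped string decodes back to the original values
assert json.loads(conf) == params
```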
Q: Is there a size limit for parameters?
A: The conf payload is stored in Airflow's metadata database, so while there is no hard documented size limit, large payloads are discouraged. It is recommended to store large data in external storage and pass only a reference (such as a path or URI) through the parameters.
Conclusion
Passing parameters via the --conf option when manually triggering DAGs is a powerful and flexible feature in Apache Airflow. Proper use of this mechanism can significantly enhance the flexibility and maintainability of workflows. The examples and best practices provided in this article can help developers efficiently implement parameterized DAG execution in real-world projects.