Keywords: Flask | Gunicorn | Concurrency | WSGI | Workers
Abstract: This article provides an in-depth analysis of concurrent request handling capabilities in Flask applications under different deployment configurations. It examines the single-process synchronous model of Flask's built-in development server, then focuses on Gunicorn's two worker models: default synchronous workers and asynchronous workers. By comparing concurrency mechanisms across configurations, it helps developers choose appropriate deployment strategies based on application characteristics, offering practical configuration advice and performance optimization directions.
Concurrency Limitations of Flask Development Server
When the Flask framework's built-in development server is started via the app.run() method with threading disabled, it employs a single-process synchronous processing model: at any given moment the server can handle only one HTTP request, processing requests sequentially and beginning the next only after completely finishing the current one. (Since Flask 1.0 the development server enables a thread-per-request mode by default, which relaxes this limit somewhat, but it remains a single-process tool intended only for development.) This design simplifies development-environment setup but severely limits concurrent processing capability.
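Flask's development server is built on Werkzeug; its synchronous behavior can be illustrated with the standard library's wsgiref server, which likewise handles one request at a time. A minimal sketch (no Flask required) showing two concurrent clients being serialized:

```python
import threading
import time
import urllib.request
from wsgiref.simple_server import make_server

def slow_app(environ, start_response):
    # Simulate a handler that blocks for 0.3 s (e.g. a slow query).
    time.sleep(0.3)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"done"]

# wsgiref's simple server is single-threaded and synchronous:
# a second request waits until the first one finishes.
server = make_server("127.0.0.1", 0, slow_app)
port = server.server_port
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch():
    urllib.request.urlopen(f"http://127.0.0.1:{port}/").read()

start = time.monotonic()
clients = [threading.Thread(target=fetch) for _ in range(2)]
for c in clients:
    c.start()
for c in clients:
    c.join()
elapsed = time.monotonic() - start  # ~0.6 s: the requests were serialized
server.shutdown()
```

Even though the two clients connect at the same time, the total wall time is roughly the sum of both handlers' durations, not their maximum.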
Gunicorn Synchronous Worker Model
In production environments, Gunicorn is commonly used as a WSGI server to deploy Flask applications. Gunicorn defaults to using the sync worker type, where each worker is an independent operating system process. When configured with 4 workers, Gunicorn creates 4 separate process instances, each running a complete copy of the Flask application.
In this configuration, each worker behaves much like Flask's built-in development server, handling only one request at a time. Because the workers are independent processes, however, the system as a whole can serve multiple requests concurrently: a configuration with 4 workers can handle up to 4 requests simultaneously. Gunicorn follows a pre-fork model: the master process does not proxy requests itself but manages the worker pool, while all workers accept connections from a shared listening socket and the operating system distributes incoming connections among whichever workers are free.
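A typical synchronous-worker deployment can be captured in a gunicorn.conf.py file. A sketch (the bind address, timeout, and the app:app module path used below are illustrative assumptions, not values from this article):

```python
# gunicorn.conf.py -- sketch of a sync-worker configuration
bind = "0.0.0.0:8000"
workers = 4            # 4 independent processes => at most 4 concurrent requests
worker_class = "sync"  # the default; shown explicitly for clarity
timeout = 30           # restart a worker stuck on a single request
```

It would be launched with something like `gunicorn --config gunicorn.conf.py app:app`.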
Gunicorn Asynchronous Worker Model
Gunicorn also supports asynchronous worker types, including eventlet and gevent (each must be installed separately). These workers use coroutines rather than threads to achieve concurrency. The --worker-class parameter selects the worker type, for example: gunicorn --workers=4 --worker-class=gevent app:app.
The key advantage of asynchronous workers is that each process can handle multiple concurrent requests internally. Although each process still has only one execution thread, through coroutine technology, when a request awaits I/O operations (such as database queries or network requests), that coroutine can be paused and switched to coroutines handling other requests. This non-blocking I/O model significantly improves the concurrent processing capacity of individual processes.
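gevent achieves this by patching blocking calls so that a coroutine yields during I/O; the underlying idea — overlapping I/O waits instead of serializing them — can be sketched with the standard library's asyncio (a stand-in for illustration, not gevent itself):

```python
import asyncio
import time

async def handle_request(n: int) -> str:
    # Simulate a 0.3 s I/O wait (database query, upstream HTTP call, ...).
    # While this coroutine is suspended, the event loop runs the others.
    await asyncio.sleep(0.3)
    return f"response {n}"

async def main() -> float:
    start = time.monotonic()
    # Ten "requests" handled in a single thread; the I/O waits overlap.
    results = await asyncio.gather(*(handle_request(i) for i in range(10)))
    assert len(results) == 10
    return time.monotonic() - start

elapsed = asyncio.run(main())  # ~0.3 s total, not 10 * 0.3 s
```

One thread serves ten requests in roughly the time of one, which is exactly the property that makes asynchronous workers attractive for I/O-bound handlers.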
With asynchronous workers, each worker process can handle dozens or even hundreds of requests simultaneously, depending on application characteristics and system resources. This means that even with fewer worker processes, high concurrent processing capability can be achieved, particularly suitable for I/O-intensive applications.
Configuration Selection and Performance Considerations
Choosing appropriate worker configurations requires considering multiple factors. For CPU-intensive applications, synchronous workers may be more suitable because Python's Global Interpreter Lock (GIL) limits CPU parallelism in multithreading. In this case, increasing the number of worker processes directly improves CPU utilization.
For I/O-intensive applications, asynchronous workers typically provide better performance. Since coroutines can context-switch while waiting for I/O, a single process can effectively use waiting time to handle other requests. This pattern reduces inter-process switching overhead while lowering memory consumption.
Actual deployment must also consider hardware limitations. Each worker process consumes memory, and too many processes may cause insufficient memory. Additionally, the number of workers should not exceed CPU cores significantly to avoid excessive context-switching overhead. Load testing is recommended to determine optimal configurations.
Alternative Concurrency Options
Beyond Gunicorn's worker configurations, Flask's development server itself offers some concurrency options, which are passed through to Werkzeug. app.run(threaded=True) enables thread-per-request mode (the default since Flask 1.0), while app.run(processes=3) forks up to three processes to handle requests. The two options are mutually exclusive, and process mode is available only on POSIX systems.
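A sketch of both options, assuming a minimal single-module app (the route and module layout are illustrative):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello"

if __name__ == "__main__":
    # One process, a new thread per request (the default since Flask 1.0):
    app.run(threaded=True)

    # Alternatively, fork up to 3 processes to handle requests
    # (POSIX only; cannot be combined with threaded=True):
    # app.run(threaded=False, processes=3)
```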
It is important to note that Flask's official documentation explicitly states that the built-in server is unsuitable for production environments, primarily due to limited scalability. These options are mainly for development and testing scenarios; production deployments should use dedicated WSGI servers like Gunicorn or uWSGI.
Best Practice Recommendations
For most production environments, starting with Gunicorn synchronous workers is recommended. An initial configuration can use the empirical formula 2 * CPU cores + 1 for the number of workers. Monitor system resource usage, particularly memory consumption and CPU utilization.
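The formula can be wired into the Gunicorn config file itself so that the worker count tracks the host's core count, a pattern the Gunicorn documentation suggests as a starting point:

```python
# gunicorn.conf.py -- derive the worker count from the host's CPU cores
import multiprocessing

# Empirical starting point: (2 x cores) + 1 workers.
workers = multiprocessing.cpu_count() * 2 + 1
```

This is a baseline to tune under load testing, not a hard rule.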
If the application involves substantial I/O waiting operations, consider switching to asynchronous workers. Before migration, conduct thorough testing to ensure all dependent libraries are compatible with asynchronous mode. Some synchronously-called libraries may need replacement with asynchronous versions or appropriate adapters.
Regardless of the chosen configuration, implement comprehensive monitoring and logging. Monitor key metrics including request processing time, error rates, and worker process memory usage. Conduct regular load testing and adjust configuration parameters based on actual traffic patterns.