Optimizing Multi-Table Aggregate Queries in MySQL Using UNION and GROUP BY

Keywords: MySQL | UNION ALL | GROUP BY

Abstract: This article delves into the technical details of using UNION ALL with GROUP BY clauses for multi-table aggregate queries in MySQL. Through a practical case study, it analyzes issues of data duplication caused by improper grouping logic in the original query and proposes a solution based on the best answer, utilizing subqueries and external aggregation. It explains core principles such as the usage of UNION ALL, timing of grouping aggregation, and how to avoid common errors, with code examples and performance considerations to help readers master efficient techniques for complex data aggregation tasks.

Introduction

In database queries, it is often necessary to aggregate data from multiple related tables to generate summary reports. MySQL, as a widely used relational database management system, offers robust SQL capabilities to handle such requirements. However, when dealing with multi-table union queries combined with grouping aggregation, developers may encounter logical errors or performance issues. Based on a specific case, this article explores how to optimize queries using UNION ALL and GROUP BY clauses to ensure data accuracy and efficiency.

Problem Analysis

The original query aims to count the number of vehicles owned by each owner, including motorcycles and cars. The initial query uses two separate SELECT statements to count from the motorbike and car tables, respectively, and merges the results via UNION ALL. The query structure is as follows:

(SELECT COUNT(motorbike.`owner_id`) as count, owner.`name`, transport.`type` FROM transport, owner, motorbike WHERE transport.type='motobike' AND owner.`owner_id`=motorbike.`owner_id` AND transport.`type_id`=motorbike.`motorbike_id` GROUP BY motorbike.owner_id) UNION ALL (SELECT COUNT(car.`owner_id`) as count, owner.`name`, transport.`type` FROM transport, owner, car WHERE transport.type='car' AND owner.`owner_id`=car.`owner_id` AND transport.`type_id`=car.`car_id` GROUP BY car.`owner_id`)

This query returns results where the owner Linda appears twice, corresponding to counts for motorcycles and cars, leading to data duplication. This occurs because UNION ALL directly merges the two result sets without further aggregation at the outer level. An erroneous attempt to place GROUP BY outside the UNION ALL failed due to ambiguous column references.

Solution

The best answer resolves this issue by introducing a subquery and external GROUP BY. The core idea is to first merge the results of two subqueries using UNION ALL, then aggregate the combined data externally. The optimized query is:

SELECT SUM(qty), name FROM ( SELECT COUNT(m.owner_id) as qty, o.name FROM transport t, owner o, motorbike m WHERE t.type='motobike' AND o.owner_id=m.owner_id AND t.type_id=m.motorbike_id GROUP BY m.owner_id UNION ALL SELECT COUNT(c.owner_id) as qty, o.name FROM transport t, owner o, car c WHERE t.type='car' AND o.owner_id=c.owner_id AND t.type_id=c.car_id GROUP BY c.owner_id ) t GROUP BY name

This query first performs grouped counts on the motorcycle and car tables within subqueries, uses UNION ALL to retain all rows (including duplicates), and then applies SUM function and GROUP BY name at the outer level to summarize counts per owner. Thus, Linda's two records are merged into one with a total count of 2.

Technical Details

The UNION ALL operator is used to combine the result sets of two or more SELECT statements, retaining all rows, including duplicates, unlike UNION (which automatically removes duplicates). In this case, UNION ALL is appropriate because the subqueries have already deduplicated via GROUP BY, but data from different vehicle types needs merging.

The timing of grouping aggregation is crucial: performing initial grouping in subqueries reduces data volume and improves performance, while final aggregation externally ensures correct summarization across types. For example, subqueries group by owner_id, generating counts per owner per vehicle type, and the outer query groups by name, using SUM to compute totals.

The code uses implicit joins (comma-separated tables), equivalent to INNER JOIN. For instance, FROM transport t, owner o, motorbike m has the same effect as FROM transport t INNER JOIN owner o ON o.owner_id=m.owner_id INNER JOIN motorbike m ON t.type_id=m.motorbike_id, but explicit JOIN syntax is more readable and maintainable.

Performance and Scalability

This optimized solution performs well with large datasets, as subqueries reduce the size of intermediate result sets. If indexes exist on columns like owner_id or type_id, query speed is further enhanced. For more complex scenarios, such as adding more vehicle types, one can simply append similar subqueries to UNION ALL, making the code easily extensible.

Note that UNION ALL requires all SELECT statements to have compatible column counts and data types. In this example, both subqueries return qty (integer) and name (string), ensuring consistency.

Conclusion

By combining UNION ALL and GROUP BY, one can efficiently handle multi-table aggregate queries, avoiding data duplication and enhancing readability. This article's case study demonstrates the full process from problem identification to solution, emphasizing the importance of query structure design. In practice, it is recommended to adjust grouping logic and indexing strategies based on specific needs to optimize database performance.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.