Comprehensive Application of Group Aggregation and Join Operations in SQL Queries: A Case Study on Querying Top-Scoring Students

Keywords: SQL Query | Group Aggregation | Join Operations | Top Score Query | Student Grades

Abstract: This article delves into the integration of group aggregation and join operations in SQL queries, using the Amazon interview question 'query students with the highest marks in each subject' as a case study. It analyzes common errors and provides multiple solutions. The discussion begins by dissecting the flaws in the original incorrect query, then progressively constructs correct queries covering methods such as subqueries, IN operators, JOIN operations, and window functions. By comparing the strengths and weaknesses of different answers, it extracts core principles of SQL query design: problem decomposition, understanding data relationships, and selecting appropriate aggregation methods. The article includes detailed code examples and logical analysis to help readers master techniques for building complex queries.

Problem Background and Error Analysis

In database querying, complex scenarios often involve combining group aggregation with multi-table joins. This article uses an Amazon interview question as an example to explore how to query the names of students who scored the highest marks in each subject. The original problem involves three tables: Student (student information), Subject (subject information), and Marks (grade records).

The initial incorrect query provided by the interviewee was:

select A.Stname from A as Student, B as 
(select Stid, Subid, max(mark) from Marks groupby Subid) where A.Stid = B.Stid

This query has several issues:

Syntax error: groupby should be GROUP BY.
Logical error: In the subquery SELECT Stid, Subid, max(mark), Stid is not included in the GROUP BY clause, causing the aggregate function MAX(mark) to fail in correctly associating with specific students.
Structural problem: It does not properly handle the correspondence between subjects and highest marks, and ignores cases where multiple students share the same top score.

Constructing Correct Solutions

The key to solving such problems lies in decomposing complex queries into manageable steps. Below is a step-by-step construction based on the best answer:

Step 1: Determine the Highest Mark per Subject

First, identify the highest score for each subject. This can be achieved by grouping the Marks table by SubjectID and using the MAX aggregate function:

SELECT SubjectID, MAX(MarkRate) AS MaxMarkRate
FROM Mark
GROUP BY SubjectID;

This query returns each subject's ID and its corresponding highest score, providing foundational data for subsequent join operations.

Step 2: Associate Highest Marks with Students

Next, find which students achieved these highest marks. Use the IN operator to link the results from Step 1 with the original Marks table:

SELECT SubjectID, MarkRate, StudentID
FROM Mark
WHERE (SubjectID, MarkRate) IN (
    SELECT SubjectID, MAX(MarkRate)
    FROM Mark
    GROUP BY SubjectID
)
ORDER BY SubjectID, StudentID;

Here, (SubjectID, MarkRate) IN (...) is a tuple test that matches records where both the subject ID and mark satisfy the conditions. This method automatically handles cases where multiple students share the top score, including all such students.

Step 3: Add Student and Subject Names

Finally, replace IDs with readable names by joining the Student and Subject tables:

SELECT SubjectName, MarkRate, StudentName
FROM Mark
JOIN Student USING(StudentID)
JOIN Subject USING(SubjectID)
WHERE (SubjectID, MarkRate) IN (
    SELECT SubjectID, MAX(MarkRate)
    FROM Mark
    GROUP BY SubjectID
)
ORDER BY SubjectName, StudentName;

The USING clause simplifies join conditions, assuming consistent column names. The results are ordered by subject and student names for readability.

Alternative Approaches and Optimizations

Beyond the IN operator, the same logic can be implemented using JOIN. Treat the subquery result as a derived table for joining:

SELECT SubjectName, MarkRate, StudentName
FROM Mark
JOIN Student USING(StudentID)
JOIN Subject USING(SubjectID)
JOIN (
    SELECT SubjectID, MAX(MarkRate) AS MarkRate
    FROM Mark
    GROUP BY SubjectID
) AS MaxMarks USING(SubjectID, MarkRate)
ORDER BY SubjectName, StudentName;

This approach more explicitly expresses data relationships semantically but requires aliasing the subquery and columns, resulting in slightly longer code. For databases that do not support tuple IN, this is a necessary alternative.

Another efficient method involves window functions, as shown in Answer 2:

SELECT Name, SubjectID, Mark
FROM (
    SELECT Name, SubjectID, Mark,
           DENSE_RANK() OVER (PARTITION BY SubjectID ORDER BY Mark DESC) AS Rank
    FROM Student st
    JOIN Marks mk ON st.StudentID = mk.StudentID
) AS RankedMarks
WHERE Rank = 1;

DENSE_RANK() assigns ranks to marks within each subject, with ties receiving the same rank. This method is more intuitive logically but may not be available in some older database versions.

Summary of Core Knowledge Points

From this case study, the following principles of SQL query design can be extracted:

Problem Decomposition: Break complex queries into simple steps, such as aggregating first and then joining.
Aggregation Understanding: Use GROUP BY correctly, ensuring non-aggregated columns are either in the group or within aggregate functions.
Data Relationships: Leverage IN or JOIN to associate subquery results, paying attention to handling multiple matches.
Code Readability: Choose semantically clear constructs, such as IN for membership tests and JOIN for relationship expression.
Compatibility Considerations: Select solutions based on database features; for example, window functions suit modern systems, while traditional methods are more universal.

In practical applications, it is advisable to choose the optimal solution based on specific requirements and data volume. For instance, window functions may be more efficient for large datasets, while IN or JOIN subqueries are easier to maintain for simpler queries.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.