Implementing SELECT DISTINCT on a Single Column in SQL Server

Abstract: This article explains how to return one row per distinct value of a single column in SQL Server while keeping the other columns of that row. It shows why the DISTINCT keyword alone cannot do this, presents a solution based on the ROW_NUMBER() window function in a derived table, and compares it with a GROUP BY approach. Complete code examples, index recommendations, and compatibility notes are included.

Problem Context and Challenges

In database application development, it is often necessary to return only one row per distinct value of a single column. Consider the following sample data:

ID  SKU     PRODUCT
=======================
1   FOO-23  Orange
2   BAR-23  Orange
3   FOO-24  Apple
4   FOO-25  Orange

The goal is one row per distinct PRODUCT value among the rows whose SKU starts with FOO-, keeping the ID and SKU of the row with the lowest ID:

1   FOO-23  Orange
3   FOO-24  Apple

A plain SELECT DISTINCT cannot produce this result because DISTINCT applies to the combination of all selected columns. Since every (ID, SKU, PRODUCT) combination in the sample data is unique, no rows would be removed.

Analysis of DISTINCT Keyword Limitations

The standard SQL DISTINCT keyword removes duplicate rows from a query result, but it compares entire rows rather than individual columns. When a query selects multiple columns, two rows count as duplicates only if they match in every selected column.
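The row-level behavior can be seen directly on the sample data. The following is a minimal sketch, assuming SQL Server 2008 or later (for the multi-row VALUES syntax); the table variable name @TestData is an illustrative stand-in for the real table.

```sql
-- Sample data from the article, loaded into a table variable (illustrative name).
DECLARE @TestData TABLE (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO @TestData (ID, SKU, PRODUCT)
VALUES (1, 'FOO-23', 'Orange'),
       (2, 'BAR-23', 'Orange'),
       (3, 'FOO-24', 'Apple'),
       (4, 'FOO-25', 'Orange');

-- DISTINCT compares whole rows: every (ID, SKU, PRODUCT) combination is unique,
-- so all four rows are returned.
SELECT DISTINCT ID, SKU, PRODUCT
FROM   @TestData;

-- With PRODUCT as the only selected column the duplicates collapse to two rows
-- (Apple and Orange), but ID and SKU are no longer available.
SELECT DISTINCT PRODUCT
FROM   @TestData;
```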

Consider a typical first attempt:

SELECT [ID],[SKU],[PRODUCT]
FROM [TestData] 
WHERE ([PRODUCT] = 
(SELECT DISTINCT [PRODUCT] FROM [TestData] WHERE ([SKU] LIKE 'FOO-%')))
ORDER BY [ID]

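This approach does not produce the desired result. The subquery returns every distinct PRODUCT value among the FOO- rows (here Orange and Apple), and SQL Server raises an error when a subquery compared with = returns more than one value. Even if = is replaced with IN, the outer query still returns every row whose PRODUCT appears in that list, duplicates included.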
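To illustrate why filtering on the list of distinct PRODUCT values does not help, here is a sketch of the same idea written with IN instead of =, which avoids the multi-value subquery error. It assumes SQL Server 2008 or later, and @TestData is again an illustrative table variable holding the sample rows.

```sql
-- Sample data from the article (illustrative table variable).
DECLARE @TestData TABLE (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO @TestData (ID, SKU, PRODUCT)
VALUES (1, 'FOO-23', 'Orange'),
       (2, 'BAR-23', 'Orange'),
       (3, 'FOO-24', 'Apple'),
       (4, 'FOO-25', 'Orange');

-- The subquery yields the list (Orange, Apple). The outer filter therefore only
-- says "PRODUCT is Orange or Apple", which every row in the table satisfies.
SELECT ID, SKU, PRODUCT
FROM   @TestData
WHERE  PRODUCT IN (SELECT DISTINCT PRODUCT
                   FROM   @TestData
                   WHERE  SKU LIKE 'FOO-%')
ORDER BY ID;
-- Returns all four rows, including the BAR-23 row.
```

The DISTINCT inside the subquery only deduplicates the list of values being compared against; it has no effect on how many outer rows match.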

ROW_NUMBER() Based Solution

Core Principles

ROW_NUMBER() is a window function in SQL Server that assigns a sequential number, starting at 1, to each row of a result set. With a PARTITION BY clause, the numbering restarts for each group of rows that share the same partition column values, and the ORDER BY clause inside OVER determines the order in which the rows of each partition are numbered.

Implementation Code

Below is the complete solution based on ROW_NUMBER():

SELECT a.ID, a.SKU, a.Product
FROM   (SELECT ID, 
               SKU, 
               Product,
               ROW_NUMBER() OVER (PARTITION BY PRODUCT ORDER BY ID) AS RowNumber
        FROM   MyTable
        WHERE  SKU LIKE 'FOO%') AS a
WHERE  a.RowNumber = 1

Code Analysis

1. PARTITION BY PRODUCT: Groups by the PRODUCT column, with each distinct PRODUCT value forming an independent partition

2. ORDER BY ID: Within each partition, sorts in ascending order by the ID column

3. ROW_NUMBER(): Assigns sequence numbers to rows within each partition, starting from 1

4. The outer query filters to select only the first record from each group using WHERE a.RowNumber = 1
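The same logic can also be expressed with a common table expression (CTE) instead of a derived table; the two forms are equivalent, and the choice is mostly a matter of readability. The sketch below is self-contained and assumes SQL Server 2008 or later. The table variable @MyTable and the CTE name Ranked are illustrative.

```sql
-- Sample data from the article (illustrative table variable).
DECLARE @MyTable TABLE (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO @MyTable (ID, SKU, PRODUCT)
VALUES (1, 'FOO-23', 'Orange'),
       (2, 'BAR-23', 'Orange'),
       (3, 'FOO-24', 'Apple'),
       (4, 'FOO-25', 'Orange');

-- The statement before WITH must be terminated with a semicolon.
WITH Ranked AS (
    SELECT ID,
           SKU,
           PRODUCT,
           -- Number the FOO rows of each product, lowest ID first.
           ROW_NUMBER() OVER (PARTITION BY PRODUCT ORDER BY ID) AS RowNumber
    FROM   @MyTable
    WHERE  SKU LIKE 'FOO%'
)
SELECT ID, SKU, PRODUCT
FROM   Ranked
WHERE  RowNumber = 1      -- keep only the first row of each partition
ORDER BY ID;
-- Returns (1, FOO-23, Orange) and (3, FOO-24, Apple).
```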

Execution Process Analysis

Using the sample data, the execution process is as follows:

1. Inner query result:

ID  SKU     PRODUCT  RowNumber
1   FOO-23  Orange   1
4   FOO-25  Orange   2
3   FOO-24  Apple    1

2. Outer query filtered result:

1   FOO-23  Orange
3   FOO-24  Apple

Alternative Approach: GROUP BY Method

Basic Principles

Combining GROUP BY with aggregate functions can produce a similar result, but every non-grouped column must be wrapped in an aggregate function, and each aggregate is evaluated independently of the others.

Implementation Example

SELECT MIN(ID) AS ID,
       MIN(SKU) AS SKU,
       PRODUCT
FROM   MyTable
WHERE  SKU LIKE 'FOO%'
GROUP BY PRODUCT

Applicable Scenarios and Limitations

The GROUP BY method is suitable for:

• Scenarios requiring specific extreme values (minimum, maximum) from each group

• Situations where there are no special requirements for non-grouped column value selection

However, it has the following limitations:

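• Each aggregate is computed independently, so MIN(ID) and MIN(SKU) may come from different rows, and the result may not correspond to any actual record

• Lacks flexibility when the row to keep is not defined by a minimum or maximum value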
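The first limitation is easy to miss because the article's sample data happens to hide it. The sketch below uses two hypothetical rows (not part of the original sample) in which the lowest ID and the lowest SKU belong to different records. It assumes SQL Server 2008 or later; @MyTable is an illustrative table variable.

```sql
-- Hypothetical data: the row with the lowest ID does not have the lowest SKU.
DECLARE @MyTable TABLE (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO @MyTable (ID, SKU, PRODUCT)
VALUES (1, 'FOO-25', 'Orange'),
       (4, 'FOO-23', 'Orange');

-- Each MIN() is evaluated on its own column across the whole group.
SELECT MIN(ID)  AS ID,
       MIN(SKU) AS SKU,
       PRODUCT
FROM   @MyTable
WHERE  SKU LIKE 'FOO%'
GROUP BY PRODUCT;
-- Returns (1, FOO-23, Orange): the ID comes from the first row and the SKU
-- from the second, so this combination does not exist in the table.
```

When the returned columns must all come from one real row, the ROW_NUMBER() approach is the safer choice.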

Performance Comparison and Optimization Recommendations

ROW_NUMBER() Solution Advantages

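1. High flexibility: A different row can be chosen from each group simply by changing the ORDER BY clause

2. Precise control: The row kept from each group is stated explicitly, and all returned columns come from that same row

3. Expressive ordering: The ORDER BY clause accepts multiple columns and expressions such as CASE, so priority rules can be encoded directly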

Index Optimization Recommendations

To improve query performance, it is recommended to create appropriate indexes:

CREATE INDEX IX_MyTable_Product_ID 
ON MyTable (PRODUCT, ID) INCLUDE (SKU)

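Because the index key order matches the PARTITION BY and ORDER BY columns, the optimizer can read rows already in the required order and typically avoid a separate sort step. Including SKU makes the index covering for this query, so the base table does not need to be read. Note that the SKU LIKE 'FOO%' predicate is still evaluated against every index row, since SKU is not a key column; the query scans the index rather than seeking into it.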
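Whether the index actually helps depends on the data volume and on the plan the optimizer chooses, so it is worth measuring. The following sketch shows one way to do that with SET STATISTICS IO. It builds a temporary table named #MyTable (an illustrative stand-in for the real table) and assumes SQL Server 2008 or later; with only four rows the numbers are trivial, so a realistic test should load representative data.

```sql
-- Illustrative temporary table holding the sample rows.
CREATE TABLE #MyTable (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO #MyTable (ID, SKU, PRODUCT)
VALUES (1, 'FOO-23', 'Orange'),
       (2, 'BAR-23', 'Orange'),
       (3, 'FOO-24', 'Apple'),
       (4, 'FOO-25', 'Orange');

-- Same index definition as above, applied to the temporary table.
CREATE INDEX IX_MyTable_Product_ID
ON #MyTable (PRODUCT, ID) INCLUDE (SKU);

-- Report logical reads per table in the Messages output.
SET STATISTICS IO ON;

SELECT a.ID, a.SKU, a.PRODUCT
FROM   (SELECT ID,
               SKU,
               PRODUCT,
               ROW_NUMBER() OVER (PARTITION BY PRODUCT ORDER BY ID) AS RowNumber
        FROM   #MyTable
        WHERE  SKU LIKE 'FOO%') AS a
WHERE  a.RowNumber = 1;

SET STATISTICS IO OFF;

DROP TABLE #MyTable;
```

Comparing the logical reads and the execution plan with and without the index (in particular, whether a Sort operator appears) gives a concrete basis for deciding whether to keep it.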

Extended Practical Application Scenarios

Selecting Latest Records

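To select the latest record for each product (assuming a timestamp column named CreateTime exists), order each partition by that column in descending order. If two rows in a partition can share the same CreateTime, add a unique column such as ID to the ORDER BY as a tie-breaker; otherwise the choice between them is not deterministic: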

SELECT *
FROM   (SELECT ID, 
               SKU, 
               Product,
               CreateTime,
               ROW_NUMBER() OVER (PARTITION BY PRODUCT ORDER BY CreateTime DESC) AS RowNumber
        FROM   MyTable
        WHERE  SKU LIKE 'FOO%') AS a
WHERE  a.RowNumber = 1
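One detail worth checking in the query above is what happens when two rows of the same product share a timestamp: ROW_NUMBER() still numbers them 1 and 2, but which one gets 1 is not guaranteed. The sketch below adds ID DESC as a tie-breaker so the result is repeatable. The CreateTime values are hypothetical, @MyTable is an illustrative table variable, and SQL Server 2008 or later is assumed.

```sql
-- Hypothetical data: IDs 1 and 4 have identical timestamps.
DECLARE @MyTable TABLE (
    ID         INT PRIMARY KEY,
    SKU        VARCHAR(20),
    PRODUCT    VARCHAR(20),
    CreateTime DATETIME
);

INSERT INTO @MyTable (ID, SKU, PRODUCT, CreateTime)
VALUES (1, 'FOO-23', 'Orange', '20240101'),
       (4, 'FOO-25', 'Orange', '20240101'),
       (3, 'FOO-24', 'Apple',  '20240102');

SELECT a.ID, a.SKU, a.PRODUCT, a.CreateTime
FROM   (SELECT ID,
               SKU,
               PRODUCT,
               CreateTime,
               -- Newest first; on equal timestamps the higher ID wins.
               ROW_NUMBER() OVER (PARTITION BY PRODUCT
                                  ORDER BY CreateTime DESC, ID DESC) AS RowNumber
        FROM   @MyTable
        WHERE  SKU LIKE 'FOO%') AS a
WHERE  a.RowNumber = 1
ORDER BY a.ID;
-- Returns ID 3 for Apple and ID 4 for Orange.
```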

Selecting Records with Specific Conditions

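An ORDER BY expression can prioritize rows that meet an extra condition. The pattern in the CASE expression must be able to match rows that pass the WHERE filter; a pattern such as 'SPECIAL%' could never match a SKU that starts with FOO, so the FOO-SPECIAL prefix below is used as an illustrative assumption:

SELECT *
FROM   (SELECT ID,
               SKU,
               Product,
               ROW_NUMBER() OVER (PARTITION BY PRODUCT ORDER BY
                   CASE WHEN SKU LIKE 'FOO-SPECIAL%' THEN 0 ELSE 1 END, ID) AS RowNumber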
        FROM   MyTable
        WHERE  SKU LIKE 'FOO%') AS a
WHERE  a.RowNumber = 1

Compatibility Considerations

The ROW_NUMBER() method described in this article is applicable to SQL Server 2005 and later versions. For earlier versions, correlated subqueries or other methods can be considered to achieve similar functionality.
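For versions without ROW_NUMBER(), one possible approach is a correlated subquery that keeps the row whose ID equals the minimum ID for its product. The sketch below is an illustration of that idea rather than a drop-in replacement: it relies on ID being unique, and it uses single-row INSERT statements because multi-row VALUES is not available in older versions. @MyTable is an illustrative table variable.

```sql
-- Sample data from the article (illustrative table variable).
DECLARE @MyTable TABLE (ID INT PRIMARY KEY, SKU VARCHAR(20), PRODUCT VARCHAR(20));

INSERT INTO @MyTable (ID, SKU, PRODUCT) VALUES (1, 'FOO-23', 'Orange');
INSERT INTO @MyTable (ID, SKU, PRODUCT) VALUES (2, 'BAR-23', 'Orange');
INSERT INTO @MyTable (ID, SKU, PRODUCT) VALUES (3, 'FOO-24', 'Apple');
INSERT INTO @MyTable (ID, SKU, PRODUCT) VALUES (4, 'FOO-25', 'Orange');

SELECT t.ID, t.SKU, t.PRODUCT
FROM   @MyTable AS t
WHERE  t.SKU LIKE 'FOO%'
  -- Keep the row only if its ID is the lowest among the FOO rows of the same product.
  AND  t.ID = (SELECT MIN(t2.ID)
               FROM   @MyTable AS t2
               WHERE  t2.PRODUCT = t.PRODUCT
                 AND  t2.SKU LIKE 'FOO%')
ORDER BY t.ID;
-- Returns (1, FOO-23, Orange) and (3, FOO-24, Apple).
```

Note that the SKU filter has to be repeated inside the subquery; otherwise the minimum ID could belong to a row that the outer filter excludes.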

Compatibility with other database systems:

• MySQL: Supports window functions starting from version 8.0

• PostgreSQL: Supports window functions starting from version 8.4

• Oracle: Has supported window (analytic) functions since Oracle 8i

Conclusion

Performing distinct operations based on a single column is a common requirement in database development. By appropriately utilizing window functions and grouping techniques, this goal can be efficiently achieved. The ROW_NUMBER() method provides maximum flexibility and precise control, making it the preferred solution in modern SQL development. In practical applications, the most suitable implementation approach should be selected based on specific business requirements and data characteristics, with attention to performance optimization and index design.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.