Combining DISTINCT with ROW_NUMBER() in SQL: An In-Depth Analysis for Assigning Row Numbers to Unique Values

Keywords: SQL | DISTINCT | ROW_NUMBER

Abstract: This article explores the common challenges and solutions when combining the DISTINCT keyword with the ROW_NUMBER() window function in SQL queries. By analyzing a real-world user case, it explains why directly using DISTINCT and ROW_NUMBER() together often yields unexpected results and presents three effective approaches: using subqueries or CTEs to first obtain unique values and then assign row numbers, replacing ROW_NUMBER() with DENSE_RANK(), and adjusting window function behavior via the PARTITION BY clause. The article also compares ROW_NUMBER(), RANK(), and DENSE_RANK() functions and discusses the impact of SQL query execution order on results. These methods are applicable in scenarios requiring sequential numbering of unique values, such as serializing deduplicated data.

Problem Background and Core Challenge

In SQL queries, users often need to assign consecutive row numbers to unique values in a dataset. A typical scenario involves extracting all distinct values from a column with duplicates and numbering them sequentially. The initial attempt might be:

SELECT DISTINCT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64

However, this query does not return only unique id values as expected; instead, it includes all original rows with row numbers assigned to each. In contrast, a simple SELECT DISTINCT id FROM table WHERE fid = 64 correctly returns unique values. This discrepancy highlights a conflict between DISTINCT and ROW_NUMBER() due to SQL query execution order.

Solution 1: Using Subqueries or Common Table Expressions (CTEs)

The most straightforward approach is to first obtain unique values via a subquery and then assign row numbers to them. This method clearly separates the deduplication and row numbering steps, avoiding confusion in execution order. Two equivalent implementations are:

SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM
    (SELECT DISTINCT id FROM table WHERE fid = 64) Base

Or using a CTE for better readability:

; WITH Base AS (
    SELECT DISTINCT id FROM table WHERE fid = 64
)

SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM Base

Both queries first select all unique id values from the table that meet the condition fid = 64, then assign a row number to each row in the result set based on ordering by id. Since deduplication occurs in the subquery or CTE, the ROW_NUMBER() function operates only on the deduplicated results, ensuring a one-to-one correspondence between row numbers and unique values.

Solution 2: Using the DENSE_RANK() Function

Another method is to replace ROW_NUMBER() with the DENSE_RANK() window function. DENSE_RANK() assigns the same rank to identical values, with no gaps between ranks, making it ideal for assigning consecutive numbers to unique values. Example query:

SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64

In this query, DENSE_RANK() first assigns ranks to all rows (identical id values receive the same rank), then DISTINCT removes duplicate rows. Since ranks are computed before deduplication and are consistent for identical values, each unique value retains its rank as the row number after deduplication. For instance, if the original data has multiple duplicate id values "a", they all receive rank 1, and after deduplication, only one "a" remains with row number 1.

Solution 3: Combining with the PARTITION BY Clause

For more complex cases, such as when deduplication involves multiple fields, the PARTITION BY clause can be used to adjust window function behavior. This method allows specifying partitions in ROW_NUMBER() to ensure row numbers are unique within partitions, but note that DISTINCT must include all partition fields. Example query:

SELECT DISTINCT id, description,
    ROW_NUMBER() OVER (PARTITION BY id, description ORDER BY id) AS RowNum 
    FROM table
    WHERE fid = 64

Here, ROW_NUMBER() assigns row numbers to each unique (id, description) combination (starting from 1 within each partition due to PARTITION BY). Combined with DISTINCT id, description, this ensures each combination appears only once in the result with the correct row number. However, this approach requires the DISTINCT list to match the PARTITION BY list exactly; otherwise, duplicates may still occur.

Technical Principles Deep Dive

Understanding these solutions hinges on grasping the logical execution order of SQL queries. According to SQL standards, window functions like ROW_NUMBER() are computed in the SELECT clause before DISTINCT is applied. This means during query execution:

First, the FROM and WHERE clauses filter the data.
Then, window functions are calculated, assigning row numbers to each row.
Finally, DISTINCT removes duplicate rows, but by this point, row numbers are already attached to each row, potentially making deduplication ineffective (since row numbers make each row unique).

For example, consider a table with duplicate values "a"; ROW_NUMBER() might assign row numbers 1, 2, 3 to these rows, causing each to appear unique when DISTINCT is applied, thus preventing deduplication. In contrast, DENSE_RANK(), by assigning the same rank to identical values, retains a consistent rank as the row number after deduplication.

Function Comparison and Use Cases

To illustrate the behavior of different window functions more clearly, consider this comparison example, assuming table t has a column v:

SELECT
  v,
  ROW_NUMBER() OVER (ORDER BY v) row_number,
  RANK()       OVER (ORDER BY v) rank,
  DENSE_RANK() OVER (ORDER BY v) dense_rank
FROM t
ORDER BY v

Possible results:

+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a |          1 |    1 |          1 |
| a |          2 |    1 |          1 |
| a |          3 |    1 |          1 |
| b |          4 |    4 |          2 |
| c |          5 |    5 |          3 |
| c |          6 |    5 |          3 |
| d |          7 |    7 |          4 |
| e |          8 |    8 |          5 |
+---+------------+------+------------+

ROW_NUMBER(): Assigns a unique sequential number to each row, regardless of duplicate values. Suitable for scenarios requiring absolute unique identifiers, but problematic when combined directly with DISTINCT.
RANK(): Assigns the same rank to identical values, but ranks may have gaps (e.g., skipping 2 and 3). Ideal for competition rankings.
DENSE_RANK(): Assigns the same rank to identical values, with consecutive ranks. Best suited for combining with DISTINCT to assign consecutive row numbers to unique values.

In practice, the choice depends on specific requirements:

If strict row numbering based on deduplicated results is needed, the subquery or CTE method (Solution 1) is recommended for its clarity and maintainability.
For simple queries requiring only unique value numbering, DENSE_RANK() (Solution 2) offers a concise and effective option.
When deduplication involves multiple fields, the PARTITION BY method (Solution 3) can be considered, but field consistency must be ensured.

Considerations and Best Practices

When implementing these solutions, keep the following points in mind:

Data Model Considerations: If id is the primary key of the table, it is inherently unique, and using DISTINCT might be redundant. However, in complex queries involving JOINs or UNIONs, deduplication may still be necessary.
Performance Implications: Subqueries and CTEs can increase query complexity, but on large datasets, deduplicating first and then assigning row numbers is often more efficient than directly combining DISTINCT and ROW_NUMBER(), as it reduces the data volume for window function computation.
Standard Compliance: The window functions and syntax discussed follow SQL standards, but implementations may vary across database systems (e.g., MySQL, PostgreSQL, SQL Server); testing in the target environment is advised.
Code Readability: Using CTEs can enhance the readability and maintainability of complex queries, especially with multiple nesting levels.

In summary, by understanding SQL execution order and window function characteristics, developers can flexibly apply subqueries, DENSE_RANK(), or PARTITION BY to resolve the synergy between DISTINCT and ROW_NUMBER(), efficiently meeting data deduplication and serialization needs.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.