In-depth Analysis and Practice of Obtaining Unique Value Aggregation Using STRING_AGG in SQL Server

Keywords: SQL Server | STRING_AGG | unique value aggregation

Abstract: This article explores how to combine the STRING_AGG function with a DISTINCT subquery to aggregate unique values into a delimited string in SQL Server 2017 and later. Working through a specific case, it covers the problem description, the solution, and performance considerations, including removing duplicates in a subquery and producing ordered output with STRING_AGG. It also compares alternatives such as a custom function and discusses best practices and caveats for real-world use.

Problem Background and Scenario Analysis

In database operations, it is common to handle string fields containing delimiters, such as columns storing multiple identifiers. SQL Server provides the STRING_SPLIT function to split these strings, but when aggregating results, duplicate values may arise. In a practical case, consider a data table [dbo].[Data] with ProjectID and bID fields, where bID stores multiple values separated by semicolons. The user needs to query for specific values (e.g., 'O95833' and 'Q96NY7-2') and aggregate them by ProjectID, outputting a unique string list.

The initial query uses CROSS APPLY with STRING_SPLIT to split the bID field and filters target values via a WHERE clause. However, directly applying the STRING_AGG function includes duplicates in the aggregated result, e.g., for ProjectID 2, the output is "O95833,O95833,O95833,Q96NY7-2,Q96NY7-2,Q96NY7-2", which does not meet the user's requirement for unique value aggregation. The core issue is how to remove duplicates before aggregation to ensure concise and accurate output.
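
The situation can be reproduced with a small sketch. The temporary table #Data and its rows are hypothetical stand-ins for [dbo].[Data] (the original rows are not shown here), and the query is a reconstruction of the kind of statement described above, not the user's exact code.

```sql
-- Hypothetical sample rows standing in for [dbo].[Data].
CREATE TABLE #Data (ProjectID INT NOT NULL, bID VARCHAR(200) NOT NULL);

INSERT INTO #Data (ProjectID, bID)
VALUES (2, 'O95833;Q96NY7-2;P12345'),
       (2, 'Q96NY7-2;O95833'),
       (4, 'Q96NY7-2;P67890');

-- Naive aggregation: every matching fragment from every row is concatenated,
-- so a value that occurs in several rows of one project is repeated.
SELECT d.ProjectID,
       STRING_AGG(s.value, ',') WITHIN GROUP (ORDER BY s.value) AS NewField
FROM #Data AS d
CROSS APPLY STRING_SPLIT(d.bID, ';') AS s
WHERE s.value IN ('O95833', 'Q96NY7-2')
GROUP BY d.ProjectID
ORDER BY d.ProjectID;

DROP TABLE #Data;
```

With these sample rows, ProjectID 2 comes back as "O95833,O95833,Q96NY7-2,Q96NY7-2", because both target values occur in two source rows.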

Core Solution: Using DISTINCT with Subqueries

STRING_AGG in SQL Server 2017 has no DISTINCT option of its own, so the straightforward fix is to remove duplicates in a subquery before aggregating. The subquery uses SELECT DISTINCT to return each unique combination of ProjectID and newId.value, which ensures each value appears only once per ProjectID. The outer query then applies STRING_AGG to the subquery result, ordering the concatenated values by value.

Example code:

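-- Outer query: concatenate the de-duplicated values per project, in sorted order.
SELECT
    x.ProjectID,
    STRING_AGG(x.value, ',') WITHIN GROUP (ORDER BY x.value) AS NewField
FROM (
    -- Inner query: split bID on ';' and keep each (ProjectID, value) pair once.
    SELECT DISTINCT
        d.ProjectID,
        newId.value
    FROM [dbo].[Data] AS d WITH (NOLOCK)
    CROSS APPLY STRING_SPLIT(d.[bID], ';') AS newId
    WHERE newId.value IN ('O95833', 'Q96NY7-2')
) AS x
GROUP BY x.ProjectID
ORDER BY x.ProjectID;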

The key advantages of this method are simplicity and efficiency. DISTINCT removes duplicates inside the database, so no extra processing is needed in the application layer. The WITHIN GROUP clause of STRING_AGG fixes the order of the concatenated values (here, ascending by value according to the column's collation), which makes the output deterministic and easier to read. In practical tests, this query returns the target results: for ProjectID 2, "O95833,Q96NY7-2"; for ProjectID 4, "Q96NY7-2".

Technical Details and Performance Considerations

Several technical points deserve attention. First, the WITH (NOLOCK) hint reads data without taking shared locks, which reduces blocking in high-concurrency environments. The trade-off is that the query can return dirty (uncommitted) reads and, while data is being modified, can also miss rows or read them twice, so it should be used only where the business can tolerate such results. Second, STRING_SPLIT is available in SQL Server 2016 and later and requires database compatibility level 130 or higher; it returns a column named value with one row per fragment. The solution as a whole still needs SQL Server 2017 or later, because STRING_AGG was introduced in that version. In the subquery, DISTINCT removes duplicate combinations of ProjectID and value; the optimizer typically implements this with a sort or a hash aggregate, depending on data volume and available indexes.
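
Since STRING_SPLIT depends on the database compatibility level (130 or higher), it can be worth checking that setting before relying on the function. The following is a minimal sketch; [YourDatabase] is a placeholder name, and the ALTER statement is left commented out because changing the level can affect other queries.

```sql
-- Show the compatibility level of the current database.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();

-- If the level is below 130, STRING_SPLIT is not available.
-- Raising it is possible, but test the impact first.
-- [YourDatabase] is a placeholder, not a name from this article.
-- ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 130;
```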

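Regarding performance, STRING_SPLIT must process the bID string of every row it is applied to, so an index cannot speed up the split itself. What an index can do is reduce I/O: a non-clustered index on ProjectID that includes bID lets the query scan a narrower structure than the full table, and it helps further if the query is restricted to specific projects. STRING_AGG can also consume significant memory when aggregating many strings, so monitor server resources on large inputs. Finally, the return type of STRING_AGG follows its input: for VARCHAR(1..8000) or NVARCHAR(1..4000) input the result is limited to 8000 bytes, and exceeding that limit raises an error rather than truncating silently. Casting the input to VARCHAR(MAX) or NVARCHAR(MAX) lifts the limit, so falling back to FOR XML PATH is only necessary on versions before SQL Server 2017.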
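
The two points about indexing and result length can be sketched as follows. The temporary table, its rows, and the index name IX_Data_ProjectID are hypothetical; whether the index actually helps depends on the execution plan and should be measured. The CAST to NVARCHAR(MAX) makes STRING_AGG return a MAX type, so the result is not subject to the 8000-byte limit that applies to non-MAX input.

```sql
-- Hypothetical sample rows standing in for [dbo].[Data].
CREATE TABLE #Data (ProjectID INT NOT NULL, bID VARCHAR(200) NOT NULL);

INSERT INTO #Data (ProjectID, bID)
VALUES (2, 'O95833;Q96NY7-2;P12345'),
       (2, 'Q96NY7-2;O95833'),
       (4, 'Q96NY7-2;P67890');

-- Hypothetical covering index: narrower than the base table for this query.
CREATE NONCLUSTERED INDEX IX_Data_ProjectID
    ON #Data (ProjectID) INCLUDE (bID);

-- CAST to NVARCHAR(MAX) so the aggregate returns NVARCHAR(MAX)
-- instead of a type capped at 8000 bytes.
SELECT x.ProjectID,
       STRING_AGG(CAST(x.value AS NVARCHAR(MAX)), ',')
           WITHIN GROUP (ORDER BY x.value) AS NewField
FROM (
    SELECT DISTINCT d.ProjectID, s.value
    FROM #Data AS d
    CROSS APPLY STRING_SPLIT(d.bID, ';') AS s
    WHERE s.value IN ('O95833', 'Q96NY7-2')
) AS x
GROUP BY x.ProjectID
ORDER BY x.ProjectID;

DROP TABLE #Data;
```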

Alternative Methods and Extended Discussion

Beyond the primary solution, other approaches are possible. One is a custom scalar function that encapsulates deduplication and aggregation for a single string, such as the fn_DistinctWords function shown here. It uses STRING_SPLIT and STRING_AGG internally and suits more general scenarios. Example code:

CREATE OR ALTER FUNCTION [dbo].[fn_DistinctWords]
(
    @String NVARCHAR(MAX)
)
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @Result NVARCHAR(MAX);

    -- Split on spaces, keep each word once, then re-join with spaces.
    -- No WITHIN GROUP clause is used, so the word order is not guaranteed.
    WITH MY_CTE AS (
        SELECT DISTINCT value
        FROM STRING_SPLIT(@String, ' ')
    )
    SELECT @Result = STRING_AGG(value, ' ')
    FROM MY_CTE;

    RETURN @Result;
END

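Usage: SELECT dbo.fn_DistinctWords('One Two Three Two One'); returns the three distinct words One, Two, and Three. Because the function calls STRING_AGG without a WITHIN GROUP clause, their order in the result is not guaranteed (for example, "One Two Three" or "One Three Two"); add WITHIN GROUP (ORDER BY value) if a deterministic order is required. This method offers high reusability, but a scalar function is evaluated once per row, which adds overhead when it is called over large result sets. Thus, for simple deduplicating aggregation, a DISTINCT subquery is generally more efficient.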
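
When the same per-string deduplication is needed for every row of a table, one set-based alternative to calling a scalar function is to inline the logic with CROSS APPLY. The sketch below is an illustration under assumptions, not part of the original solution: the table variable @Phrases and its rows are hypothetical.

```sql
-- Hypothetical input: one space-delimited phrase per row.
DECLARE @Phrases TABLE (Id INT NOT NULL, Phrase NVARCHAR(200) NOT NULL);

INSERT INTO @Phrases (Id, Phrase)
VALUES (1, N'One Two Three Two One'),
       (2, N'alpha beta alpha');

-- For each row: split the phrase, keep distinct words,
-- and re-join them in sorted order.
SELECT p.Id, dedup.DistinctWords
FROM @Phrases AS p
CROSS APPLY (
    SELECT STRING_AGG(w.value, N' ')
               WITHIN GROUP (ORDER BY w.value) AS DistinctWords
    FROM (
        SELECT DISTINCT value
        FROM STRING_SPLIT(p.Phrase, N' ')
    ) AS w
) AS dedup
ORDER BY p.Id;
```

Because of the WITHIN GROUP clause, the words come back in sorted order rather than in their original order, so the first row yields the words One, Three, Two in that sequence.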

Best Practices and Conclusion

In practical applications, choose the solution based on specific needs. For most scenarios, a DISTINCT subquery combined with STRING_AGG is a sound default that balances performance, readability, and maintainability. The key steps are: split the strings with STRING_SPLIT, remove duplicates with DISTINCT, and aggregate in a defined order with STRING_AGG. Additionally, consider index optimization and resource monitoring to enhance query efficiency.

In summary, through this analysis, readers can master the core techniques for unique value string aggregation in SQL Server. From problem identification to solution implementation, each step emphasizes code clarity and performance optimization. As SQL Server versions evolve, more built-in functions may support deduplication aggregation, but the current method suffices for most requirements. Developers should continue learning new technologies and apply them flexibly in real-world business contexts.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
