Deep Analysis of SQL String Aggregation: From Recursive CTE to STRING_AGG Evolution and Practice

Keywords: SQL String Aggregation | Recursive CTE | STRING_AGG Function | XML PATH | Database Performance Optimization

Abstract: This article provides an in-depth exploration of various string aggregation methods in SQL, with focus on recursive CTE applications in SQL Azure environments. Through detailed code examples and performance comparisons, it comprehensively covers the technical evolution from traditional FOR XML PATH to modern STRING_AGG functions, offering complete solutions for string aggregation requirements across different database environments.

Introduction and Problem Context

String aggregation is a common and crucial requirement in database application development. Particularly in reporting, data presentation, and data transformation scenarios, there is often a need to merge string values from multiple rows into single-row displays. This article delves into best practices for string aggregation in SQL Server and Azure SQL environments, based on real-world development challenges.

The fundamental requirement of string aggregation can be described as: combining multiple string values with the same grouping identifier into a complete string using specified delimiters. For example, in user management systems, there is a need to aggregate and display multiple role names for the same user.

Detailed Recursive CTE Solution

In environments where SQL Azure does not support CLR aggregate functions, recursive Common Table Expressions (CTE) provide a reliable string aggregation solution. This approach is implemented through three core steps:

First, partition and number the data using window functions:

;WITH Partitioned AS (
    SELECT 
        ID,
        Name,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Name) AS NameNumber,
        COUNT(*) OVER (PARTITION BY ID) AS NameCount
    FROM dbo.SourceTable
)

This code assigns sequence numbers to rows within each partition using the ROW_NUMBER() function, while using COUNT(*) to calculate the total number of rows per partition, laying the foundation for subsequent recursive processing.

Second, construct recursive CTE for string concatenation:

Concatenated AS (
    SELECT 
        ID, 
        CAST(Name AS nvarchar) AS FullName, 
        Name, 
        NameNumber, 
        NameCount 
    FROM Partitioned 
    WHERE NameNumber = 1

    UNION ALL

    SELECT 
        P.ID, 
        CAST(C.FullName + ', ' + P.Name AS nvarchar), 
        P.Name, 
        P.NameNumber, 
        P.NameCount
    FROM Partitioned AS P
        INNER JOIN Concatenated AS C 
                ON P.ID = C.ID 
                AND P.NameNumber = C.NameNumber + 1
)

The recursive portion connects the current row's string with the already concatenated result through UNION ALL, gradually building the complete aggregated string.

Finally, filter the final results:

SELECT 
    ID,
    FullName
FROM Concatenated
WHERE NameNumber = NameCount

The condition NameNumber = NameCount ensures that only the final concatenation result for each group is output.

Performance Comparison and Optimization Considerations

In actual performance testing, different string aggregation methods show significant differences. The XML PATH-based method processes 1000 rows of data in approximately 4 milliseconds, while the recursive CTE method requires about 54 milliseconds. This performance gap primarily stems from:

The XML PATH method leverages SQL Server's built-in XML processing engine with underlying optimizations, whereas recursive CTE requires multiple iterations and join operations. However, recursive CTE offers advantages in code readability and maintainability, especially when dealing with complex sorting rules.

For large dataset processing, recommendations are: In SQL Server versions below 2017, prioritize XML PATH method if performance is critical; if better code maintainability and Azure compatibility are needed, recursive CTE is a reliable choice.

Modern Solution: STRING_AGG Function

With updates to SQL Server 2017 and Azure SQL Database, the built-in STRING_AGG function was introduced, greatly simplifying string aggregation operations:

SELECT id, STRING_AGG(name, ', ') AS names
FROM some_table
GROUP BY id

This function directly supports group aggregation with concise syntax and excellent performance. In supported environments, this should be the preferred solution.

Cross-Database Compatibility Analysis

Different database systems provide their own string aggregation functions:

PostgreSQL similarly supports the STRING_AGG function, with syntax largely consistent with SQL Server. MySQL uses the GROUP_CONCAT function to achieve similar functionality:

SELECT id, GROUP_CONCAT(name SEPARATOR ', ') AS names
FROM demo_table
GROUP BY id

This cross-database similarity makes application migration and code reuse considerably easier.

Practical Application Scenario Extensions

String aggregation technology has wide-ranging applications in real-world scenarios. In mailing list generation, all recipient names at the same address can be aggregated:

SELECT
    streetnumber,
    streetname,
    STRING_AGG(CONCAT(firstname, ' ', lastname), ', ') AS recipient_names
FROM voters
GROUP BY streetnumber, streetname

This application not only enhances data presentation friendliness but also reduces processing burden at the application layer.

Best Practices Summary

Based on technological evolution and practical requirements, best practice recommendations for string aggregation are as follows: In SQL Server 2017+ or Azure SQL Database environments, prioritize using the STRING_AGG function; in older SQL Server versions, choose between XML PATH method or recursive CTE based on performance requirements; when considering application portability, encapsulate unified aggregation function interfaces.

Regardless of the chosen approach, attention must be paid to string length limitations, delimiter selection, and sorting rule consistency, as these factors directly impact the accuracy and usability of aggregation results.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.