Comprehensive Guide to Row-Level String Aggregation by ID in SQL

Keywords: SQL String Aggregation | XML PATH Method | STRING_AGG Function

Abstract: This technical paper provides an in-depth analysis of techniques for concatenating multiple rows with identical IDs into single string values in SQL Server. By examining both the XML PATH method and STRING_AGG function implementations, the article explains their operational principles, performance characteristics, and appropriate use cases. Using practical data table examples, it demonstrates step-by-step approaches for duplicate removal, order preservation, and query optimization, offering valuable technical references for database developers.

Technical Background and Problem Definition

In database application development, scenarios frequently arise where multiple records sharing the same identifier need to be consolidated into single string values. This requirement is particularly common in report generation, data presentation, and ETL processing. This article uses a typical data table containing two fields: ID and DisplayName. The original data structure is as follows:

ID      DisplayName
1        Editor
1        Reviewer
7        EIC
7        Editor
7        Reviewer
7        Editor
19       EIC
19       Editor
19       Reviewer

The objective output requires merging all DisplayName values for each ID into comma-separated strings while eliminating duplicate entries. For instance, ID 7 contains two "Editor" entries in the source data, but only one should appear in the final result.

Core Solution: XML PATH Method

For SQL Server versions prior to 2017, the most commonly used string aggregation technique involves the FOR XML PATH clause combined with the STUFF function. Although this approach features relatively complex syntax, it offers robust functionality and excellent compatibility.

Basic Implementation Code

The following complete implementation includes test data creation, query execution, and result verification:

-- Create temporary test table
CREATE TABLE #t 
(
    id INT,
    displayname NVARCHAR(MAX)
)

-- Insert test data
INSERT INTO #t VALUES
    (1, 'Editor'),
    (1, 'Reviewer'),
    (7, 'EIC'),
    (7, 'Editor'),
    (7, 'Reviewer'),
    (7, 'Editor'),  -- Duplicate value
    (19, 'EIC'),
    (19, 'Editor'),
    (19, 'Reviewer')

-- Execute string aggregation query
SELECT 
    id, 
    displayname = 
        STUFF(
            (SELECT DISTINCT ', ' + displayname
             FROM #t b 
             WHERE b.id = a.id 
             FOR XML PATH('')), 
            1, 2, ''
        )
FROM #t a
GROUP BY id

Technical Principle Analysis

The core of this solution lies in utilizing the FOR XML PATH('') clause. When an empty string is specified as the PATH parameter, SQL Server converts query results to XML format without adding any XML tags. Combined with subqueries, this generates XML strings containing all DisplayName values for each ID.

The STUFF function removes excess delimiters from the beginning of result strings. Since subqueries generate strings starting with ", ", STUFF(..., 1, 2, '') deletes two characters from position 1 (i.e., ", "), thereby normalizing the format.

The DISTINCT keyword ensures automatic duplicate removal, which is crucial for meeting problem requirements. Simultaneously, the WHERE b.id = a.id condition enables aggregation by ID grouping.

Execution Result Verification

Executing the above query produces the following output:

id  displayname
1   Editor, Reviewer
7   Editor, EIC, Reviewer
19  Editor, EIC, Reviewer

The results demonstrate successful elimination of duplicate DisplayName values, with each ID's values merged into properly formatted comma-separated strings.

Modern Alternative: STRING_AGG Function

Starting with SQL Server 2017, Microsoft introduced the dedicated string aggregation function STRING_AGG, significantly simplifying related operations.

STRING_AGG Implementation Code

SELECT 
    ID, 
    STRING_AGG(DisplayName, ', ') AS DisplayNames
FROM TableName
GROUP BY ID

Technical Comparison Analysis

Compared to the XML PATH method, the STRING_AGG function offers these advantages:

Syntactic Simplicity: Complex operations complete with single function calls, significantly improving code readability.
Performance Optimization: As a native aggregate function, STRING_AGG typically outperforms XML conversion methods in execution efficiency.
Functional Extensibility: Supports WITHIN GROUP clauses for convenient sorting control.

However, note that STRING_AGG does not automatically remove duplicates by default. When duplicate elimination is required, the DISTINCT keyword must be incorporated:

SELECT 
    ID, 
    STRING_AGG(DISTINCT DisplayName, ', ') AS DisplayNames
FROM TableName
GROUP BY ID

Cross-Database Platform Solutions

Different database systems provide their own string aggregation functions, requiring developers to select appropriate methods based on actual database types:

MySQL Solution

MySQL employs the GROUP_CONCAT function for similar functionality:

SELECT 
    id, 
    GROUP_CONCAT(displayname) 
FROM tableName 
GROUP BY id

Feature Comparison Table

<table> <tr><th>Database System</th><th>Aggregate Function</th><th>Duplicate Removal</th><th>Sorting Control</th></tr> <tr><td>SQL Server</td><td>STRING_AGG</td><td>Requires DISTINCT</td><td>WITHIN GROUP</td></tr> <tr><td>MySQL</td><td>GROUP_CONCAT</td><td>DISTINCT parameter</td><td>ORDER BY clause</td></tr> <tr><td>PostgreSQL</td><td>STRING_AGG</td><td>Requires DISTINCT</td><td>No built-in sorting</td></tr>

Performance Optimization Recommendations

When implementing string aggregation techniques in production environments, consider these performance optimization factors:

Index Design Strategy

Creating appropriate indexes on aggregation fields can significantly enhance query performance. For this article's scenario, indexing the ID field is recommended:

CREATE INDEX idx_id ON #t(id)

Large Dataset Processing

When handling substantial data volumes, the XML PATH method may incur considerable memory overhead. In such cases, consider these optimization strategies:

Utilize temporary tables for intermediate results to reduce redundant calculations
Process data in batches to avoid excessive record handling in single queries
Consider cursor or loop processing for extreme cases

Memory Management Considerations

String aggregation operations may consume significant memory, particularly when processing long strings or large record sets. Recommendations include:

Monitor query execution plans to identify potential performance bottlenecks
Appropriately adjust database memory configuration parameters
Consider using disk-based temporary tables instead of memory-based ones

Practical Application Scenario Extensions

String aggregation technology finds multiple applications in actual development. Below are some typical examples:

Report Generation

When generating user permission reports, consolidating multiple user roles for display is common:

SELECT 
    UserID,
    STRING_AGG(RoleName, ', ') AS UserRoles
FROM UserRoles
GROUP BY UserID

Tagging Systems

Article tag aggregation in content management systems:

SELECT 
    ArticleID,
    STRING_AGG(TagName, '; ') AS ArticleTags
FROM ArticleTags
GROUP BY ArticleID

Log Analysis

Consolidating multiple log entries for identical events:

SELECT 
    EventID,
    STRING_AGG(LogMessage, ' | ') AS EventLogs
FROM SystemLogs
WHERE LogDate >= '2024-01-01'
GROUP BY EventID

Summary and Best Practices

This paper provides a detailed exploration of row-level string aggregation techniques by ID in SQL. Through comparative analysis of XML PATH methods and STRING_AGG functions, it offers solutions for different SQL Server versions. Key takeaways include:

Version Compatibility Selection: For SQL Server 2017 and later, prioritize STRING_AGG functions; for earlier versions, XML PATH methods remain reliable choices.
Duplicate Handling: Regardless of method, explicit duplicate value processing ensures data accuracy.
Performance Considerations: With large datasets, appropriate index design and query optimization are crucial.
Cross-Platform Adaptation: Different database systems provide respective aggregate functions; development should consider system portability requirements.

In practical development, selecting the most suitable implementation based on specific business requirements, data scales, and system environments is recommended. Additionally, thorough testing and performance monitoring are essential for ensuring system stability.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.