Keywords: C# | List Merge | Deduplication
Abstract: This article explores several effective methods for merging two List<T> collections and removing duplicate values in C#. It begins by introducing the LINQ Union method, which is the simplest and most efficient approach for most scenarios. The article then delves into how Union works, including its hash-based deduplication mechanism and deferred execution behavior. Using the custom class ResultAnalysisFileSql as an example, it demonstrates how to implement the IEqualityComparer<T> interface for complex types so that Union deduplicates correctly. Additionally, the article compares Union with the Concat method and briefly covers an alternative approach using HashSet<T>. Finally, it provides performance optimization tips and practical considerations to help developers choose the most suitable merging strategy for their needs.
Basic Usage of LINQ Union Method
In C#, the most straightforward way to merge two List<T> collections and remove duplicate values is by using the LINQ Union method. This method is part of the System.Linq namespace and takes two sequences as parameters, returning a new sequence containing all unique elements. Here is a simple example using integer types to illustrate its basic usage:
using System.Collections.Generic;
using System.Linq;

List<int> first_list = new List<int> { 1, 12, 12, 5 };
List<int> second_list = new List<int> { 12, 5, 7, 9, 1 };
List<int> resulting_list = first_list.Union(second_list).ToList();
// resulting_list output: [1, 12, 5, 7, 9]
In this example, first_list contains elements [1, 12, 12, 5], while second_list contains [12, 5, 7, 9, 1]. After calling the Union method, resulting_list includes each distinct element once: elements from the first sequence appear first, in their original order, followed by elements from the second sequence that are not already present. Note that Union uses the default equality comparer (EqualityComparer<T>.Default), which calls the type's Equals and GetHashCode methods. Built-in types such as int and string therefore compare by value, while classes that do not override these methods compare by reference.
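The effect of the default comparer can be seen in a minimal sketch like the one below. It is written with top-level statements (C# 9 or later), and the sample values and variable names are illustrative assumptions, not taken from the original example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// string overrides Equals/GetHashCode, so equal text counts as a duplicate.
var firstNames = new List<string> { "a.sql", "b.sql" };
var secondNames = new List<string> { "b.sql", "c.sql" };
List<string> mergedNames = firstNames.Union(secondNames).ToList();
Console.WriteLine(string.Join(", ", mergedNames)); // a.sql, b.sql, c.sql

// int[] does not override Equals/GetHashCode, so two arrays with the same
// contents are still two different elements as far as Union is concerned.
var firstArrays = new List<int[]> { new[] { 1, 2 } };
var secondArrays = new List<int[]> { new[] { 1, 2 } };
List<int[]> mergedArrays = firstArrays.Union(secondArrays).ToList();
Console.WriteLine(mergedArrays.Count); // 2
```

The second half shows the situation that custom classes run into by default: equal contents, but no duplicate removed.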
How Union Works and Its Characteristics
The core of the Union method lies in its deduplication mechanism. Internally, it uses a hash-based set to track the elements it has already yielded (the concrete type is an implementation detail), guaranteeing that each element appears only once in the result. Execution is deferred, meaning the actual merging and deduplication occur only when the result is enumerated (e.g., by calling ToList() or iterating with foreach). No work is done until the result is actually needed, and the query reads its source lists again each time it is enumerated. The following code snippet demonstrates the deferred execution behavior of Union:
IEnumerable<int> unionResult = first_list.Union(second_list);
// No merging has occurred yet
List<int> finalList = unionResult.ToList(); // Execution happens here
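A slightly longer sketch makes the deferral observable by modifying the source lists after the query has been created. The list contents and variable names here are illustrative assumptions; the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var first = new List<int> { 1, 2 };
var second = new List<int> { 2, 3 };

// Building the query does not enumerate either list.
IEnumerable<int> query = first.Union(second);

// A change made before enumeration is still picked up.
second.Add(4);
List<int> snapshot = query.ToList();
Console.WriteLine(string.Join(", ", snapshot)); // 1, 2, 3, 4

// Each enumeration re-runs the union; the materialized list does not change.
first.Add(5);
Console.WriteLine(query.Count());  // 5
Console.WriteLine(snapshot.Count); // 4
```

If a stable result is needed, materializing it once with ToList() and reusing that list avoids both the repeated work and the surprise of later changes leaking in.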
Additionally, assuming a reasonably well-distributed hash function, Union runs in expected O(n + m) time, where n and m are the lengths of the two input sequences: it traverses both sequences once and performs a constant-time set lookup per element. It also requires extra memory for the set, proportional to the number of distinct elements. Compared to Concat, Union automatically removes duplicates, whereas Concat simply appends the second sequence to the first and retains all elements, including duplicates, which may not be desired in some scenarios.
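The difference between the two methods can be checked with a short sketch that reuses the integer lists from the basic example. The variable names are illustrative, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var first = new List<int> { 1, 12, 12, 5 };
var second = new List<int> { 12, 5, 7, 9, 1 };

// Concat appends the second sequence and keeps every element.
List<int> concatenated = first.Concat(second).ToList();
Console.WriteLine(concatenated.Count); // 9

// Union yields each distinct value once.
List<int> unioned = first.Union(second).ToList();
Console.WriteLine(unioned.Count); // 5

// Concat followed by Distinct produces the same set of values as Union.
List<int> concatDistinct = first.Concat(second).Distinct().ToList();
Console.WriteLine(new HashSet<int>(unioned).SetEquals(concatDistinct)); // True
```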
Handling Custom Classes: Implementing IEqualityComparer<T>
When dealing with custom classes, such as the ResultAnalysisFileSql mentioned in the question, using Union directly may not remove duplicates correctly: unless the class overrides Equals and GetHashCode, the default equality comparer falls back to reference equality, so two instances with identical property values count as different elements. To make Union deduplicate based on specific properties of the objects (e.g., FileSql and PathFileSql), implement the IEqualityComparer<T> interface and pass an instance of the comparer to Union. Here is an example showing how to create a custom comparer for the ResultAnalysisFileSql class:
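public class ResultAnalysisFileSqlComparer : IEqualityComparer<ResultAnalysisFileSql>
{
    public bool Equals(ResultAnalysisFileSql x, ResultAnalysisFileSql y)
    {
        // The same instance (or two nulls) is equal; exactly one null is not.
        if (ReferenceEquals(x, y))
            return true;
        if (x is null || y is null)
            return false;
        return x.FileSql == y.FileSql && x.PathFileSql == y.PathFileSql;
    }
    public int GetHashCode(ResultAnalysisFileSql obj)
    {
        if (obj is null) return 0;
        // Order-sensitive combination; a plain XOR yields 0 whenever both hashes are equal.
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (obj.FileSql?.GetHashCode() ?? 0);
            hash = hash * 31 + (obj.PathFileSql?.GetHashCode() ?? 0);
            return hash;
        }
    }
}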
// Using the custom comparer with Union
List<ResultAnalysisFileSql> list1 = new List<ResultAnalysisFileSql>();
List<ResultAnalysisFileSql> list2 = new List<ResultAnalysisFileSql>();
List<ResultAnalysisFileSql> combinedList = list1.Union(list2, new ResultAnalysisFileSqlComparer()).ToList();
In this implementation, the Equals method compares the FileSql and PathFileSql properties of two objects, while the GetHashCode method generates a hash code from the same properties so that equal objects land in the same hash bucket. Passing this comparer to Union ensures that the merge removes duplicates according to the custom logic. If the class already overrides Equals and GetHashCode (as shown in the question's code), Union without a comparer argument uses those overrides through the default equality comparer. An explicit comparer remains useful when the class cannot be modified or when different call sites need different notions of equality.
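To illustrate the override-based alternative, here is a minimal sketch in which the class itself carries the equality logic, so Union needs no comparer argument. FileEntry is a hypothetical stand-in for ResultAnalysisFileSql, and its two string properties, constructor, and sample values are assumptions made for this example. The code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var listA = new List<FileEntry> { new FileEntry("a.sql", @"C:\scripts") };
var listB = new List<FileEntry>
{
    new FileEntry("a.sql", @"C:\scripts"), // same values as the entry in listA
    new FileEntry("b.sql", @"C:\scripts"),
};

// No comparer argument: Union uses EqualityComparer<FileEntry>.Default,
// which calls the Equals/GetHashCode implementations below.
List<FileEntry> mergedEntries = listA.Union(listB).ToList();
Console.WriteLine(mergedEntries.Count); // 2

// Hypothetical stand-in for ResultAnalysisFileSql.
sealed class FileEntry : IEquatable<FileEntry>
{
    public FileEntry(string fileSql, string pathFileSql)
    {
        FileSql = fileSql;
        PathFileSql = pathFileSql;
    }

    public string FileSql { get; }
    public string PathFileSql { get; }

    public bool Equals(FileEntry other) =>
        other is not null && FileSql == other.FileSql && PathFileSql == other.PathFileSql;

    public override bool Equals(object obj) => Equals(obj as FileEntry);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (FileSql?.GetHashCode() ?? 0);
            hash = hash * 31 + (PathFileSql?.GetHashCode() ?? 0);
            return hash;
        }
    }
}
```

Keeping Equals and GetHashCode based on exactly the same properties matters here: if two objects are equal but hash differently, the hash-based lookup may never compare them and the duplicate survives.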
Alternative Approach: Using HashSet<T> for Merging and Deduplication
Besides the Union method, another common strategy for merging and deduplicating is to use HashSet<T>. This approach relies on the automatic deduplication behavior of HashSet<T> to combine two lists manually. Here is an example:
List<int> first_list = new List<int> { 1, 12, 12, 5 };
List<int> second_list = new List<int> { 12, 5, 7, 9, 1 };
HashSet<int> hashSet = new HashSet<int>(first_list);
hashSet.UnionWith(second_list);
List<int> resulting_list = hashSet.ToList();
// resulting_list contains 1, 12, 5, 7, 9 (HashSet<T> does not guarantee enumeration order)
This method first converts first_list into a HashSet, which automatically removes any duplicates within it (e.g., the two 12s become one). Then, it calls the UnionWith method to add elements from second_list to the set, ignoring any duplicates. Finally, the HashSet is converted back to a list. Compared to the Union method, this approach has similar performance but offers more direct control, especially when multiple merge operations are needed. However, it may be less concise than Union, particularly within LINQ query chains.
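The case of several merge operations might look like the following sketch, where one set accumulates the contents of any number of lists. The third list and the variable names are illustrative assumptions; the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

var batches = new List<List<int>>
{
    new List<int> { 1, 12, 12, 5 },
    new List<int> { 12, 5, 7, 9, 1 },
    new List<int> { 9, 20 },
};

// One set is reused across all merges; UnionWith skips values already present.
var seen = new HashSet<int>();
foreach (List<int> batch in batches)
{
    seen.UnionWith(batch);
}

Console.WriteLine(seen.Count);        // 6
Console.WriteLine(seen.Contains(20)); // True
```

Because the set is mutated in place, each additional list costs time proportional to its own length rather than requiring a new query over everything merged so far.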
Performance Optimization and Practical Application Tips
In practical applications, when choosing a merging method, consider performance, readability, and specific requirements. For most scenarios, the Union method is preferred due to its simplicity and efficiency. Here are some optimization tips:
- If the lists are very large, keep the Union result as a deferred IEnumerable<T> and call ToList() only when a materialized list is actually needed. This avoids allocating a full copy of the result up front, although the internal set of seen elements is still built during enumeration.
- For custom classes, implement IEqualityComparer<T> or override Equals and GetHashCode; otherwise, instances with equal data are treated as distinct and duplicates remain.
- Where element order matters, note that Union yields elements in the order they are first encountered (the first sequence, then the second), whereas HashSet<T> does not guarantee any enumeration order. .NET has no built-in counterpart to Java's LinkedHashSet; to keep insertion order when merging manually, pair a List<T> for the output with a HashSet<T> that tracks seen elements.
- Choose between Union and Concat deliberately: use Concat if deduplication is not required; otherwise, use Union to avoid redundant data.
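For manual merging where order must be preserved, one possible sketch pairs a List<T> for the output with a HashSet<T> for membership checks. MergePreservingOrder is a hypothetical helper name, not a framework method, and the sample values are illustrative. The code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

List<int> merged = MergePreservingOrder(new[] { 3, 1, 3 }, new[] { 2, 1, 4 });
Console.WriteLine(string.Join(", ", merged)); // 3, 1, 2, 4

// Hypothetical helper: keeps the first occurrence of each element, in order.
static List<T> MergePreservingOrder<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var seen = new HashSet<T>();
    var result = new List<T>();
    foreach (T item in first.Concat(second))
    {
        // HashSet<T>.Add returns false when the item is already present.
        if (seen.Add(item))
        {
            result.Add(item);
        }
    }
    return result;
}
```

For two sequences this gives the same result as Union followed by ToList(); the manual form is mainly useful when the loop also needs to do other work per element.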
For example, when handling file analysis results like ResultAnalysisFileSql lists, merging lists from different sources and removing duplicate file entries can ensure data consistency. By combining the methods discussed, developers can flexibly address various merging needs, enhancing code robustness and efficiency.