Keywords: C# | List Deduplication | HashSet | LINQ | Performance Optimization
Abstract: This article provides a comprehensive exploration of techniques for removing duplicate elements from List<T> in C#, with emphasis on HashSet<T> and the LINQ Distinct() method. Through detailed code examples and performance comparisons, it demonstrates the differences in time complexity, memory allocation, and execution efficiency among the approaches, offering practical guidance for choosing the most suitable solution. The article also covers advanced techniques, including custom equality comparers and loop-based iterative algorithms, addressing a wide range of duplicate-handling scenarios.
Introduction
In C# programming practice, handling collections containing duplicate elements is a common requirement. List<T>, as one of the most frequently used generic collections, often requires removal of duplicate items to ensure data uniqueness. Based on high-scoring answers from Stack Overflow and authoritative technical documentation, this article systematically introduces multiple methods for removing duplicate elements and provides in-depth analysis of their respective application scenarios and performance characteristics.
Core Solution Using HashSet<T>
HashSet<T> is specifically designed to store unique elements; its internal hash-table implementation gives amortized O(1) time complexity for element lookup and insertion. When we need to remove duplicate elements from a List<T>, the most direct and effective approach is to leverage this property of HashSet<T>.
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Create original list with duplicate elements
        List<int> originalList = new List<int>() { 1, 2, 3, 2, 4, 3, 5, 1 };

        // Automatically remove duplicates through the HashSet<T> constructor
        HashSet<int> uniqueSet = new HashSet<int>(originalList);

        // Convert back to List<T> form
        List<int> distinctList = new List<int>(uniqueSet);

        Console.WriteLine("Original List:");
        foreach (int num in originalList)
        {
            Console.Write($"{num} ");
        }

        Console.WriteLine("\nDistinct List:");
        foreach (int num in distinctList)
        {
            Console.Write($"{num} ");
        }
    }
}
The above code demonstrates the basic usage of HashSet<T>. When a HashSet<T> is initialized from a List<T>, the constructor automatically filters out all duplicate elements, keeping one instance of each value. Note that HashSet<T> does not guarantee enumeration order, so the order of the resulting list may differ from the original. This method has O(n) time complexity and O(n) space complexity, and performs well on large datasets.
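If the original element order must be preserved, the same HashSet<T> can serve as a membership filter while the result is built in a single pass, since Add returns false for elements that are already present. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;

class OrderPreservingDedup
{
    // Keeps the first occurrence of each element, in the original order.
    static List<T> Deduplicate<T>(List<T> input)
    {
        var seen = new HashSet<T>();
        var result = new List<T>();
        foreach (T item in input)
        {
            // HashSet<T>.Add returns false when the element is already present.
            if (seen.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3, 2, 4, 3, 5, 1 };
        Console.WriteLine(string.Join(" ", Deduplicate(numbers))); // 1 2 3 4 5
    }
}
```

This keeps the O(n) time and O(n) space characteristics of the constructor approach while making the output order deterministic.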
Application of LINQ Distinct() Method
For developers using .NET Framework 3.5 and later versions, LINQ (Language Integrated Query) provides a more declarative programming approach. The Distinct() method is the most commonly used tool for deduplication among LINQ extension methods.
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        List<string> cities = new List<string>
        {
            "Beijing", "Shanghai", "Guangzhou", "Beijing", "Shenzhen", "Shanghai"
        };

        // Remove duplicates using the Distinct() extension method
        List<string> uniqueCities = cities.Distinct().ToList();

        Console.WriteLine("City List After Deduplication:");
        foreach (string city in uniqueCities)
        {
            Console.WriteLine(city);
        }
    }
}
The Distinct() method returns an IEnumerable<T> sequence, which needs to be converted to List<T> using the ToList() method. This approach is syntactically more concise, particularly suitable for use in LINQ query chains. It's important to note that Distinct() uses the element's Equals() and GetHashCode() methods for equality comparison by default.
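When deduplication should be keyed on a single property rather than whole-object equality, .NET 6 added the Enumerable.DistinctBy extension, which removes the need for a hand-written comparer in simple cases. A sketch using a hypothetical Person record (the record type and its values are illustrative, not from the article):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record used only for illustration.
record Person(int Id, string Name);

class Program
{
    static void Main()
    {
        var people = new List<Person>
        {
            new Person(1, "Alice"),
            new Person(2, "Bob"),
            new Person(1, "Alice (duplicate Id)")
        };

        // DistinctBy (.NET 6+) keeps the first element seen for each key.
        List<Person> uniqueById = people.DistinctBy(p => p.Id).ToList();

        foreach (Person p in uniqueById)
        {
            Console.WriteLine($"{p.Id}: {p.Name}");
        }
    }
}
```

On earlier target frameworks, the same effect requires the IEqualityComparer<T> approach described in the next section.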
Implementation of Custom Equality Comparers
When dealing with complex objects or requiring custom equality logic, specific comparison rules can be provided by implementing the IEqualityComparer<T> interface. This is particularly important when processing nested collections or requiring deduplication based on specific object properties.
public class CustomEqualityComparer<T> : IEqualityComparer<T>
{
    public bool Equals(T x, T y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        // Custom equality logic goes here; this default delegates to Equals
        return x.Equals(y);
    }

    public int GetHashCode(T obj)
    {
        if (obj is null) return 0;
        return obj.GetHashCode();
    }
}

// Using the custom comparer
List<CustomObject> objects = GetObjects();
List<CustomObject> distinctObjects = objects
    .Distinct(new CustomEqualityComparer<CustomObject>())
    .ToList();
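In practice, a comparer keyed on a specific property is usually more useful than the generic delegating version above. A sketch assuming a hypothetical Product type deduplicated by its Sku (both the type and the sample data are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type used only for illustration.
class Product
{
    public string Sku { get; set; }
    public string Name { get; set; }
}

class SkuComparer : IEqualityComparer<Product>
{
    public bool Equals(Product x, Product y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.Sku == y.Sku;
    }

    // Must be consistent with Equals: equal Skus yield equal hash codes.
    public int GetHashCode(Product obj) => obj.Sku?.GetHashCode() ?? 0;
}

class Program
{
    static void Main()
    {
        var products = new List<Product>
        {
            new Product { Sku = "A-1", Name = "Widget" },
            new Product { Sku = "B-2", Name = "Gadget" },
            new Product { Sku = "A-1", Name = "Widget (restock)" }
        };

        // Distinct keeps the first Product seen for each Sku.
        foreach (var p in products.Distinct(new SkuComparer()))
        {
            Console.WriteLine($"{p.Sku}: {p.Name}");
        }
    }
}
```

The key contract to respect is that GetHashCode must agree with Equals; otherwise hash-based deduplication will silently miss duplicates.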
Implementation of Traditional Iterative Methods
Although modern C# development more commonly recommends using HashSet or LINQ methods, understanding traditional iterative algorithms still has educational value. The following demonstrates two loop-based deduplication implementations:
// Method 1: Using an auxiliary list (O(n^2) overall, since Contains is O(n))
public static List<T> RemoveDuplicatesWithNewList<T>(List<T> inputList)
{
    List<T> result = new List<T>();
    foreach (T item in inputList)
    {
        if (!result.Contains(item))
        {
            result.Add(item);
        }
    }
    return result;
}

// Method 2: In-place modification via nested loops (O(n^2))
public static void RemoveDuplicatesInPlace<T>(List<T> list)
{
    for (int i = 0; i < list.Count; i++)
    {
        for (int j = i + 1; j < list.Count; j++)
        {
            // The default comparer avoids a NullReferenceException when list[i] is null
            if (EqualityComparer<T>.Default.Equals(list[i], list[j]))
            {
                list.RemoveAt(j);
                j--; // Stay at the same index after the removal shifts elements left
            }
        }
    }
}
Performance Analysis and Comparison
Based on BenchmarkDotNet performance test data, different deduplication methods show significant differences in time and space efficiency:
- HashSet Method: Average execution time 2.4 microseconds, memory allocation 4.16KB, optimal comprehensive performance
- LINQ Distinct(): Average execution time 2.4 microseconds, memory allocation 4.22KB, comparable to HashSet
- Auxiliary List Method: Average execution time 1.2 microseconds, memory allocation 72B, best performance on small datasets
- Dictionary Method: Average execution time 1.3 microseconds, memory allocation 288B, suitable for scenarios requiring additional metadata
- Iterative Methods: Average execution time over 9 microseconds, O(n²) time complexity, only suitable for small datasets
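The benchmark list above mentions a Dictionary-based method without showing it. One common form uses the keys of a Dictionary<TKey, TValue> to enforce uniqueness while the values carry extra metadata, such as an occurrence count. A minimal sketch of this pattern:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var input = new List<string> { "a", "b", "a", "c", "b", "a" };

        // Key = element (guaranteed unique), value = how many times it occurred.
        var counts = new Dictionary<string, int>();
        foreach (string item in input)
        {
            counts[item] = counts.TryGetValue(item, out int n) ? n + 1 : 1;
        }

        // The keys are the distinct elements; the counts are free metadata.
        List<string> distinct = counts.Keys.ToList();
        foreach (string key in distinct)
        {
            Console.WriteLine($"{key} occurred {counts[key]} time(s)");
        }
    }
}
```

When no metadata is needed, HashSet<T> is the lighter choice; the Dictionary variant earns its extra allocation only when the per-element value is actually used.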
Deduplication of Complex Data Structures
For nested collections like List<List<T>>, specialized comparers need to be implemented to handle equality judgment of entire sublists:
public class ListEqualityComparer<T> : IEqualityComparer<List<T>>
{
    private readonly IEqualityComparer<T> _itemComparer;

    public ListEqualityComparer(IEqualityComparer<T> itemComparer = null)
    {
        _itemComparer = itemComparer ?? EqualityComparer<T>.Default;
    }

    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        if (x.Count != y.Count) return false;
        return x.SequenceEqual(y, _itemComparer); // requires using System.Linq;
    }

    public int GetHashCode(List<T> list)
    {
        if (list is null) return 0;
        unchecked // Let the hash arithmetic overflow silently
        {
            int hash = 17;
            foreach (T item in list)
            {
                hash = hash * 31 + (item is null ? 0 : _itemComparer.GetHashCode(item));
            }
            return hash;
        }
    }
}
// Usage example (GetNestedData returns a List<List<string>>)
List<List<string>> nestedList = GetNestedData();
List<List<string>> distinctNested = nestedList
    .Distinct(new ListEqualityComparer<string>())
    .ToList();
Best Practice Recommendations
Based on performance testing and practical application experience, the following recommendations are proposed:
- Small Datasets: Prioritize auxiliary list method with minimal memory overhead
- Medium to Large Datasets: Use HashSet or LINQ Distinct() for stable and reliable performance
- Complex Object Deduplication: Implement custom IEqualityComparer<T> to ensure correct equality judgment
- Performance-Sensitive Scenarios: Avoid nested loop methods with O(n²) time complexity
- Code Readability: Prefer LINQ methods in team development for clear and understandable syntax
Conclusion
C# provides multiple methods for removing duplicate elements from List<T>, each with specific application scenarios and performance characteristics. HashSet<T> and LINQ Distinct() are the most commonly used and performance-optimal choices in modern C# development, while traditional iterative methods still have value in certain specific scenarios. Developers should choose appropriate deduplication strategies based on specific data scale, performance requirements, and code maintainability. By deeply understanding the internal mechanisms and performance characteristics of these methods, more efficient and robust C# applications can be developed.