Keywords: LINQ | Distinct | Property_Distinct | C# | Extension_Methods
Abstract: This article provides an in-depth exploration of implementing distinct operations based on one or more object properties in C# LINQ. By analyzing the limitations of the default Distinct() method, it details two primary solutions: combining the GroupBy operator with First, and writing a custom DistinctBy extension method. The article includes concrete code examples, explains the use of anonymous types in multi-property distinct operations, and discusses the implementation principles of custom comparers. Practical recommendations on performance considerations and EF Core compatibility issues in different scenarios are also provided to help developers handle complex data deduplication requirements effectively.
Introduction
In practical applications of Language Integrated Query (LINQ) in C#, developers frequently need to perform distinct operations based on specific properties of objects. While LINQ provides the standard Distinct() method, its default implementation relies on overall object equality comparison, which often fails to meet the requirements for distinct operations based on specific properties when dealing with complex objects. This article provides multiple practical solutions through comprehensive analysis.
Problem Context and Challenges
Consider a typical scenario: we have a list of Person objects, each containing Id and Name properties. When multiple objects share the same Id value, the standard Distinct() method cannot perform deduplication based on the Id property because it compares either reference equality or value equality of the entire object.
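To see why, recall that Distinct() falls back on EqualityComparer&lt;T&gt;.Default, which for a class that does not override Equals and GetHashCode means reference equality. The following minimal sketch (the Box class is purely illustrative) contrasts value-based and reference-based comparison:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Box // hypothetical class with no Equals/GetHashCode override
{
    public int Value { get; set; }
}

public static class EqualityDemo
{
    public static void Main()
    {
        // Strings compare by content, so Distinct() deduplicates as expected.
        var words = new List<string> { "a", "a", "b" };
        Console.WriteLine(words.Distinct().Count()); // 2

        // Two Box instances with the same Value are distinct references,
        // so EqualityComparer<Box>.Default treats them as different.
        var boxes = new List<Box> { new Box { Value = 1 }, new Box { Value = 1 } };
        Console.WriteLine(boxes.Distinct().Count()); // 2, no deduplication
    }
}
```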
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

List<Person> people = new List<Person>
{
    new Person { Id = 1, Name = "Test1" },
    new Person { Id = 1, Name = "Test1" },
    new Person { Id = 2, Name = "Test2" }
};

// Standard Distinct() cannot achieve deduplication based on Id
var result = people.Distinct(); // Still returns 3 elements
Solution One: GroupBy and First Combination
The first solution utilizes LINQ's GroupBy operator to group elements by specified properties, then selects the first element from each group as the representative.
// Distinct by single property
List<Person> distinctPeople = people
    .GroupBy(p => p.Id)
    .Select(g => g.First())
    .ToList();

// Distinct by multiple properties
List<Person> distinctPeopleMulti = people
    .GroupBy(p => new { p.Id, p.Name })
    .Select(g => g.First())
    .ToList();
The advantage of this approach lies in its simplicity and readability. Through anonymous types, composite-key grouping based on multiple properties can be achieved easily. It's important to note that with some query providers (such as earlier versions of Entity Framework Core), FirstOrDefault() may be needed in place of First() for the query to execute properly.
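One detail worth knowing: with LINQ to Objects, GroupBy yields groups in the order their keys are first encountered, and First() takes the earliest element of each group, so the original occurrence order determines which duplicate survives. A small self-contained check (class and sample data are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class OrderDemo
{
    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, Name = "First" },
            new Person { Id = 2, Name = "Other" },
            new Person { Id = 1, Name = "Second" } // duplicate Id, seen later
        };

        // GroupBy preserves key encounter order; First() keeps the
        // earliest element of each group.
        var distinct = people
            .GroupBy(p => p.Id)
            .Select(g => g.First())
            .ToList();

        Console.WriteLine(distinct.Count);   // 2
        Console.WriteLine(distinct[0].Name); // First
    }
}
```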
Solution Two: Custom DistinctBy Extension Method
The second solution provides a more elegant and efficient approach by creating a custom DistinctBy extension method. The core idea of this method is to maintain a hash set of seen keys, using a key selector function to determine element uniqueness.
// Note: extension methods must be declared inside a non-generic static class.
public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));

    HashSet<TKey> seenKeys = new HashSet<TKey>();
    foreach (TSource element in source)
    {
        if (seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}
Usage examples:
// By single property
var distinctById = people.DistinctBy(p => p.Id);

// By multiple properties
var distinctByMultiple = people.DistinctBy(p => new { p.Id, p.Name });
Advanced Features and Custom Comparison
To handle more complex comparison requirements, the DistinctBy method can be extended to support custom equality comparers:
public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));

    // A null comparer falls back to EqualityComparer<TKey>.Default.
    HashSet<TKey> seenKeys = new HashSet<TKey>(comparer);
    foreach (TSource element in source)
    {
        if (seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}
Performance Analysis and Comparison
From a performance perspective, the DistinctBy method generally outperforms the GroupBy approach because:
- DistinctBy uses a single HashSet for deduplication, giving near O(n) time complexity.
- GroupBy must build a grouping dictionary and then iterate through each group, resulting in higher memory overhead.
- DistinctBy demonstrates better memory efficiency on large datasets.
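These claims can be checked with a rough micro-benchmark sketch; absolute Stopwatch timings depend on runtime and hardware, so they are indicative only. The extension below is the same algorithm as the article's DistinctBy, renamed DistinctByKey to avoid clashing with the built-in Enumerable.DistinctBy on .NET 6+:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class Benchmark
{
    // Same algorithm as the article's DistinctBy; renamed so it cannot be
    // ambiguous with System.Linq's Enumerable.DistinctBy on .NET 6+.
    public static IEnumerable<TSource> DistinctByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        var seenKeys = new HashSet<TKey>();
        foreach (var element in source)
        {
            if (seenKeys.Add(keySelector(element)))
            {
                yield return element;
            }
        }
    }

    public static void Main()
    {
        // 1,000,000 items with 10,000 unique keys.
        var data = Enumerable.Range(0, 1_000_000).Select(i => i % 10_000).ToList();

        var sw = Stopwatch.StartNew();
        int a = data.DistinctByKey(x => x).Count();
        sw.Stop();
        Console.WriteLine($"DistinctByKey: {a} unique in {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        int b = data.GroupBy(x => x).Select(g => g.First()).Count();
        sw.Stop();
        Console.WriteLine($"GroupBy+First: {b} unique in {sw.ElapsedMilliseconds} ms");
    }
}
```

Both approaches should report the same number of unique items; the difference shows up in elapsed time and allocations.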
Practical Application Scenarios
In real-world development, these techniques can be applied to various scenarios:
// Database query result deduplication
// (note: a custom IEnumerable extension runs client-side, not in SQL)
var uniqueCustomers = dbContext.Customers
    .DistinctBy(c => c.Email);

// Log data deduplication
var uniqueErrors = errorLogs
    .DistinctBy(e => new { e.ErrorCode, e.Timestamp.Date });

// Product catalog processing
var uniqueProducts = productList
    .DistinctBy(p => p.SKU);
Compatibility Considerations
It's worth noting that the MoreLINQ library has long shipped a well-tested DistinctBy implementation, and since .NET 6 an equivalent Enumerable.DistinctBy is built into System.Linq. For production code on earlier runtimes, referencing MoreLINQ is recommended rather than maintaining a hand-rolled version. Additionally, while Entity Framework Core 6 and later versions provide better support for certain query patterns, DistinctBy is generally not translated to SQL, so developers should still be aware of query provider limitations in complex scenarios.
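On .NET 6 or later, no custom extension is needed at all for in-memory collections; the built-in Enumerable.DistinctBy (which also has an overload taking an IEqualityComparer&lt;TKey&gt;) covers the same cases:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class BuiltInDemo
{
    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, Name = "Test1" },
            new Person { Id = 1, Name = "Test1" },
            new Person { Id = 2, Name = "Test2" }
        };

        // Built into System.Linq since .NET 6.
        var unique = people.DistinctBy(p => p.Id).ToList();
        Console.WriteLine(unique.Count); // 2
    }
}
```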
Conclusion
Through the comprehensive analysis presented in this article, we have demonstrated multiple effective methods for performing distinct operations based on object properties in C# LINQ. Whether using the GroupBy combination or custom DistinctBy extension methods, developers can choose the most suitable solution based on specific requirements. These techniques not only enhance code readability and maintainability but also provide powerful flexibility when handling complex data deduplication scenarios.