Comprehensive Analysis of Unique Value Extraction from Arrays in VBA

Abstract: This technical paper provides an in-depth examination of various methods for extracting unique values from one-dimensional arrays in VBA. The study begins with the classical Collection object approach, utilizing error handling mechanisms for automatic duplicate filtering. Subsequently, it analyzes the Dictionary method implementation and its performance advantages for small to medium-sized datasets. The paper further explores efficient algorithms based on sorting and indexing, including two-dimensional array sorting deduplication and Boolean indexing methods, with particular emphasis on ultra-fast solutions for integer arrays. Through systematic performance benchmarking, the execution efficiency of different methods across various data scales is compared, providing comprehensive technical selection guidance for developers. The article combines specific code examples and performance data to help readers choose the most appropriate deduplication strategy based on practical application scenarios.

Overview of VBA Array Deduplication Techniques

In VBA programming, extracting unique values from an array (that is, removing its duplicate elements) is a common requirement. VBA has no built-in array deduplication function, but the need can be met with several programming techniques. This paper introduces the main implementation methods and analyzes their principles, performance characteristics, and applicable scenarios.

Deduplication Using Collection Objects

The Collection object is one of the most commonly used tools for array deduplication in VBA. The technique relies on the rule that every key in a Collection must be unique: adding an item under a key that already exists raises a run-time error, and error handling is used to skip those duplicates.

Sub GetUniqueValues()
    Dim col As New Collection, element
    Dim originalArray() As Variant
    Dim i As Long
    
    originalArray() = Array("Banana", "Apple", "Orange", "Tomato", "Apple", _
    "Lemon", "Lime", "Lime", "Apple")
    
    On Error Resume Next
    For Each element In originalArray
        col.Add element, CStr(element)
    Next
    On Error GoTo 0
    
    For i = 1 To col.Count
        Cells(i, 1) = col(i)
    Next
End Sub

The key to this method is the On Error Resume Next statement. When the code attempts to add an item whose key already exists, VBA raises run-time error 457, and On Error Resume Next suppresses it so that the loop continues with the next element. Collection keys are strings that are compared without regard to case, so values such as "apple" and "Apple" are treated as duplicates. The code is concise and easy to follow, and it suits small to medium-sized datasets.
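The same idea can be wrapped in a reusable function that returns an array instead of writing to the worksheet. The following is a minimal sketch, not part of the original listing; the name UniqueViaCollection and its signature are assumptions made for illustration.

```vba
' Hypothetical helper: returns the distinct values of a one-dimensional
' array in first-seen order, as a zero-based Variant array.
Function UniqueViaCollection(values As Variant) As Variant
    Dim seen As New Collection
    Dim element As Variant
    Dim result() As Variant
    Dim i As Long

    On Error Resume Next              ' a duplicate key raises error 457, which is skipped
    For Each element In values
        seen.Add element, CStr(element)
    Next element
    On Error GoTo 0

    If seen.Count = 0 Then
        UniqueViaCollection = Array() ' empty input gives an empty array
        Exit Function
    End If

    ' Copy the one-based Collection into a zero-based array
    ReDim result(0 To seen.Count - 1)
    For i = 1 To seen.Count
        result(i - 1) = seen(i)
    Next i
    UniqueViaCollection = result
End Function
```

For example, Debug.Print Join(UniqueViaCollection(Array("Banana", "Apple", "Banana", "Lime", "apple")), ", ") prints Banana, Apple, Lime, because "apple" collides with the existing key "Apple". Note that On Error Resume Next also hides unrelated errors inside the loop, such as a Null element that CStr cannot convert, so such elements are silently dropped in this sketch.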

Dictionary Method Implementation

Scripting.Dictionary, part of the Microsoft Scripting Runtime on Windows, provides another efficient array deduplication solution. The Dictionary object stores key-value pairs, and assigning to a key that already exists overwrites its item instead of raising an error, so duplicates collapse without any error handling.

Sub DictionaryDeduplication()
    Dim dict As Object
    Set dict = CreateObject("Scripting.Dictionary")
    
    Dim originalArray() As Variant
    Dim i As Long
    
    originalArray = Array("A", "B", "C", "A", "B", "D")
    
    For i = LBound(originalArray) To UBound(originalArray)
        dict(originalArray(i)) = 1
    Next i
    
    Dim uniqueArray As Variant
    uniqueArray = dict.Keys()
    
    For i = LBound(uniqueArray) To UBound(uniqueArray)
        Debug.Print uniqueArray(i)
    Next i
End Sub

The advantage of the Dictionary method lies in its fast key lookup, which performs well with string and Variant data. The dict.Keys method returns a zero-based array containing all unique values in a single call. Unlike Collection keys, Dictionary keys are compared case-sensitively by default, so "a" and "A" count as different values.
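Where case-insensitive matching is wanted, the Dictionary's CompareMode property can be changed before any key is added. The sketch below is an illustration under that assumption; the function name UniqueViaDictionary and the ignoreCase parameter are hypothetical and do not come from the original listing.

```vba
' Hypothetical helper: returns the distinct values of a one-dimensional
' array as a zero-based Variant array, optionally ignoring letter case.
Function UniqueViaDictionary(values As Variant, _
                             Optional ignoreCase As Boolean = False) As Variant
    Dim dict As Object
    Dim i As Long

    Set dict = CreateObject("Scripting.Dictionary")

    ' CompareMode can only be changed while the dictionary is empty.
    ' The default is binary (case-sensitive) comparison.
    If ignoreCase Then dict.CompareMode = vbTextCompare

    For i = LBound(values) To UBound(values)
        ' Add each value once; the stored item is irrelevant here
        If Not dict.Exists(values(i)) Then dict.Add values(i), Empty
    Next i

    UniqueViaDictionary = dict.Keys
End Function
```

With this sketch, UniqueViaDictionary(Array("a", "A", "b")) yields three values, while UniqueViaDictionary(Array("a", "A", "b"), True) yields only "a" and "b".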

Advanced Sorting Deduplication Algorithm

For large-scale data processing, sorting-based deduplication algorithms provide better performance. Sorting places identical elements next to one another, so duplicates can then be removed in a single pass over the sorted data.

Function SortingDeduplication(ByRef inputArray() As Long, preserveOrder As Boolean) As Variant
    ' QuickSort2D and RemoveDuplicates are helper routines that are not listed here
    Dim tempArray() As Long, i As Long, indexOffset As Long
    Dim lowerBound As Long, upperBound As Long
    
    lowerBound = LBound(inputArray): upperBound = UBound(inputArray)
    
    ' Row 1 holds the values, row 2 their one-based original positions
    ReDim tempArray(1 To 2, lowerBound To upperBound)
    indexOffset = 1 - lowerBound
    
    For i = lowerBound To upperBound
        tempArray(1, i) = inputArray(i)
        tempArray(2, i) = i + indexOffset
    Next i
    
    ' Sort by value (row 1) so that equal values become adjacent
    QuickSort2D tempArray, 1, LBound(tempArray, 2), UBound(tempArray, 2), 2
    RemoveDuplicates tempArray
    
    ' Optionally sort by original position (row 2) to restore the input order
    If preserveOrder Then
        QuickSort2D tempArray, 2, LBound(tempArray, 2), UBound(tempArray, 2), 2
    End If
    
    SortingDeduplication = tempArray()
End Function

This algorithm first creates a two-dimensional array that stores each original value together with its index, then sorts by value, removes duplicates, and finally reorders by original index if the caller requests it. The QuickSort2D and RemoveDuplicates routines it calls are helpers whose code is not listed in this paper. This method demonstrates significant performance advantages when processing large-scale numerical data.
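Because the RemoveDuplicates helper is not shown, the duplicate-removal step can be illustrated on a simpler case. The following sketch works on a one-dimensional Long array that is already sorted in ascending order; the name CompactSortedRuns is hypothetical, and this is not the author's RemoveDuplicates routine.

```vba
' Hypothetical helper, not the RemoveDuplicates routine called above.
' Expects sortedValues to be non-empty and sorted in ascending order.
' Moves the distinct values to the front of the array and returns how
' many there are; elements after that position are left over and unused.
Function CompactSortedRuns(ByRef sortedValues() As Long) As Long
    Dim readPos As Long, writePos As Long

    writePos = LBound(sortedValues)
    For readPos = LBound(sortedValues) + 1 To UBound(sortedValues)
        ' A value that differs from the last kept value starts a new run
        If sortedValues(readPos) <> sortedValues(writePos) Then
            writePos = writePos + 1
            sortedValues(writePos) = sortedValues(readPos)
        End If
    Next readPos

    CompactSortedRuns = writePos - LBound(sortedValues) + 1
End Function
```

For the sorted input 1, 1, 2, 3, 3, 3, 7 the function returns 4 and leaves 1, 2, 3, 7 in the first four positions. Since equal values are adjacent after sorting, one comparison per element is enough.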

Boolean Indexing Method for Integer Arrays

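When processing integer arrays, the Boolean indexing method is the fastest of the approaches discussed here. It uses each value, shifted by an offset, directly as an index into a Boolean array, so the marking and collection passes take time proportional to the number of elements plus the size of the value range, with no sort of the full input. When the original order must be preserved, only the unique values are sorted afterwards.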

Function BooleanIndexDeduplication(ByRef inputArray() As Long, preserveOrder As Boolean) As Variant
    Dim valueArray() As Long, sortedArray() As Long, boolArray() As Boolean
    Dim i As Long, upperBound As Long, maxValue As Long, minValue As Long, offset As Long
    Dim lowerBound As Long, currentIndex As Long, uniqueCount As Long, valueRange As Long
    Dim elementCount As Long
    
    upperBound = UBound(inputArray)
    lowerBound = LBound(inputArray)
    elementCount = upperBound - lowerBound + 1
    
    ' maxValue and minValue start at 0, so the mapped range always includes zero
    For i = lowerBound To upperBound
        If inputArray(i) > maxValue Then maxValue = inputArray(i)
        If inputArray(i) < minValue Then minValue = inputArray(i)
    Next i
    
    ' Shift by |minValue| so that negative values map to non-negative indices
    offset = Abs(minValue)
    
    If maxValue > 0 Then
        valueRange = maxValue + offset
    Else
        valueRange = offset
    End If
    
    If preserveOrder Then
        ' Size the result by element count so that zero-based inputs also fit
        ReDim sortedArray(1 To 2, 1 To elementCount)
        ReDim valueArray(1 To 2, 0 To valueRange)
        ReDim boolArray(0 To valueRange)
        
        For i = lowerBound To upperBound
            currentIndex = inputArray(i) + offset
            ' Record the value and its position on first occurrence only
            If Not boolArray(currentIndex) Then
                boolArray(currentIndex) = True
                valueArray(1, currentIndex) = inputArray(i)
                valueArray(2, currentIndex) = i
            End If
        Next i
        
        For i = 0 To valueRange
            If boolArray(i) Then
                uniqueCount = uniqueCount + 1
                sortedArray(1, uniqueCount) = valueArray(1, i)
                sortedArray(2, uniqueCount) = valueArray(2, i)
            End If
        Next i
        
        ' QuickSort2D (helper not listed here) sorts by row 2, the first-occurrence index
        QuickSort2D sortedArray, 2, 1, uniqueCount, 2
        ReDim Preserve sortedArray(1 To 2, 1 To uniqueCount)
    Else
        ReDim sortedArray(1 To elementCount)
        ReDim valueArray(0 To valueRange)
        ReDim boolArray(0 To valueRange)
        
        For i = lowerBound To upperBound
            currentIndex = inputArray(i) + offset
            boolArray(currentIndex) = True
            valueArray(currentIndex) = inputArray(i)
        Next i
        
        ' Walking the flags in index order yields the unique values in ascending order
        For i = 0 To valueRange
            If boolArray(i) Then
                uniqueCount = uniqueCount + 1
                sortedArray(uniqueCount) = valueArray(i)
            End If
        Next i
        
        ReDim Preserve sortedArray(1 To uniqueCount)
    End If
    
    BooleanIndexDeduplication = sortedArray
End Function
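The core of the technique, one Boolean flag per possible value, can be seen more easily in a reduced form. The sketch below only counts distinct values; the name CountDistinctLongs is hypothetical. As a variation on the function above, it seeds the minimum and maximum from the first element instead of zero, so the flag array covers only the span actually present in the data.

```vba
' Hypothetical helper: counts the distinct values in a non-empty Long array
' using one Boolean flag per possible value between the minimum and maximum.
Function CountDistinctLongs(ByRef values() As Long) As Long
    Dim flags() As Boolean
    Dim i As Long, minValue As Long, maxValue As Long, distinctCount As Long

    ' Seed both bounds from the first element
    minValue = values(LBound(values))
    maxValue = minValue
    For i = LBound(values) To UBound(values)
        If values(i) < minValue Then minValue = values(i)
        If values(i) > maxValue Then maxValue = values(i)
    Next i

    ' Assumes the span fits in a Long and in available memory
    ReDim flags(0 To maxValue - minValue)

    For i = LBound(values) To UBound(values)
        ' values(i) - minValue maps the smallest value to slot 0
        If Not flags(values(i) - minValue) Then
            flags(values(i) - minValue) = True
            distinctCount = distinctCount + 1
        End If
    Next i

    CountDistinctLongs = distinctCount
End Function
```

For the input 5, -2, 5, 3, -2 the minimum is -2 and the maximum is 5, so eight flags are allocated, the values map to slots 7, 0, 7, 3 and 0, and the function returns 3.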

Performance Comparison and Selection Recommendations

Through systematic performance benchmarking, different methods demonstrate their respective advantages in various scenarios:

Collection Method: Suitable for general scenarios, with concise code and good performance on non-integer data, but it may become a bottleneck with very large datasets.

Dictionary Method: Excellent performance with fewer than 500,000 elements, particularly suitable for string and Variant data, with good generality.

Sorting Deduplication Algorithm: Suitable for large-scale numerical data, with significant advantages when processing millions of elements.

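Boolean Indexing Method: The fastest option for integer arrays, at the cost of additional memory proportional to the range of values in the data.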
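Comparisons like those summarized above can be reproduced with a simple timing harness. The following is a minimal sketch that times only the Dictionary method; the procedure name, the test size of 100,000 elements, and the value range 0 to 999 are assumptions chosen for illustration, not the settings behind the figures quoted in this paper.

```vba
' Hypothetical harness: times one Dictionary pass over n pseudo-random Longs.
Sub TimeDictionaryDeduplication()
    Const n As Long = 100000          ' assumed test size
    Dim data() As Long, i As Long
    Dim dict As Object, started As Double

    ReDim data(1 To n)
    Rnd -1                            ' together with Randomize, makes the sequence repeatable
    Randomize 42
    For i = 1 To n
        data(i) = Int(Rnd * 1000)     ' values 0..999, so duplicates are frequent
    Next i

    Set dict = CreateObject("Scripting.Dictionary")
    started = Timer                   ' seconds elapsed since midnight
    For i = 1 To n
        dict(data(i)) = Empty
    Next i

    Debug.Print "Unique values: " & dict.Count & _
                ", elapsed: " & Format(Timer - started, "0.000") & " s"
End Sub
```

Timer resets at midnight, so a run that spans midnight reports a negative duration. To compare methods fairly, each one should be timed on the same data and run several times.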

Practical Application Considerations

When selecting a deduplication method, consider the following factors together:

Data Scale: For small-scale data (fewer than 100,000 elements), Collection or Dictionary can be used; for large-scale data, sorting or indexing methods should be considered.

Data Type: For integer data, prefer the Boolean indexing method; for non-integer data, use Collection or Dictionary.

Memory Constraints: The Boolean indexing method allocates arrays that span the whole value range, so memory usage grows with the distance between the smallest and largest values, not with the number of elements.

Order Requirements: If the original order must be maintained, choose a method that preserves it. The Collection method returns values in first-seen order, and the sorting and Boolean indexing functions accept a preserveOrder argument.
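The memory consideration above can be checked before committing to Boolean indexing. The sketch below estimates how many flag slots the value range would require; the name FlagSlotsNeeded is hypothetical, and any limit compared against its result is a choice the caller has to make.

```vba
' Hypothetical helper: number of Boolean flags needed to cover every value
' between the minimum and maximum of a non-empty Long array.
' Returned as Double so that a very wide range cannot overflow a Long.
Function FlagSlotsNeeded(ByRef values() As Long) As Double
    Dim i As Long, minValue As Long, maxValue As Long

    minValue = values(LBound(values))
    maxValue = minValue
    For i = LBound(values) To UBound(values)
        If values(i) < minValue Then minValue = values(i)
        If values(i) > maxValue Then maxValue = values(i)
    Next i

    ' Convert before subtracting to avoid Long overflow
    FlagSlotsNeeded = CDbl(maxValue) - CDbl(minValue) + 1
End Function
```

A VBA Boolean occupies two bytes, so an array holding just the two values 1 and 2,000,000,000 would need roughly 4 GB of flags. In such a case a Dictionary is the more practical choice even though the data is integer.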

By understanding the principles and performance characteristics of various methods, developers can select the most appropriate array deduplication solution based on specific application scenarios, finding the optimal balance between code simplicity, execution efficiency, and memory usage.