Keywords: VBA | Array Deduplication | Unique Values | Collection | Dictionary | Performance Optimization | Algorithm Comparison
Abstract: This technical paper provides an in-depth examination of various methods for extracting unique values from one-dimensional arrays in VBA. The study begins with the classical Collection object approach, utilizing error handling mechanisms for automatic duplicate filtering. Subsequently, it analyzes the Dictionary method implementation and its performance advantages for small to medium-sized datasets. The paper further explores efficient algorithms based on sorting and indexing, including two-dimensional array sorting deduplication and Boolean indexing methods, with particular emphasis on ultra-fast solutions for integer arrays. Through systematic performance benchmarking, the execution efficiency of different methods across various data scales is compared, providing comprehensive technical selection guidance for developers. The article combines specific code examples and performance data to help readers choose the most appropriate deduplication strategy based on practical application scenarios.
Overview of VBA Array Deduplication Techniques
In VBA programming, extracting unique values or removing duplicate elements from an array is a common requirement. VBA provides no built-in array deduplication function, but the need can be met with several programming techniques. This paper introduces the main implementation methods and analyzes their principles, performance characteristics, and applicable scenarios.
Deduplication Using Collection Objects
The Collection object offers one of the most common ways to deduplicate an array in VBA. It relies on the fact that Collection keys must be unique: adding an item under an existing key raises a run-time error, and suppressing that error filters out duplicates automatically.
Sub GetUniqueValues()
    Dim col As New Collection, element As Variant
    Dim originalArray() As Variant
    Dim i As Long

    originalArray = Array("Banana", "Apple", "Orange", "Tomato", "Apple", _
                          "Lemon", "Lime", "Lime", "Apple")

    ' Adding an existing key raises an error; Resume Next skips that item.
    On Error Resume Next
    For Each element In originalArray
        col.Add element, CStr(element)
    Next element
    On Error GoTo 0

    ' Write the unique values to column A of the active sheet.
    For i = 1 To col.Count
        Cells(i, 1).Value = col(i)
    Next i
End Sub
The key to this method is the On Error Resume Next statement. Adding an item under a key that already exists raises run-time error 457 ("This key is already associated with an element of this collection"); with Resume Next in effect, VBA skips the failed Add and continues with the next element. Collection keys are compared case-insensitively, so "Apple" and "apple" count as the same value. The code is concise and easy to follow, and it suits small to medium-sized datasets.
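The same idea can be packaged as a function that returns an array instead of writing to the worksheet. The following is a minimal sketch, not part of the original listing: the name UniqueViaCollection is hypothetical, and it assumes every element can be converted with CStr (no objects, Null, or error values). Because the key is the string form of each element, the number 1 and the string "1" are treated as duplicates.

```vba
' Hypothetical helper: returns the unique elements of a one-dimensional array
' as a zero-based Variant array, in order of first occurrence.
Public Function UniqueViaCollection(ByVal values As Variant) As Variant
    Dim col As New Collection
    Dim element As Variant
    Dim result() As Variant
    Dim i As Long

    ' A duplicate key makes Add fail; Resume Next moves on to the next element.
    On Error Resume Next
    For Each element In values
        col.Add element, CStr(element)
    Next element
    On Error GoTo 0

    ' Nothing collected: return an empty Variant array.
    If col.Count = 0 Then
        UniqueViaCollection = Array()
        Exit Function
    End If

    ' Collections are 1-based; copy into a 0-based result array.
    ReDim result(0 To col.Count - 1)
    For i = 1 To col.Count
        result(i - 1) = col(i)
    Next i
    UniqueViaCollection = result
End Function

Sub DemoUniqueViaCollection()
    Dim u As Variant
    u = UniqueViaCollection(Array("Banana", "Apple", "apple", "Banana"))
    Debug.Print Join(u, ", ")   ' prints: Banana, Apple
End Sub
```

In the demo, "apple" is dropped because its key collides with "Apple" under the Collection's case-insensitive key comparison.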
Dictionary Method Implementation
Scripting.Dictionary offers another efficient way to deduplicate an array. It stores key-value pairs with unique keys, and assigning to an existing key through the default Item property (as in dict(key) = 1) overwrites the entry instead of raising an error, so no error handling is needed.
Sub DictionaryDeduplication()
    Dim dict As Object
    Set dict = CreateObject("Scripting.Dictionary")
    Dim originalArray() As Variant
    Dim i As Long

    originalArray = Array("A", "B", "C", "A", "B", "D")

    ' Assigning to a key creates it if missing and overwrites it otherwise.
    For i = LBound(originalArray) To UBound(originalArray)
        dict(originalArray(i)) = 1
    Next i

    ' Keys returns the unique values as a zero-based array.
    Dim uniqueArray As Variant
    uniqueArray = dict.Keys()
    For i = LBound(uniqueArray) To UBound(uniqueArray)
        Debug.Print uniqueArray(i)
    Next i
End Sub
The advantage of the Dictionary method lies in its hash-based key lookup, which performs well with string and Variant data. The dict.Keys() method returns all unique values as a zero-based array in a single call, which keeps the code simple and efficient.
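One detail worth illustrating is key comparison. A Dictionary compares keys in binary mode by default, so "A" and "a" are distinct keys, and its CompareMode property can only be changed while the dictionary is empty. The sketch below is an illustration under those assumptions; the function name UniqueViaDictionary and its ignoreCase parameter are hypothetical. It uses late binding, so it needs the Microsoft Scripting Runtime to be available on the machine (Windows). In practice the keys come back in insertion order.

```vba
' Hypothetical helper: unique elements of a one-dimensional array,
' with optional case-insensitive comparison of string keys.
Public Function UniqueViaDictionary(ByVal values As Variant, _
                                    Optional ByVal ignoreCase As Boolean = False) As Variant
    Dim dict As Object
    Dim i As Long

    Set dict = CreateObject("Scripting.Dictionary")

    ' CompareMode must be set before the first key is added.
    If ignoreCase Then dict.CompareMode = vbTextCompare

    For i = LBound(values) To UBound(values)
        ' Exists avoids the error that Add raises for a duplicate key.
        If Not dict.Exists(values(i)) Then dict.Add values(i), Empty
    Next i

    UniqueViaDictionary = dict.Keys   ' zero-based Variant array
End Function

Sub DemoUniqueViaDictionary()
    Dim data As Variant
    data = Array("A", "a", "B", "A")
    Debug.Print Join(UniqueViaDictionary(data), ", ")         ' prints: A, a, B
    Debug.Print Join(UniqueViaDictionary(data, True), ", ")   ' prints: A, B
End Sub
```

Using Exists with Add instead of plain assignment makes the intent explicit and leaves the first-seen spelling of each key in place when ignoreCase is True.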
Advanced Sorting Deduplication Algorithm
For large-scale data processing, sorting-based deduplication algorithms provide better performance. The core concept of this method is to gather identical elements together through sorting, then remove duplicates.
Function SortingDeduplication(ByRef inputArray() As Long, preserveOrder As Boolean) As Variant
    ' QuickSort2D and RemoveDuplicates are helper procedures that are not shown here.
    Dim tempArray() As Long, i As Long, indexOffset As Long
    Dim lowerBound As Long, upperBound As Long

    lowerBound = LBound(inputArray): upperBound = UBound(inputArray)

    ' Row 1 holds the values, row 2 the original positions (1-based).
    ReDim tempArray(1 To 2, lowerBound To upperBound)
    indexOffset = 1 - lowerBound
    For i = lowerBound To upperBound
        tempArray(1, i) = inputArray(i)
        tempArray(2, i) = i + indexOffset
    Next i

    ' Sort by value (row 1) so that duplicates become adjacent, then drop them.
    QuickSort2D tempArray, 1, LBound(tempArray, 2), UBound(tempArray, 2), 2
    RemoveDuplicates tempArray

    ' Optionally sort by original position (row 2) to restore the input order.
    If preserveOrder Then
        QuickSort2D tempArray, 2, LBound(tempArray, 2), UBound(tempArray, 2), 2
    End If
    SortingDeduplication = tempArray
End Function
This algorithm first creates a two-dimensional array to store original values and indices, then sorts by value, removes duplicates, and finally reorders according to original indices if required. This method demonstrates significant performance advantages when processing large-scale numerical data.
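The listing calls QuickSort2D and RemoveDuplicates, which are not included in this section. To illustrate why sorting makes duplicate removal cheap, the following sketch shows the compaction step on a one-dimensional array that is already sorted in ascending order. It is an assumption-based illustration, not the original RemoveDuplicates: the name CompactSorted is hypothetical, and it expects a non-empty, sorted input.

```vba
' Hypothetical helper: returns the distinct values of an ascending-sorted,
' non-empty array. It makes one pass and never looks further back than one element.
Public Function CompactSorted(ByRef sortedValues() As Long) As Long()
    Dim result() As Long
    Dim i As Long, writePos As Long

    ReDim result(LBound(sortedValues) To UBound(sortedValues))
    writePos = LBound(sortedValues)
    result(writePos) = sortedValues(writePos)

    For i = LBound(sortedValues) + 1 To UBound(sortedValues)
        ' In sorted data a duplicate always equals the last value kept.
        If sortedValues(i) <> result(writePos) Then
            writePos = writePos + 1
            result(writePos) = sortedValues(i)
        End If
    Next i

    ' Trim the unused tail.
    ReDim Preserve result(LBound(sortedValues) To writePos)
    CompactSorted = result
End Function

Sub DemoCompactSorted()
    Dim src As Variant, data() As Long, u() As Long, i As Long

    src = Array(1, 1, 2, 5, 5, 5, 9)          ' already in ascending order
    ReDim data(LBound(src) To UBound(src))
    For i = LBound(src) To UBound(src)
        data(i) = src(i)
    Next i

    u = CompactSorted(data)
    For i = LBound(u) To UBound(u)
        Debug.Print u(i)                      ' prints 1, 2, 5, 9 on separate lines
    Next i
End Sub
```

The compaction itself is linear, so the cost of the whole approach is dominated by the sort.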
Boolean Indexing Method for Integer Arrays
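For integer arrays, the Boolean indexing method is the fastest approach discussed here. It maps each value directly to an array index, so no comparisons or hashing are needed. The running time is O(n + r), where n is the number of elements and r is the size of the value range; it is effectively O(n) as long as the range is not much larger than the array.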
Function BooleanIndexDeduplication(ByRef inputArray() As Long, preserveOrder As Boolean) As Variant
    ' QuickSort2D is a helper procedure that is not shown here.
    Dim valueArray() As Long, sortedArray() As Long, boolArray() As Boolean
    Dim i As Long, upperBound As Long, maxValue As Long, minValue As Long, offset As Long
    Dim lowerBound As Long, currentIndex As Long, count As Long, valueRange As Long

    upperBound = UBound(inputArray)
    lowerBound = LBound(inputArray)

    ' Seed min and max from the data so the flag array spans only the actual value range.
    minValue = inputArray(lowerBound)
    maxValue = minValue
    For i = lowerBound To upperBound
        If inputArray(i) > maxValue Then maxValue = inputArray(i)
        If inputArray(i) < minValue Then minValue = inputArray(i)
    Next i

    ' Adding offset maps minValue to slot 0 and maxValue to slot valueRange.
    offset = -minValue
    valueRange = maxValue - minValue

    If preserveOrder Then
        ' Row 1 holds the value, row 2 the position of its first occurrence.
        ReDim sortedArray(1 To 2, 1 To upperBound - lowerBound + 1)
        ReDim valueArray(1 To 2, 0 To valueRange)
        ReDim boolArray(0 To valueRange)
        For i = lowerBound To upperBound
            currentIndex = inputArray(i) + offset
            ' Record a value only the first time it is seen (also correct for index 0).
            If Not boolArray(currentIndex) Then
                boolArray(currentIndex) = True
                valueArray(1, currentIndex) = inputArray(i)
                valueArray(2, currentIndex) = i
            End If
        Next i
        For i = 0 To valueRange
            If boolArray(i) Then
                count = count + 1
                sortedArray(1, count) = valueArray(1, i)
                sortedArray(2, count) = valueArray(2, i)
            End If
        Next i
        ' Sort by first position (row 2) to restore the input order.
        QuickSort2D sortedArray, 2, 1, count, 2
        ReDim Preserve sortedArray(1 To 2, 1 To count)
    Else
        ReDim sortedArray(1 To upperBound - lowerBound + 1)
        ReDim valueArray(0 To valueRange)
        ReDim boolArray(0 To valueRange)
        For i = lowerBound To upperBound
            currentIndex = inputArray(i) + offset
            boolArray(currentIndex) = True
            valueArray(currentIndex) = inputArray(i)
        Next i
        ' Scanning the flags in slot order yields the unique values in ascending order.
        For i = 0 To valueRange
            If boolArray(i) Then
                count = count + 1
                sortedArray(count) = valueArray(i)
            End If
        Next i
        ReDim Preserve sortedArray(1 To count)
    End If
    BooleanIndexDeduplication = sortedArray
End Function
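The core of the technique can also be shown without any sorting helper. The sketch below is a variant, not the listing above: it emits each value the first time its flag is found unset, so the result keeps first-occurrence order after a single pass over the input. The name UniqueLongsInOrder is hypothetical, and the code assumes a non-empty input whose value range (maximum minus minimum) fits in a Long and in available memory.

```vba
' Hypothetical variant: unique values of a Long array in first-occurrence order,
' using one Boolean flag per possible value and no sorting step.
Public Function UniqueLongsInOrder(ByRef inputArray() As Long) As Long()
    Dim seen() As Boolean, result() As Long
    Dim i As Long, minValue As Long, maxValue As Long
    Dim slot As Long, uniqueCount As Long

    ' Find the value range so that each value maps to a slot from 0 upward.
    minValue = inputArray(LBound(inputArray))
    maxValue = minValue
    For i = LBound(inputArray) To UBound(inputArray)
        If inputArray(i) < minValue Then minValue = inputArray(i)
        If inputArray(i) > maxValue Then maxValue = inputArray(i)
    Next i

    ReDim seen(0 To maxValue - minValue)
    ReDim result(1 To UBound(inputArray) - LBound(inputArray) + 1)

    For i = LBound(inputArray) To UBound(inputArray)
        slot = inputArray(i) - minValue
        If Not seen(slot) Then
            ' First sighting: set the flag and append the value to the result.
            seen(slot) = True
            uniqueCount = uniqueCount + 1
            result(uniqueCount) = inputArray(i)
        End If
    Next i

    ReDim Preserve result(1 To uniqueCount)
    UniqueLongsInOrder = result
End Function

Sub DemoUniqueLongsInOrder()
    Dim data(0 To 5) As Long, u() As Long, i As Long
    data(0) = 7: data(1) = -2: data(2) = 7
    data(3) = 3: data(4) = -2: data(5) = 10

    u = UniqueLongsInOrder(data)
    For i = LBound(u) To UBound(u)
        Debug.Print u(i)        ' prints 7, -2, 3, 10 on separate lines
    Next i
End Sub
```

The trade-off is that this variant returns values in input order only; when ascending order is wanted, scanning the flag array from slot 0 upward, as the listing does, is the natural choice.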
Performance Comparison and Selection Recommendations
Performance benchmarking shows that each method has its own strengths in different scenarios:
Collection Method: Suitable for general scenarios, featuring concise code and good performance with non-integer data, but may encounter performance bottlenecks with ultra-large-scale data.
Dictionary Method: Excellent performance with data volumes less than 500,000, particularly suitable for processing string and variant type data, with good generality.
Sorting Deduplication Algorithm: Suitable for large-scale numerical data processing, demonstrating significant advantages when processing million-level data through optimized sorting and deduplication steps.
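Boolean Indexing Method: The fastest option for integer arrays, because values map directly to array indices; it requires additional memory proportional to the value range.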
When selecting a deduplication method, weigh the following factors:
Data Scale: For small-scale data (fewer than 100,000 elements), Collection or Dictionary can be used; for large-scale data, consider the sorting or indexing methods.
Data Type: Integer data should favor the Boolean indexing method; non-integer data should use Collection or Dictionary.
Memory Constraints: The Boolean indexing method needs additional memory to map the value range, so watch memory usage when the range is large.
Order Requirements: If the original order must be maintained, choose a method that supports index preservation.
By understanding the principles and performance characteristics of each method, developers can select the most appropriate array deduplication solution for a given scenario and balance code simplicity, execution efficiency, and memory usage.
Practical Application Considerations
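The memory constraint can be made concrete with a quick estimate. A VBA Boolean occupies 2 bytes, so the flag array alone needs about 2 bytes per possible value in the range. The sketch below computes that figure; the function name FlagArrayBytes is hypothetical, and the estimate covers only the Boolean flags, not any additional Long arrays an implementation allocates over the same range.

```vba
' Hypothetical helper: approximate size in bytes of a Boolean flag array
' covering every integer from minValue to maxValue inclusive.
Public Function FlagArrayBytes(ByVal minValue As Long, ByVal maxValue As Long) As Double
    ' Convert to Double before subtracting so a wide range cannot overflow a Long.
    FlagArrayBytes = (CDbl(maxValue) - CDbl(minValue) + 1#) * 2#   ' 2 bytes per Boolean
End Function

Sub DemoFlagArrayBytes()
    Debug.Print FlagArrayBytes(-1000, 1000)    ' prints 4002
    Debug.Print FlagArrayBytes(0, 9999999)     ' prints 20000000
End Sub
```

A handful of values spread across a range of ten million therefore costs roughly 20 MB of flags, which is where Collection or Dictionary becomes the more economical choice.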