Application Research of Short Hash Functions in Unique Identifier Generation

Keywords: Short Hash | Unique Identifier | SHA-1 Truncation | Adler-32 | SHAKE Algorithm

Abstract: This paper provides an in-depth exploration of technical solutions for generating short-length unique identifiers using hash functions. Through analysis of three methods - SHA-1 hash truncation, Adler-32 lightweight hash, and SHAKE variable-length hash - it comprehensively compares their performance characteristics, collision probabilities, and application scenarios. The article offers complete Python implementation code and performance evaluations, providing theoretical foundations and practical guidance for developers selecting appropriate short hash solutions.

Overview of Short Hash Generation Technology

In modern software development, generating unique identifiers based on message content is a common requirement. Unlike randomly generated IDs, content-based identifiers offer determinism and reproducibility, facilitating data tracking and verification. However, traditional hash functions typically produce lengthy outputs, necessitating more compact representations in certain scenarios.

SHA-1 Hash Truncation Method

The SHA-1 algorithm, as a classical cryptographic hash function, maps inputs of any length to a fixed 160-bit output. By extracting the first N characters of the output result, the desired short hash value can be obtained. The core advantage of this method lies in maintaining SHA-1's strong collision resistance while providing flexible hash length control.

Specific implementation in Python:

import hashlib

def generate_short_hash(message, length=10):
    """
    Generate short hash of specified length
    
    Parameters:
        message: Input message string
        length: Desired hash length (character count)
    
    Returns:
        Truncated hash string
    """
    # Use UTF-8 encoding to ensure multilingual support
    encoded_message = message.encode("UTF-8")
    
    # Calculate SHA-1 hash
    full_hash = hashlib.sha1(encoded_message).hexdigest()
    
    # Extract specified length
    short_hash = full_hash[:length]
    
    return short_hash

# Usage example
message = "my message"
result = generate_short_hash(message, 10)
print(f"Original message: {message}")
print(f"10-character hash: {result}")

Collision Probability Analysis

When employing hash truncation techniques, careful consideration of collision probability is essential. For N-bit hash values, theoretical collision probability follows the birthday paradox principle. Specifically, 10-character (40-bit) hash collision probability becomes significant when storing approximately 1 million records. Developers should select appropriate hash lengths based on actual data scale and security requirements.

Collision probability calculation formula:

def collision_probability(n_bits, num_items):
    """
    Calculate collision probability for given bit length and item count
    
    Parameters:
        n_bits: Hash bit length
        num_items: Number of stored items
    
    Returns:
        Probability of at least one collision
    """
    import math
    
    # Calculate collision probability using approximation formula
    p_collision = 1 - math.exp(-num_items**2 / (2 * (2**n_bits)))
    
    return p_collision

# Calculate collision probability for 10-character hash
bits_10_chars = 40  # 10 characters * 4 bits/character
items_100k = 100000
prob = collision_probability(bits_10_chars, items_100k)
print(f"Collision probability for 10-character hash with 100k records: {prob:.6f}")

Alternative Approach: Adler-32 Algorithm

Adler-32 is a lightweight checksum algorithm particularly suitable for scenarios requiring fast computation with moderate security requirements. This algorithm produces 32-bit output, which converts to 8-character short hash in hexadecimal representation.

Adler-32 implementation example:

import zlib

def adler32_hash(message):
    """
    Generate short hash using Adler-32 algorithm
    
    Parameters:
        message: Input message string
    
    Returns:
        8-character Adler-32 hash value
    """
    # Calculate Adler-32 checksum
    checksum = zlib.adler32(message.encode("UTF-8"))
    
    # Convert to 8-character hexadecimal string
    hash_hex = format(checksum & 0xFFFFFFFF, '08x')
    
    return hash_hex

# Usage example
message = "test message"
result = adler32_hash(message)
print(f"Adler-32 hash: {result}")

Variable-Length Hash: SHAKE Algorithm

SHAKE (Secure Hash Algorithm-Keccak) is specifically designed for generating variable-length hashes. Unlike traditional fixed-length hashes, SHAKE allows developers to precisely specify output length, avoiding information loss from truncation operations.

SHAKE-256 implementation example:

def shake_hash(message, output_length=5):
    """
    Generate hash of specified length using SHAKE-256
    
    Parameters:
        message: Input message string
        output_length: Desired byte length (not character length)
    
    Returns:
        Hash string of specified length
    """
    # Calculate SHAKE-256 hash
    hash_obj = hashlib.shake_256(message.encode("UTF-8"))
    
    # Get hash of specified byte length, convert to hexadecimal
    hash_bytes = hash_obj.digest(output_length)
    hash_hex = hash_bytes.hex()
    
    return hash_hex

# Usage example
message = "hello shake"
result = shake_hash(message, 5)  # 5 bytes correspond to 10 characters
print(f"SHAKE-256 hash: {result}")

Performance Comparison and Selection Recommendations

In practical applications, different hash schemes have distinct advantages and disadvantages:

SHA-1 Truncation: Higher security but relatively large computational overhead, suitable for security-sensitive scenarios
Adler-32: Fast computation speed, small memory footprint, but weaker collision resistance, suitable for internal system use
SHAKE Algorithm: Maximum flexibility, no truncation operations required, preferred choice for modern applications

Performance testing code example:

import time

def benchmark_hash_functions():
    """
    Benchmark performance of different hash functions
    """
    test_message = "This is a test message for performance comparison" * 100
    
    # Test SHA-1 truncation
    start_time = time.time()
    for _ in range(1000):
        generate_short_hash(test_message, 10)
    sha1_time = time.time() - start_time
    
    # Test Adler-32
    start_time = time.time()
    for _ in range(1000):
        adler32_hash(test_message)
    adler_time = time.time() - start_time
    
    # Test SHAKE
    start_time = time.time()
    for _ in range(1000):
        shake_hash(test_message, 5)
    shake_time = time.time() - start_time
    
    print(f"SHA-1 truncation time: {sha1_time:.4f} seconds")
    print(f"Adler-32 time: {adler_time:.4f} seconds")
    print(f"SHAKE-256 time: {shake_time:.4f} seconds")

benchmark_hash_functions()

Practical Application Scenarios

Short hash technology holds significant application value across multiple domains:

URL Shortening: Mapping long URLs to short hashes for easy sharing and storage
Cache Key Generation: Generating compact cache identifiers based on content
File Deduplication: Quickly identifying duplicate files through content hashing
Distributed System IDs: Generating content-based unique identifiers in distributed environments

During actual deployment, consider the following factors:

Data scale and collision tolerance
Computational resource constraints
Security and attack resistance requirements
System compatibility and standardization needs

Conclusion and Future Outlook

Short hash generation technology provides flexible and efficient unique identifier solutions for modern software development. By appropriately selecting hash algorithms and output lengths, developers can optimize storage and transmission efficiency while ensuring uniqueness. With continuous development in cryptography, future optimized algorithms specifically designed for short hash scenarios may emerge, offering better support for various application contexts.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.