Keywords: Git repository analysis | Large file detection | Historical commit cleanup
Abstract: This article explores practical techniques for identifying large files in a Git repository's history. After reviewing Git's object storage model, it presents a script built around the git verify-pack command that quickly locates the largest objects in the repository. The discussion extends to mapping those objects back to specific commits, performance considerations, and practical application scenarios. The approach is particularly valuable for diagnosing repository bloat caused by accidentally committed large files, enabling developers to clean Git history efficiently.
Analysis of Git Repository Size Anomalies
In practice, abnormal growth of a Git repository is a common problem. Developers often find that the working directory contains only a handful of files, yet the .git directory is disproportionately large. This typically happens because historical commits contain large files (images, videos, binary documents) whose objects persist in Git's object database even after the files are deleted in later commits.
Understanding Git Object Storage Mechanism
Git employs a content-addressable file system to store project history. Each file (blob), directory tree (tree), and commit is stored as an object identified by SHA-1 hash values. Once files are committed, their objects remain in Git's database even after deletion in later commits, unless garbage collection is performed.
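A quick way to see this behaviour is a throwaway experiment (a minimal sketch, assuming only that git is on the PATH; the repository location and file names are illustrative):

```shell
#!/bin/sh
# Demonstrate that a deleted file's blob survives in the object database.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'pretend this is a huge binary\n' > big.bin
git add big.bin
blob=$(git rev-parse :big.bin)   # SHA-1 of the staged blob
git -c user.name=demo -c user.email=demo@example.com commit -qm "add big.bin"
git rm -q big.bin
git -c user.name=demo -c user.email=demo@example.com commit -qm "remove big.bin"
# big.bin is gone from the working tree and from HEAD, but its object is
# still reachable through the first commit, so cat-file still finds it:
git cat-file -e "$blob" && echo "blob $blob still present after deletion"
```

The blob only disappears once nothing references it and garbage collection (git gc) prunes it, which is exactly why a one-time large commit keeps inflating the repository.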
Core Detection Script Implementation
The following script, based on the git verify-pack command, efficiently identifies the largest objects in the repository:
#!/bin/bash
#set -x
# Shows the largest objects in your repo's pack file
# Written for macOS
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# Set internal field separator to line break for easy iteration over verify-pack output
IFS=$'\n';
# List all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
# Extract the uncompressed size and convert bytes to kB (field 5, because
# the space-padded type column produces empty space-delimited fields)
size=$((`echo $y | cut -f 5 -d ' '`/1024))
# Extract the size inside the pack file, also converted to kB
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
# Extract the SHA
sha=`echo $y | cut -f 1 -d ' '`
# Find the object's location in the repository tree
other=`echo "${allObjects}" | grep $sha`
output="${output}\n${size},${compressedSize},${other}"
done
echo -e $output | column -t -s ', '
Script Execution Process in Detail
The script runs in four main steps: first, git verify-pack -v analyzes the pack index files and reports every packed object along with its sizes; next, grep -v chain drops the "chain length" statistics lines that verify-pack appends after the object list; the remaining lines are then sorted in descending order of object size and truncated to the top entries; finally, each entry is processed to extract its size, packed size, and SHA, and the matching file path is recovered from the git rev-list --all --objects output.
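If the space-delimited field arithmetic feels fragile, a widely used alternative (not part of the original script; the repository content and file names below are illustrative) pipes git rev-list --objects into git cat-file --batch-check, which also covers loose objects that verify-pack never sees:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
head -c 200000 /dev/zero > video.bin    # stand-in for a large binary
echo "readme" > README.md
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "import"
# Emit "type sha size path" for every reachable object, keep only blobs,
# and list the largest first:
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob" {print $3, $2, $4}' \
  | sort -rn \
  | head
```

Adding %(objectsize:disk) to the format string reports the compressed on-disk size as well, mirroring the "pack" column of the script's output.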
Output Result Analysis
The script output contains four columns: original size (KB), compressed size (KB), object SHA hash value, and file path. Sample output might display:
size pack SHA location
65183 43021 bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f path/to/some-video-1080p.mp4
12446 8902 2ba44098e28f8f66bac5e21210c2774085d2319b path/to/hires-image.png
530 412 0d99bb93129939b72069df14af0d0dbda7eb6dba path/to/some-image.jpg
Locating Related Commits
After obtaining the SHA values of large objects, the next step is to identify the commits that introduced (or removed) them. With Git 2.16 or later, use the following command to find commits that change the occurrence count of a specific blob:
git log --all --find-object=<blob-sha>
Alternatively, on older Git versions, scan each commit's tree directly:
git rev-list --all | while read -r commit; do
  if git ls-tree -r "$commit" | grep -q "<blob-sha>"; then
    echo "$commit"
  fi
done
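The --find-object approach can be exercised end to end in a scratch repository (a sketch; the file names are illustrative, and --find-object requires Git 2.16+):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
head -c 100000 /dev/zero > huge.dat
git add huge.dat
blob=$(git rev-parse :huge.dat)          # the blob we will hunt for
git -c user.name=demo -c user.email=demo@example.com commit -qm "add huge.dat"
git rm -q huge.dat
git -c user.name=demo -c user.email=demo@example.com commit -qm "drop huge.dat"
# Both the commit that introduced the blob and the one that removed it
# change its occurrence count, so both are reported:
git log --all --find-object="$blob" --oneline
```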
Performance Optimization Considerations
For large repositories (like the Linux kernel repository with 5.6 million objects), script performance is crucial. The script keeps analysis time reasonable by making a single pass over the verify-pack output and limiting the result set with head. In practical applications, adjust head's -n parameter to balance output detail against execution time.
Practical Application Scenarios
Beyond identifying accidentally committed large files, this technique is useful for repository health checks, continuous-integration optimization, and storage space management. Note that in CI environments, rewriting history to remove large files may also require clearing build caches (for example, source caches in CircleCI) before the change takes full effect.
Advanced Tool Recommendations
For more complex repository analysis needs, consider the git-filter-repo tool's --analyze option, which writes a comprehensive set of repository statistics. Additionally, the git rev-list --disk-usage option (Git 2.31+) reports the total on-disk size of the objects reachable from a given set of refs, which helps compare branches or quantify the cost of directories full of small files.
Best Practice Recommendations
To prevent similar issues, establish commit standards within the team, maintain a thorough .gitignore, and consider adding file size checks to a pre-commit hook. Regularly running the repository size analysis script enables early detection of potential problems, before the repository grows without bound.
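Such a file-size gate can be sketched as a pre-commit hook (a hypothetical hook, not a standard one; the 64 kB limit and file names are arbitrary, and the simple read loop assumes paths without embedded newlines):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical hook: reject staged files larger than MAX_KB.
cat > .git/hooks/pre-commit <<'HOOK'
#!/bin/sh
MAX_KB=64
git diff --cached --name-only --diff-filter=AM | while read -r f; do
  kb=$(( $(git cat-file -s "$(git rev-parse ":$f")") / 1024 ))
  if [ "$kb" -gt "$MAX_KB" ]; then
    echo "pre-commit: $f is ${kb} kB (limit ${MAX_KB} kB)" >&2
    exit 1
  fi
done
HOOK
chmod +x .git/hooks/pre-commit
head -c 200000 /dev/zero > big.iso
git add big.iso
# The commit should be rejected because big.iso exceeds the limit:
if git -c user.name=demo -c user.email=demo@example.com commit -qm "add big.iso"; then
  echo "commit unexpectedly succeeded"
else
  echo "commit blocked by pre-commit hook"
fi
```

The hook inspects the staged blob's size via git cat-file -s, so it measures exactly what would enter history rather than the working-tree copy.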