Keywords: Git repository analysis | Large file detection | Historical commit cleanup
Abstract: This article explores practical techniques for identifying large files in a Git repository's history. After reviewing Git's object storage model, it presents a script built around the git verify-pack command that quickly locates the largest objects in the repository. The discussion extends to mapping those objects back to specific commits, performance considerations, and practical application scenarios. The approach is particularly valuable for diagnosing repository bloat caused by accidentally committed large files, enabling developers to clean Git history efficiently.
Analysis of Git Repository Size Anomalies
In practice, abnormal growth of a Git repository is a common problem. Developers often find that the working directory contains only a handful of files, yet the .git directory is disproportionately large. This typically happens because historical commits contain large files (images, videos, binary documents) whose objects persist in Git's object database even after the files are deleted in later commits.
Understanding Git Object Storage Mechanism
Git employs a content-addressable file system to store project history. Each file (blob), directory tree (tree), and commit is stored as an object identified by SHA-1 hash values. Once files are committed, their objects remain in Git's database even after deletion in later commits, unless garbage collection is performed.
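A quick way to see this behaviour is a throwaway experiment (a minimal sketch, assuming only that git is on the PATH; the repository location and file names are illustrative):

```shell
#!/bin/sh
# Demonstrate that a deleted file's blob survives in the object database.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'pretend this is a huge binary\n' > big.bin
git add big.bin
blob=$(git rev-parse :big.bin)   # SHA-1 of the staged blob
git -c user.name=demo -c user.email=demo@example.com commit -qm "add big.bin"
git rm -q big.bin
git -c user.name=demo -c user.email=demo@example.com commit -qm "remove big.bin"
# big.bin is gone from the working tree and from HEAD, but its object is
# still reachable through the first commit, so cat-file still finds it:
git cat-file -e "$blob" && echo "blob $blob still present after deletion"
```

The blob only disappears once nothing references it and garbage collection (git gc) prunes it, which is exactly why a one-time large commit keeps inflating the repository.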
Core Detection Script Implementation
The following script, based on the git verify-pack command, efficiently identifies the largest objects in the repository:
#!/bin/bash
#set -x
# Shows the largest objects in your repo's pack file
# Written for macOS
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# Set internal field separator to line break for easy iteration over verify-pack output
IFS=$'\n';
# List all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
# Extract the uncompressed size and convert bytes to kB (field 5, because
# the space-padded type column produces empty space-delimited fields)
size=$((`echo $y | cut -f 5 -d ' '`/1024))
# Extract the size inside the pack file, also converted to kB
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
# Extract the SHA
sha=`echo $y | cut -f 1 -d ' '`
# Find the object's location in the repository tree
other=`echo "${allObjects}" | grep $sha`
output="${output}\n${size},${compressedSize},${other}"
done
echo -e $output | column -t -s ', '
Script Execution Process in Detail
The script runs in four main steps: first, git verify-pack -v analyzes the pack index files and reports every packed object along with its sizes; next, grep -v chain drops the "chain length" statistics lines that verify-pack appends after the object list; the remaining lines are then sorted in descending order of object size and truncated to the top entries; finally, each entry is processed to extract its size, packed size, and SHA, and the matching file path is recovered from the git rev-list --all --objects output.
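If the space-delimited field arithmetic feels fragile, a widely used alternative (not part of the original script; the repository content and file names below are illustrative) pipes git rev-list --objects into git cat-file --batch-check, which also covers loose objects that verify-pack never sees:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
head -c 200000 /dev/zero > video.bin    # stand-in for a large binary
echo "readme" > README.md
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "import"
# Emit "type sha size path" for every reachable object, keep only blobs,
# and list the largest first:
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '$1 == "blob" {print $3, $2, $4}' \
  | sort -rn \
  | head
```

Adding %(objectsize:disk) to the format string reports the compressed on-disk size as well, mirroring the "pack" column of the script's output.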
Output Result Analysis
The script output contains four columns: original size (KB), compressed size (KB), object SHA hash value, and file path. Sample output might display:
size pack SHA location
65183 43021 bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f path/to/some-video-1080p.mp4
12446 8902 2ba44098e28f8f66bac5e21210c2774085d2319b path/to/hires-image.png
530 412 0d99bb93129939b72069df14af0d0dbda7eb6dba path/to/some-image.jpg
Locating Related Commits
After obtaining the SHA values of large objects, the next step is to identify the commits that introduced (or removed) them. With Git 2.16 or later, use the following command to find commits that change the occurrence count of a specific blob:
git log --all --find-object=<blob-sha>
Alternatively, on older Git versions, scan each commit's tree directly:
git rev-list --all | while read -r commit; do
  if git ls-tree -r "$commit" | grep -q "<blob-sha>"; then
    echo "$commit"
  fi
done
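The --find-object approach can be exercised end to end in a scratch repository (a sketch; the file names are illustrative, and --find-object requires Git 2.16+):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
head -c 100000 /dev/zero > huge.dat
git add huge.dat
blob=$(git rev-parse :huge.dat)          # the blob we will hunt for
git -c user.name=demo -c user.email=demo@example.com commit -qm "add huge.dat"
git rm -q huge.dat
git -c user.name=demo -c user.email=demo@example.com commit -qm "drop huge.dat"
# Both the commit that introduced the blob and the one that removed it
# change its occurrence count, so both are reported:
git log --all --find-object="$blob" --oneline
```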
Performance Optimization Considerations
For large repositories (like the Linux kernel repository with 5.6 million objects), script performance is crucial. The script keeps analysis time reasonable by making a single pass over the verify-pack output and limiting the result set with head. In practical applications, adjust head's -n parameter to balance output detail against execution time.
Practical Application Scenarios
Beyond identifying accidentally committed large files, this technique is useful for repository health checks, continuous-integration optimization, and storage space management. Note that in CI environments, rewriting history to remove large files may also require clearing build caches (for example, source caches in CircleCI) before the change takes full effect.
Advanced Tool Recommendations
For more complex repository analysis needs, consider the git-filter-repo tool's --analyze option, which writes a comprehensive set of repository statistics. Additionally, the git rev-list --disk-usage option (Git 2.31+) reports the total on-disk size of the objects reachable from a given set of refs, which helps compare branches or quantify the cost of directories full of small files.
Best Practice Recommendations
To prevent similar issues, establish commit standards within the team, maintain a thorough .gitignore, and consider adding file size checks to a pre-commit hook. Regularly running the repository size analysis script enables early detection of potential problems, before the repository grows without bound.
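Such a file-size gate can be sketched as a pre-commit hook (a hypothetical hook, not a standard one; the 64 kB limit and file names are arbitrary, and the simple read loop assumes paths without embedded newlines):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical hook: reject staged files larger than MAX_KB.
cat > .git/hooks/pre-commit <<'HOOK'
#!/bin/sh
MAX_KB=64
git diff --cached --name-only --diff-filter=AM | while read -r f; do
  kb=$(( $(git cat-file -s "$(git rev-parse ":$f")") / 1024 ))
  if [ "$kb" -gt "$MAX_KB" ]; then
    echo "pre-commit: $f is ${kb} kB (limit ${MAX_KB} kB)" >&2
    exit 1
  fi
done
HOOK
chmod +x .git/hooks/pre-commit
head -c 200000 /dev/zero > big.iso
git add big.iso
# The commit should be rejected because big.iso exceeds the limit:
if git -c user.name=demo -c user.email=demo@example.com commit -qm "add big.iso"; then
  echo "commit unexpectedly succeeded"
else
  echo "commit blocked by pre-commit hook"
fi
```

The hook inspects the staged blob's size via git cat-file -s, so it measures exactly what would enter history rather than the working-tree copy.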