Keywords: Git optimization | repository compression | large file cleanup
Abstract: This article addresses the issue of excessive .git folder size in Git repositories, providing systematic solutions. It first analyzes common causes of repository bloat, such as frequently changed binary files and historical accumulation. Then, it details the git repack command recommended by Linus Torvalds and its parameter optimizations to improve compression efficiency through depth and window settings. The article also discusses the drawbacks of git gc --aggressive and supplements methods for identifying and cleaning large files, including script detection and git filter-branch for history rewriting. Finally, it emphasizes considerations for team collaboration to ensure the optimization process does not compromise remote repository stability.
In software development, Git, as a distributed version control system, may see its local repository's .git folder grow excessively large through historical accumulation, hurting storage efficiency and operational performance. A common scenario: the working tree is only 200MB, yet the .git folder reaches 5GB, typically caused by frequently committed binary files or an unoptimized storage structure. This article examines how to safely and effectively reduce the .git folder size to free up local storage space.
Git Storage Mechanism and Causes of Volume Bloat
Git records each commit as a snapshot of the full tree, not as a set of diffs (pack files do later delta-compress similar objects against each other). Text files deltify and compress well; binary files (e.g., images, executables, archives) do not, so each modification adds a nearly full-size new object and the repository grows rapidly. Additionally, large files from historical commits remain in .git/objects even after being deleted from the working tree, unless history is explicitly rewritten and cleaned.
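Before optimizing, it helps to confirm where the space is actually going. A quick diagnostic using only standard Git commands:

```shell
# Total size of the metadata directory
du -sh .git

# Breakdown of loose vs. packed objects, in human-readable units.
# The 'size-pack' line is the number that repacking should shrink.
git count-objects -vH
```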
Core Optimization Command: git repack
Linus Torvalds has recommended using the git repack command directly for repository compression, rather than git gc --aggressive. The latter does not endanger your data, but it discards all existing (often well-optimized) deltas and recomputes everything with default window and depth settings, so it can take far longer while producing a pack no smaller than before — which is why he considered it bad practice. The recommended command is as follows:
git repack -a -d -f --depth=250 --window=250
Parameter breakdown: -a repacks all objects into a single pack; -d deletes packs and loose objects made redundant by the new pack; -f forces recomputation of deltas instead of reusing existing ones; --depth caps the delta-chain length, with longer chains compressing better at the cost of slower object access; --window sets how many objects are considered as delta-base candidates for each object, which governs how good the chosen deltas are. For repositories with long histories, larger values (e.g., 250 for both) yield noticeably better compression; schedule the run overnight, as it can take hours.
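A sketch of running the repack with a before/after measurement (the 250/250 values are the article's aggressive settings; smaller repositories can stay with the defaults):

```shell
# Snapshot the pack size before optimizing
git count-objects -vH | grep size-pack

# Recompute all deltas with a deep chain and a wide search window.
# CPU- and memory-hungry; expect a long run on large histories.
git repack -a -d -f --depth=250 --window=250

# Compare: size-pack should be the same or smaller
git count-objects -vH | grep size-pack
```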
Identifying and Cleaning Large Files
If the repository contains mistakenly committed large files (e.g., SQL dumps), git repack alone cannot remove them entirely. Use a script to detect large objects:
#!/bin/bash
# Shows the ten largest objects in the repository's pack files.
IFS=$'\n';
# verify-pack -v columns: SHA type size size-in-pack offset [depth base-SHA]
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`;
echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.";
output="size,pack,SHA,location";
# Cache the full object list once instead of re-running rev-list per object
allObjects=`git rev-list --all --objects`;
for y in $objects; do
# uncompressed and in-pack sizes, converted from bytes to kB
size=$((`echo $y | cut -f 5 -d ' '`/1024));
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024));
sha=`echo $y | cut -f 1 -d ' '`;
# map the SHA back to a path in the repository tree
other=`echo "${allObjects}" | grep $sha`;
output="${output}\n${size},${compressedSize},${other}";
done
echo -e $output | column -t -s ', '
This script lists the ten largest objects in the repository, helping to identify problematic files. Once they are confirmed, rewrite history to delete them permanently. The command below uses git filter-branch (current Git versions warn that filter-branch is slow and error-prone and suggest git filter-repo as a replacement, but the invocation still works); replace filename with the path to remove:
git filter-branch --tag-name-filter cat --index-filter 'git rm -r --cached --ignore-unmatch filename' --prune-empty -f -- --all
After this operation, clean up residual objects:
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
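A quick sanity check after the rewrite and cleanup, where filename is the same placeholder path removed by the filter-branch command above:

```shell
# The path should no longer appear in any commit
git log --all --oneline -- filename    # expect no output

# The object store should have shrunk accordingly
git count-objects -vH
```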
Team Collaboration and Remote Repository Synchronization
Rewriting history affects all collaborators. After optimization, force push to the remote repository:
git push origin --force --all
git push origin --force --tags
Team members must either re-clone the repository or rebase any local branches onto the rewritten history, so that the old commits are not merged back in and reintroduced. Coordinate the window with the whole team and make sure no one pushes while the optimization is in progress.
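For collaborators who keep their existing clone instead of re-cloning, a reset sketch (assumes the branch is named main and that they have no unpushed local work, which a hard reset would discard):

```shell
# Fetch the rewritten history
git fetch origin

# Hard-reset the local branch onto the rewritten remote branch.
# WARNING: discards any local commits on this branch.
git checkout main
git reset --hard origin/main

# Drop the old, now-unreachable objects so the clone actually shrinks
git reflog expire --expire=now --all
git gc --prune=now
```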
Summary and Best Practices
Reducing .git folder size comes down to three tools used together: git repack for storage optimization, a detection script for locating large objects, and git filter-branch (or git filter-repo) for history cleaning. Avoid schemes such as periodically discarding changes older than 30 days; truncating history breaks version continuity and traceability. Instead, run git repack regularly to keep packs tight, and manage binary assets with Git LFS (Large File Storage) so they never bloat the object store in the first place. With these methods, a 5GB .git folder can often be shrunk dramatically, improving day-to-day development efficiency.
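As a preventive measure, Git LFS stores tracked binaries outside the normal object database and commits only small pointer files. A minimal setup sketch (assumes the git-lfs client is installed; *.zip and *.png are example patterns, not from the article):

```shell
# One-time setup per machine
git lfs install

# Track binary patterns; this appends rules to .gitattributes, e.g.
#   *.zip filter=lfs diff=lfs merge=lfs -text
git lfs track "*.zip" "*.png"

# The tracking rules must themselves be committed
git add .gitattributes
git commit -m "Track binaries with Git LFS"
```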