Strategies for Identifying and Cleaning Large .pack Files in Git Repositories

Keywords: Git | .pack file | history rewriting | garbage collection | repository optimization

Abstract: This article examines why large .pack files appear in Git repositories and how to remove them. Using a real user case, it explains why deleted files remain in history and presents a complete solution: rewriting history with git filter-branch, then reclaiming the space with git gc. It closes with preventive measures and best practices that help developers keep repository size under control.

Problem Background and Cause Analysis

In Git, a repository sometimes grows abnormally large, most visibly as large .pack files in the .git/objects/pack/ directory. The usual cause is that earlier commits contain many or large files. Even if these files are later deleted with git rm, they remain in Git's history. Git preserves every historical change so that any version can be restored, so git rm removes files only from the working directory and index; the corresponding Git objects persist in the repository.

In the user case, the developer committed many files to a branch with git add . and git commit, merged that branch into the main branch, and then deleted the files with git rm -rf unwanted_folder/. Although the files disappeared from the working tree, the related Git objects (blobs, trees, and commits) are still stored in the .pack file. Git uses pack files to compress and store many objects efficiently, but a pack that carries extensive historical data can still become very large.
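Before rewriting anything, it helps to confirm which objects actually account for the size. The following is a minimal sketch, not part of the original case: it lists the largest blobs reachable from any ref together with their paths. The function name largest_blobs is hypothetical; the Git commands it calls (git rev-list --objects and git cat-file --batch-check) are standard.

```shell
#!/bin/sh
# Sketch: list the N largest blobs reachable from any ref (default 10).
# Output columns: size in bytes, object id, path. The largest entry comes last.
# largest_blobs is a hypothetical helper name, not a Git command.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    grep '^blob ' |
    cut -d' ' -f2- |
    sort -n -k1,1 |
    tail -n "${1:-10}"
}

# Usage, from inside the repository:
#   largest_blobs 5
```

The sizes reported are uncompressed object sizes, so they indicate which paths are worth removing rather than the exact number of bytes each one occupies in the pack.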

Core Solution: History Rewriting and Garbage Collection

To remove these objects completely, Git history must be rewritten so that no commit or reference points to them. The main steps are as follows:

First, use git filter-branch to rewrite history and remove every reference to the specific files or directories. For the unwanted_folder/ in the case, execute:

git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch unwanted_folder' --prune-empty

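This command walks the commits reachable from the current branch and removes the specified path from each commit's index. Appending -- --all to the command extends the rewrite to all refs; add --tag-name-filter cat if tags should be updated to point at the rewritten commits. The --index-filter option modifies the index directly without checking out files to the working directory, which makes it much faster than a tree filter. --ignore-unmatch keeps git rm from failing on commits where the path does not exist, and --prune-empty drops commits that become empty once the path is removed.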
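As an illustration, the same rewrite can be wrapped in a small function that covers all refs. This is a sketch under stated assumptions: purge_path is a hypothetical name, the path is assumed not to contain single quotes, and the working tree is assumed to be clean. FILTER_BRANCH_SQUELCH_WARNING is the environment variable that recent Git versions read to skip the filter-branch warning and its pause.

```shell
#!/bin/sh
# Sketch: remove one path from every commit on every ref.
# purge_path is a hypothetical helper; run it on a fresh clone or a backup first.
# Assumption: the path contains no single quotes (it is embedded in a quoted filter).
purge_path() {
  path="$1"
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter "git rm -r --cached --ignore-unmatch --quiet -- '$path'" \
    --prune-empty \
    --tag-name-filter cat \
    -- --all
}

# Usage, from the top level of the repository:
#   purge_path unwanted_folder
```

The --force flag lets the command run even if an earlier filter-branch left backups under refs/original, which is convenient when several paths are purged one after another.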

After history rewriting, clean up residual references and perform garbage collection. Run the following commands in sequence:

git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --aggressive --prune=now

git for-each-ref piped into git update-ref deletes the backup references that filter-branch saves under refs/original; git reflog expire --expire=now --all expires every reflog entry immediately; git gc --aggressive --prune=now repacks the repository and prunes all unreachable objects at once instead of waiting for the default two-week grace period. All three steps are needed, because any backup ref or reflog entry that still points at the old commits keeps their objects reachable and therefore in the pack file.
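To check that the cleanup actually reclaimed space, the pack size can be measured before and after. The sketch below bundles the three cleanup commands into one function and reads the size-pack field (reported in KiB) from git count-objects -v. Both function names are hypothetical helpers, not Git commands.

```shell
#!/bin/sh
# Sketch: measure the pack size and run the post-rewrite cleanup.
# pack_size_kib and cleanup_after_rewrite are hypothetical helper names.

# Print the total size of all pack files in KiB.
pack_size_kib() {
  git count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Drop filter-branch backups, expire reflogs, then repack and prune.
cleanup_after_rewrite() {
  git for-each-ref --format='delete %(refname)' refs/original |
    git update-ref --stdin
  git reflog expire --expire=now --all
  git gc --quiet --aggressive --prune=now
}

# Usage, after the history rewrite has finished:
#   before=$(pack_size_kib)
#   cleanup_after_rewrite
#   echo "pack size: ${before} KiB -> $(pack_size_kib) KiB"
```

If the size does not drop, something still references the old commits; a remote-tracking branch or a tag that was not rewritten is a common reason.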

Supplementary Measures and Prevention Strategies

In addition to the core method above, other strategies help keep a repository manageable. Running git gc regularly packs loose objects and removes unreachable data, but without history rewriting it does not remove objects that are still reachable from a commit, ref, or reflog entry. For repositories containing large binary files, Git LFS (Large File Storage) is recommended, because it keeps the file contents out of regular Git history. Shallow clones reduce the initial download size, and when necessary the repository can be split into several smaller ones.

In practice, rewriting history changes commit IDs and therefore affects every collaborator. In a team environment it must be used cautiously and announced in advance: the rewritten branches have to be force-pushed, and all members need to re-clone or reset their local branches to the updated history rather than merging the old one back in. By combining history cleanup, regular maintenance, and a sensible workflow, Git repository size can be kept under control.
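Two of these preventive measures can be sketched in a few lines. The function names below are hypothetical, and track_with_lfs assumes the separate git-lfs extension is installed. Note that --depth is ignored when cloning from a plain local path, so a local source has to be given as a file:// URL.

```shell
#!/bin/sh
# Sketch of two preventive measures; both function names are hypothetical.

# Clone only the most recent commits: shallow_clone <url> <dir> [depth]
shallow_clone() {
  git clone --quiet --depth "${3:-1}" "$1" "$2"
}

# Route files matching a pattern through Git LFS (requires git-lfs).
# The pattern is recorded in .gitattributes, which must be committed.
track_with_lfs() {
  git lfs track "$1" && git add .gitattributes
}

# Usage:
#   shallow_clone https://example.com/project.git project
#   track_with_lfs "*.psd"
```

LFS tracking only affects files added after the pattern is recorded; files already committed stay in history until it is rewritten.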

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
