Keywords: Git | Version Control | History Rewriting | Large File Cleanup | Filter-Repo
Abstract: This technical article provides a comprehensive guide on permanently removing large files from Git repository history using the git filter-repo tool. Through detailed case analysis, it explains key steps including file identification, filtering operations, and remote repository updates, while offering best practice recommendations. Compared to traditional filter-branch methods, filter-repo demonstrates superior efficiency and compatibility, making it the recommended solution in modern Git workflows.
Problem Context and Challenges
In Git, developers occasionally commit large files (videos, archives, disk images) to a repository by mistake. Deleting such a file in a later commit does not shrink the repository: the file's contents stay reachable from the earlier commit, so every clone still carries them. A typical case is a DVD image added to a web project by accident; even if it is removed in the very next commit, it persists in history.
Limitations of Traditional Methods
Early solutions relied on the git filter-branch command, which has several drawbacks: it is slow on large histories, awkward to use correctly, and easy to run in a way that leaves the unwanted objects behind (for example in the refs/original backup refs it creates, or on the remote). More importantly, the Git documentation itself now warns against filter-branch and recommends git filter-repo instead.
Modern Solution: Git Filter-Repo
git filter-repo is a standalone tool, installed separately from Git, that is designed specifically for rewriting Git history. Compared to filter-branch it runs faster, is simpler to operate, and produces more reliable results. The following steps demonstrate the complete workflow through concrete examples.
Step 1: Identify Large Files
First, locate the largest objects in the repository with this command:
git rev-list --objects --all | grep -f <(git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | cut -f 1 -d " " | tail -10)
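This command prints the object ID and path of the 10 largest objects stored in the repository's packfiles, which helps identify targets for cleanup. It relies on Bash process substitution and only inspects packed objects, so run git gc first if the repository has not been packed recently.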
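As a complementary sketch, the pipeline below also reports each blob's size in bytes. It uses git cat-file --batch-check, which sees loose objects as well as packed ones. The function names top_blobs and list_largest_blobs are illustrative and not part of Git.

```shell
#!/bin/sh
# Illustrative helper (name is an assumption): read lines of the form
# "<type> <object-id> <size> <path>" on stdin, keep only blobs, and
# print the N largest (default 10), smallest first.
top_blobs() {
  awk '$1 == "blob"' | sort -k3,3n | tail -n "${1:-10}"
}

# List the largest blobs reachable from any ref.
# Run from inside the repository you want to inspect.
list_largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    top_blobs "${1:-10}"
}
```

Calling list_largest_blobs 5 inside a repository prints five lines, each with the object type, object ID, size in bytes, and path, with the largest blob last.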
Step 2: Perform Filtering Operations
Based on identification results, use filter-repo to remove specified files. Multiple matching patterns are supported:
Match by file path:
git filter-repo --path-glob '../../src/../..' --invert-paths --force
Match by file extension:
git filter-repo --path-glob '*.zip' --invert-paths --force
Match another extension, for example .a static library archives:
git filter-repo --path-glob '*.a' --invert-paths --force
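The --invert-paths parameter inverts the selection, so the matched paths are removed and everything else is kept; without it, filter-repo keeps only the matched paths. The --force parameter overrides filter-repo's safety check, which otherwise refuses to rewrite a repository that does not look like a fresh clone.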
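Each filter-repo invocation rewrites the whole history, so running the commands above one after another repeats that work. filter-repo accepts several --path-glob options in one invocation, which allows a single pass. A minimal sketch follows; the function name purge_globs and the DRY_RUN variable are assumptions made for illustration, not part of filter-repo.

```shell
#!/bin/sh
# Illustrative wrapper (name and DRY_RUN variable are assumptions):
# remove every path matching any of the given globs in ONE rewrite.
# Run from the top level of a fresh clone.
purge_globs() {
  if [ "$#" -eq 0 ]; then
    echo "usage: purge_globs GLOB..." >&2
    return 2
  fi

  # Rewrite the argument list from "P1 P2 ..." into
  # "--path-glob P1 --path-glob P2 ...".
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" --path-glob "$1"
    shift
    n=$((n - 1))
  done

  if [ "${DRY_RUN:-0}" = 1 ]; then
    # Only show the command that would be executed.
    echo "git filter-repo $* --invert-paths"
  else
    git filter-repo "$@" --invert-paths
  fi
}
```

Setting DRY_RUN=1 before calling purge_globs '*.zip' '*.a' prints the command instead of running it, which is a cheap way to review the patterns before history is rewritten.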
Step 3: Update Remote Repository
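After rewriting history locally, force push to the remote repository. By default filter-repo removes the origin remote as a safety measure, so add it back first (<repository-url> is a placeholder for your remote's URL):
git remote add origin <repository-url>
git push origin --all --force
git push origin --tags --force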
Note: Force pushing overwrites remote history. Exercise caution in collaborative environments and make sure every collaborator re-syncs afterwards.
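Because a force push is hard to undo, it can help to confirm locally that the unwanted paths are really gone before pushing. The sketch below is one way to do that with git log; the function names path_in_history and check_before_push are hypothetical.

```shell
#!/bin/sh
# Illustrative check (function name is an assumption): succeed if any
# commit reachable from any ref still touches a path matching the
# given pathspec, e.g. '*.zip'.
path_in_history() {
  [ -n "$(git log --all --format=%H -- "$1" | head -n 1)" ]
}

# Example gate: report an error while the pathspec is still in history.
check_before_push() {
  if path_in_history "$1"; then
    echo "'$1' is still present in history; do not push yet" >&2
    return 1
  fi
  echo "'$1' no longer appears in any commit"
}
```

Running check_before_push '*.zip' in the rewritten repository returns a non-zero status if any commit still contains a matching path.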
Operational Considerations
Before executing history rewriting operations, create a repository backup:
git clone --mirror original-repo.git backup-repo.git
For team projects, coordinate all developers:
- Notify team members to pause push operations
- Provide detailed operation instructions and update steps
- Ensure all local repositories are re-cloned or reset
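For collaborators who keep an existing clone instead of re-cloning, the reset step can be sketched as follows. The function name resync_branch and the default branch name main are assumptions; the reset discards local commits and uncommitted changes on that branch, so unpublished work should be saved elsewhere first.

```shell
#!/bin/sh
# Illustrative helper for collaborators with an existing clone
# (function name and the default branch "main" are assumptions).
# WARNING: discards local commits and uncommitted changes on the branch.
resync_branch() {
  branch="${1:-main}"
  git fetch origin --prune &&
    git checkout "$branch" &&
    git reset --hard "origin/$branch"
}
```

Re-cloning remains the safer option: an old clone still holds the pre-rewrite commits, and merging or pushing them would bring the large files back.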
Alternative Solution Comparison
Besides filter-repo, other tools are available:
BFG Repo-Cleaner: A Java tool specialized for cleaning large files, with simple operation:
java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
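BFG rewrites the commits but does not physically delete the old objects; its documentation follows the run with a reflog expiry and an aggressive garbage collection inside the repository. A minimal sketch of that follow-up step, with prune_unreachable being a hypothetical function name:

```shell
#!/bin/sh
# Illustrative helper (name is an assumption): physically delete
# objects that are no longer reachable after a history rewrite.
# Run inside the rewritten repository, e.g. after `cd my-repo.git`.
prune_unreachable() {
  git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
}
```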
Interactive Rebase: Suitable when the file was added in recent commits that have not yet been shared. Run git rebase -i, mark the offending commit for editing, and remove the file from that commit.
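In the simplest case, where the large file was added in the most recent commit and that commit has not been pushed, amending the commit is enough and no rebase is needed. A minimal sketch, with drop_from_last_commit being a hypothetical function name:

```shell
#!/bin/sh
# Illustrative helper (name is an assumption): drop a file from the
# most recent, unpushed commit while leaving it in the working tree.
drop_from_last_commit() {
  git rm --cached --quiet -- "$1" &&
    git commit --amend --no-edit --quiet
}
```

After drop_from_last_commit big.iso, the file is untracked again and can be deleted or added to .gitignore. The amend fails if the file was the only change in the commit, since that would leave the commit empty.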
Best Practice Recommendations
Prevention is better than cure. Establish good Git habits:
- Configure comprehensive .gitignore files to exclude unnecessary file types
- Check changes with git status before committing
- For binary files, consider using Git LFS (Large File Storage)
- Perform regular repository maintenance to clean unused objects
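One way to enforce these habits automatically is a pre-commit hook that rejects oversized files before they ever enter history. The sketch below is an assumption-laden example rather than a standard Git feature: the 10 MiB limit, the MAX_BYTES variable, and the function name oversized_staged_files are all illustrative, and paths with unusual characters may need extra handling.

```shell
#!/bin/sh
# Hypothetical pre-commit hook: copy to .git/hooks/pre-commit and make
# it executable. The 10 MiB default limit is an arbitrary assumption.
MAX_BYTES="${MAX_BYTES:-10485760}"

# Print "<size> <path>" for every staged (added or modified) file whose
# staged content is larger than the byte limit passed as $1.
oversized_staged_files() {
  git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
      size=$(git cat-file -s ":$path") || continue
      if [ "$size" -gt "$1" ]; then
        echo "$size $path"
      fi
    done
}

# Run the check only when this file is executed as the hook itself.
case "$0" in
  */pre-commit)
    offenders=$(oversized_staged_files "$MAX_BYTES")
    if [ -n "$offenders" ]; then
      echo "Refusing to commit files larger than $MAX_BYTES bytes:" >&2
      echo "$offenders" >&2
      exit 1
    fi
    ;;
esac
```

Hooks live in each clone and are not distributed by git clone, so every team member would need to install the script locally.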
Conclusion
git filter-repo provides an efficient solution for permanently removing large files from Git history. Through accurate file identification, precise filtering operations, and proper team coordination, repository bloat can be successfully resolved while maintaining clean project history. In practical applications, choose the most appropriate cleanup strategy based on specific project requirements.