Cleaning Large Files from a Git Repository: Using git filter-branch to Permanently Remove Committed Large Files

Nov 21, 2025 · Programming

Keywords: Git cleanup | git filter-branch | large file removal | history rewriting | repository optimization

Abstract: This article provides a comprehensive analysis of large file cleanup issues in Git repositories, focusing on scenarios where users accidentally commit numerous files that continue to occupy .git folder space even after disk deletion. By comparing the differences between git rm and git filter-branch, it delves into the working principles and usage methods of git filter-branch, including the role of --index-filter parameter, the significance of --prune-empty option, and the necessity of force pushing. The article offers complete operational procedures and important considerations to help developers effectively clean large files from Git history and reduce repository size.

Problem Background Analysis

In the Git version control system, developers sometimes accidentally commit large numbers of unnecessary files, such as the case of over 9000 photos mentioned in this article. Even if users delete these files from disk and commit the changes, historical versions of these files remain stored in the .git folder, resulting in a significantly large repository (12GB in this case). When attempting to push changes to a remote repository, the process becomes extremely slow due to the need to transfer all historical data.

Limitations of git rm Command

The user initially tried to delete the photos with git rm public/photos but received the error fatal: pathspec 'public/photos' did not match any files. git rm matches paths against the index rather than the disk. The error means public/photos is no longer tracked in the current index, which is the case once the deletion has been committed, so there is nothing left for the command to remove.

More importantly, even when git rm succeeds, it only removes the files from the index and working tree for future commits; it cannot delete file objects that have already been committed to Git history. For the same reason, adding public/photos/ to the .gitignore file does not solve the problem: .gitignore only affects untracked files and has no effect on files already committed to history.
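The point that git rm leaves history untouched can be checked in a throwaway repository. The following is a minimal sketch, not part of the original walkthrough: the temporary directory, the file name a.jpg, and the demo identity are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway repository in a temporary directory (illustrative only).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a stand-in for an accidentally added photo.
mkdir -p public/photos
printf 'pretend this is a large photo\n' > public/photos/a.jpg
git add public/photos
git commit -q -m "add photos by accident"

# Remember the blob ID of the photo as stored in history.
blob=$(git rev-parse HEAD:public/photos/a.jpg)

# Remove the directory from the index and working tree, then commit.
git rm -r -q public/photos
git commit -q -m "remove photos"

# The path is gone from the latest commit...
tip_listing=$(git ls-tree -r --name-only HEAD)
# ...but the blob is still reachable through the earlier commit.
still_reachable=$(git rev-list --objects --all | grep -c "^$blob" || true)

echo "files at HEAD: '${tip_listing}'"
echo "photo blob still reachable: ${still_reachable}"
```

The blob stays reachable because the first commit still references it, which is why the repository does not shrink.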

git filter-branch Solution

git filter-branch is a powerful tool provided by Git specifically for rewriting Git history. It can traverse all commits and apply specified filters to each commit, thereby enabling batch modification of historical records.

Core Command Analysis

The recommended solution uses the following command:

git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch public/photos' \
  --prune-empty --tag-name-filter cat -- --all

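The parts of this command work as follows:

--force: Allows git filter-branch to start even if backup refs from an earlier run already exist under refs/original/ or a temporary working directory was left behind.

--index-filter: Runs the quoted command against the index of each commit without checking out the commit's files, which makes it much faster than --tree-filter.

git rm -r --cached --ignore-unmatch public/photos: The filter itself. -r removes the directory recursively, --cached removes the entries from the index only, and --ignore-unmatch lets the command succeed on commits that do not contain public/photos, so the rewrite does not abort.

--prune-empty: Drops commits that no longer change anything once the photos are removed.

--tag-name-filter cat: Rewrites tags so that they point to the new commits while keeping their original names.

-- --all: Applies the rewrite to all refs (branches and tags) rather than only the current branch.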

Detailed Operation Steps

The complete cleanup process includes the following key steps:

Step 1: Execute History Rewriting

Run the above git filter-branch command. This process may take considerable time, depending on the repository size and complexity of historical records. The command will traverse all commits, removing all files under the public/photos directory from each commit.
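Before running the rewrite, it can help to confirm which paths actually account for the space. The sketch below lists the largest blobs reachable from any ref; the function name largest_blobs and the default limit of 20 are illustrative choices, not something from the original article.

```shell
# List the largest blobs reachable from any ref, biggest first.
# Output columns: object ID, size in bytes, path.
# (largest_blobs is a hypothetical helper name used for illustration.)
largest_blobs() {
  limit=${1:-20}
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -k2,2nr |
    head -n "$limit"
}
```

Run inside the repository, for example as largest_blobs 10. If the top entries all sit under public/photos, the filter path in the command above is the right target.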

Step 2: Verify Results

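After filtering completes, check the repository carefully to ensure no important files were removed by mistake. Use git log --oneline --graph --all to inspect the rewritten history. Note that du -sh .git will not show a smaller repository yet: the old objects remain on disk until the backup refs and reflog entries are cleared and git gc has run (see Storage Optimization below).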
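One way to confirm that the path is really gone is to ask git log for any commit that still touches it. This is a sketch under assumptions: path_is_gone is a made-up helper name, and it deliberately checks only local branches and tags, because --all would also include the refs/original/ backups and remote-tracking refs that still point at the old history.

```shell
# Succeeds (exit status 0) when no commit reachable from a local branch
# or tag still touches the given path.
# (path_is_gone is a hypothetical helper name used for illustration.)
path_is_gone() {
  [ -z "$(git log --branches --tags --format=%H -- "$1")" ]
}
```

For example: path_is_gone public/photos && echo "public/photos removed from local history".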

Step 3: Update .gitignore

Although photo files have been removed from historical records, to prevent future accidental commits, the .gitignore file should be updated:

echo public/photos >> .gitignore
git add .gitignore && git commit -m "ignore rule for photos"

Step 4: Force Push Changes

Since history has been rewritten, a force push to the remote repository is required:

git push -f origin branch-name

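Because the command above rewrote all branches and tags, every rewritten branch and tag has to be force pushed, not just one. Force pushing overwrites the remote repository's history. Developers whose work is based on the old history should re-clone or reset onto the new history; merging their old branches would bring the removed files back into the repository.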
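When several branches and tags were rewritten, they can be pushed together. The following is a minimal sketch; the function name push_rewritten_history is hypothetical, and origin is assumed as the default remote name, as in the command above.

```shell
# Force push every local branch and every tag to the given remote.
# (push_rewritten_history is a hypothetical helper; the remote name
# defaults to "origin", which is an assumption about the setup.)
push_rewritten_history() {
  remote=${1:-origin}
  git push --force --all "$remote" &&
    git push --force --tags "$remote"
}
```

Some hosting services protect branches against force pushes by default, so the push may be rejected until that protection is lifted for the branch.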

In-depth Technical Principle Analysis

The working principle of git filter-branch is based on Git's object model. Git stores each file's content as blob objects, directory structures as tree objects, and commit information as commit objects. These objects reference each other through SHA-1 hash values, forming a directed acyclic graph (DAG).

When using git filter-branch, Git will:

  1. Traverse all commit objects
  2. Apply specified filters to each commit
  3. If a filter changes a commit's content, Git writes new tree and commit objects; because each commit records its parent's hash, every descendant of a changed commit is rewritten as well
  4. Update all relevant references (branches, tags) to point to the new commit chain
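The steps above can be observed in a small throwaway repository. This sketch is illustrative only: the file names and the demo identity are invented, and FILTER_BRANCH_SQUELCH_WARNING=1 is set merely to skip the warning and pause that recent Git versions print before filter-branch starts.

```shell
#!/bin/sh
set -eu

# Throwaway repository with two commits (illustrative only).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p public/photos
echo photo > public/photos/a.jpg
echo code > app.txt
git add .
git commit -q -m "first commit, includes a photo"
echo more >> app.txt
git commit -q -am "second commit, touches only app.txt"

before=$(git rev-parse HEAD)
branch=$(git symbolic-ref HEAD)   # full ref name of the current branch

FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch public/photos' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

after=$(git rev-parse HEAD)
# filter-branch keeps the old tip under refs/original/<full ref name>.
backup=$(git rev-parse "refs/original/$branch")

echo "tip before: $before"
echo "tip after:  $after"
echo "backup ref: $backup"
```

The tip changes even though the second commit never touched the photos, because its tree and its parent both changed. The backup ref still points at the old tip, which is why the old objects are not freed immediately.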

The advantage of --index-filter lies in its direct operation on the Git index, avoiding file system I/O operations, making it more efficient than other filters like --tree-filter.

Considerations and Best Practices

Backup Importance: Before executing git filter-branch, it's strongly recommended to create a complete backup of the repository. Use git clone --mirror to create a mirror repository as backup.
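A mirror backup can be as simple as the sketch below. The function name backup_repo and both path arguments are placeholders for illustration.

```shell
# Create a bare mirror of the repository at $1 in the directory $2.
# A mirror clone copies all refs (branches, tags and others), not just
# the current branch.
# (backup_repo is a hypothetical helper name used for illustration.)
backup_repo() {
  git clone --quiet --mirror "$1" "$2"
}
```

For example: backup_repo /path/to/repo /path/to/repo-backup.git, run before any history rewriting starts.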

Team Collaboration Impact: Rewriting history affects all developers using the repository. Ensure team members are aware of this change and coordinate their work before pushing.

Alternative Solution Consideration: git filter-branch can be very slow on large repositories, and the Git documentation itself now recommends git filter-repo as a replacement. Note that git filter-repo is not bundled with Git: it is a separate tool that must be installed on its own, and it requires a reasonably recent Git (at least 2.22). It offers better performance and ease of use.
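For comparison, the same removal with git filter-repo might look like the sketch below. It assumes the tool is installed as git-filter-repo on the PATH; the wrapper function name is hypothetical. By default git filter-repo refuses to run on anything that does not look like a fresh clone, so it is usually run in a new clone of the repository.

```shell
# Remove a path from all history using git filter-repo, if available.
# (remove_path_with_filter_repo is a hypothetical helper; the PATH check
# assumes the tool was installed as an executable named git-filter-repo.)
remove_path_with_filter_repo() {
  if ! command -v git-filter-repo >/dev/null 2>&1; then
    echo "git filter-repo is not installed" >&2
    return 127
  fi
  # --path selects the path; --invert-paths keeps everything except it.
  git filter-repo --path "$1" --invert-paths
}
```

For example, inside a fresh clone: remove_path_with_filter_repo public/photos.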

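Storage Optimization: git filter-branch keeps backup refs to the original commits under refs/original/, so the old objects stay reachable after filtering. Once the result has been verified, delete those refs (for example with git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin), then run git reflog expire --expire=now --all and git gc --prune=now --aggressive to remove the unreferenced objects and optimize storage.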
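The cleanup can be grouped into one helper, sketched here under the assumption that the rewritten history has already been checked and a backup exists. The function name shrink_after_rewrite is illustrative.

```shell
# Release the space held by the pre-rewrite history.
# Only run this after verifying the rewritten history: it discards the
# backup refs that would allow a rollback.
# (shrink_after_rewrite is a hypothetical helper name.)
shrink_after_rewrite() {
  # Drop the backup refs that filter-branch saved under refs/original/.
  git for-each-ref --format='delete %(refname)' refs/original |
    git update-ref --stdin
  # Expire reflog entries that still point at the old commits.
  git reflog expire --expire=now --all
  # Repack and delete objects that are no longer reachable.
  git gc --quiet --prune=now --aggressive
}
```

Comparing du -sh .git before and after this step gives a meaningful picture of the space actually recovered.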

Error Handling and Debugging

If problems occur during execution, debug using the following methods:
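The original list of methods is not available here. As one possible starting point, and purely as a sketch, the helper below reports what a filter-branch run has left behind in the current repository; filter_branch_status is a made-up name.

```shell
# Report leftovers from a git filter-branch run in the current repository.
# (filter_branch_status is a hypothetical helper name.)
filter_branch_status() {
  # filter-branch works in a temporary .git-rewrite directory, which may
  # be left behind if a run was interrupted.
  if [ -d .git-rewrite ]; then
    echo "leftover: .git-rewrite"
  fi
  # Backup refs pointing at the pre-rewrite commits.
  git for-each-ref --format='backup: %(refname)' refs/original
}
```

While a backup ref exists, a branch can be rolled back to its pre-rewrite state with git reset --hard refs/original/refs/heads/branch-name, where branch-name is the placeholder used earlier.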

Through this detailed analysis, developers can deeply understand the technical principles of Git large file cleanup, master the correct usage of git filter-branch, and effectively solve repository size issues.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.