Keywords: Git LFS | GitHub File Size Limit | History Rewriting
Abstract: This article analyzes why large CSV files still trigger GitHub's 100MB file size limit even after Git LFS has been configured. It explains how Git LFS works and why the git lfs track command alone cannot handle large files that are already committed to history. Three solutions are detailed: the git lfs migrate command, the git filter-branch tool, and the BFG Repo-Cleaner tool, with BFG recommended as best practice for its efficiency and safety. Each method includes step-by-step instructions and scenario analysis to help developers permanently solve large file version control problems.
Problem Background and Phenomenon Analysis
When using Git for version control, developers frequently find that large files (such as CSV files exceeding 100MB) cannot be pushed to GitHub. GitHub imposes a 100MB limit per file to prevent repositories from becoming excessively large and impacting performance. To address this, the Git Large File Storage (LFS) extension, which is installed separately from Git itself, lets developers store large files on dedicated LFS servers while keeping only pointers to these files in the Git repository.
Git LFS Working Mechanism and Configuration Misconceptions
The core mechanism of Git LFS involves adding filter rules to the .gitattributes file via the git lfs track command. For example, after executing git lfs track "*.csv", the .gitattributes file will contain an entry like *.csv filter=lfs diff=lfs merge=lfs -text. This configuration tells Git that all CSV files should be processed through the LFS filter, meaning newly added CSV files will be automatically converted to LFS pointers.
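One way to confirm that a tracking rule is in effect is to ask Git which filter it would apply to a given path. The sketch below writes the same rule that git lfs track "*.csv" produces into a throwaway repository and queries it with git check-attr. The temporary directory and the paths data/sales.csv and notes.txt are illustrative assumptions, not names from the original problem.

```shell
set -eu

# Throwaway repository so the demo never touches a real project.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

# The rule that `git lfs track "*.csv"` appends to .gitattributes.
printf '%s\n' '*.csv filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask Git which filter applies to a hypothetical CSV path and a non-CSV path.
# The paths do not need to exist for check-attr to evaluate the rules.
csv_filter=$(git check-attr filter -- data/sales.csv)
txt_filter=$(git check-attr filter -- notes.txt)

echo "$csv_filter"
echo "$txt_filter"
```

A path that reports filter: lfs will be routed through LFS the next time it is added; a path that reports unspecified will be stored as an ordinary Git blob.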
However, a common misconception exists: many developers believe that once LFS rules are configured in .gitattributes, all existing CSV files will automatically convert to LFS format. In reality, the LFS filter runs only when a file is added or changed after the rule is in place. If large CSV files have already been committed to Git history, they remain in the Git object database as full-size blobs and are not converted to LFS pointers. This explains the typical symptom: .gitattributes contains a correct LFS rule, yet git lfs ls-files returns no output, because the existing CSV files are not managed by LFS.
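To see which oversized files are still stored as ordinary blobs, the object database can be inspected directly. The following is a minimal sketch using only core Git commands; the helper name list_large_blobs, the demo file names, and the 1024-byte threshold are assumptions chosen for illustration. It also shows that deleting a file in a later commit does not remove it from history.

```shell
set -eu

# List blobs reachable from any ref that are at least $1 bytes, largest first.
# Output: size in bytes, a tab, then the path recorded in history.
list_large_blobs() {
  min_bytes=$1
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '$1 == "blob" && $2 >= min {
      size = $2
      sub(/^blob [0-9]+ /, "")   # keep the full path, even if it has spaces
      print size "\t" $0
    }' |
    sort -rn
}

# Demo in a throwaway repository with a made-up identity.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
head -c 2048 /dev/zero > big.csv      # stand-in for an oversized file
printf 'a,b\n' > small.csv
git add big.csv small.csv
git commit -q -m "add csv files"
git rm -q big.csv
git commit -q -m "remove big.csv"

# big.csv is gone from the working tree but still present in history.
large=$(list_large_blobs 1024)
echo "$large"
```

In a real repository the threshold would be closer to GitHub's limit, for example list_large_blobs 100000000.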
Root Cause and Solution Comparison
The root cause is that Git LFS configuration cannot automatically rewrite history. To resolve this, Git history must be explicitly rewritten to remove large files from the Git object database and replace them with LFS pointers. Below is a detailed analysis of three primary solutions:
Solution 1: Using git lfs migrate Command (Git LFS 2.2.0+)
Starting from Git LFS version 2.2.0, the official git lfs migrate command was introduced specifically for migrating files in existing repositories to LFS management. This command rewrites commit history, converting files matching specified patterns to LFS pointers. Basic usage:
git lfs migrate import --include="*.csv"
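After executing this command, CSV files in the rewritten commits are replaced with LFS pointers, and .gitattributes is updated with the matching tracking rule. By default the command rewrites only commits on the current branch that do not yet exist on a remote, so a normal git push can follow. Adding the --everything option extends the rewrite to all local and remote refs, in which case a force push is required. The advantages of this method are official support and relative simplicity, but it requires Git LFS 2.2.0 or higher.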
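After a migration it is worth checking that a path in a commit really holds a pointer rather than the original content. An LFS pointer is a small text file whose first line is version https://git-lfs.github.com/spec/v1. The sketch below tests for that line; the function name is_lfs_pointer and the sample pointer (with a made-up oid and size) are assumptions for illustration.

```shell
set -eu

# Succeeds if stdin looks like a Git LFS pointer, judged by its first line.
is_lfs_pointer() {
  first_line=$(head -n 1)
  [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]
}

# Hand-written pointer with a made-up oid and size, for illustration only.
sample_pointer='version https://git-lfs.github.com/spec/v1
oid sha256:0000000000000000000000000000000000000000000000000000000000000000
size 123456789'

if printf '%s\n' "$sample_pointer" | is_lfs_pointer; then
  pointer_result=pointer
else
  pointer_result=content
fi

# Ordinary CSV content should not be mistaken for a pointer.
if printf 'id,value\n1,2\n' | is_lfs_pointer; then
  csv_result=pointer
else
  csv_result=content
fi

echo "$pointer_result $csv_result"
```

Against a real repository, the stored blob can be piped in without any filters applied, for example git cat-file -p HEAD:path/to/your/file | is_lfs_pointer.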
Solution 2: Using git filter-branch Command
git filter-branch is Git's built-in history rewriting tool, powerful but complex to use. It can delete or modify files in history. For LFS migration, one approach is to first delete large files, then re-add them as LFS files. Example command:
git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
The issue with this method is that it only deletes files without automatically converting them to LFS format. Developers must manually re-add files and ensure they are correctly tracked by LFS. Additionally, git filter-branch is slow on large repositories and error-prone, making it a non-preferred solution.
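The deletion step can be rehearsed safely in a throwaway repository before touching real history. The sketch below is an assumption-laden variant, not the command shown above: it uses --index-filter with git rm --cached instead of --tree-filter, which avoids checking out every commit and is usually faster. The file names and the demo identity are made up, and FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer Git versions print.

```shell
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

printf 'id,value\n1,2\n' > big.csv    # stand-in for an oversized file
printf 'notes\n' > README
git add big.csv README
git commit -q -m "add data and readme"
printf 'more notes\n' >> README
git add README
git commit -q -m "update readme"

# Rewrite every commit reachable from HEAD, dropping big.csv from the index.
# --ignore-unmatch keeps the filter from failing on commits without the file.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm -q --cached --ignore-unmatch big.csv' HEAD >/dev/null

# Count commits on the rewritten branch that still touch big.csv.
remaining=$(git log --format=%H -- big.csv | wc -l | tr -d ' ')
tracked=$(git ls-files)
echo "commits still touching big.csv: $remaining"
echo "tracked files: $tracked"
```

After such a rewrite the file is absent from every commit on the branch, so it would still need to be re-added under an LFS tracking rule in a new commit.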
Solution 3: Using BFG Repo-Cleaner Tool (Recommended)
BFG Repo-Cleaner is a third-party tool specifically designed for cleaning Git repositories, faster and safer than git filter-branch. Starting from version 1.12.5, BFG supports directly converting historical files to Git LFS format. Usage:
java -jar bfg-1.12.5.jar --convert-to-git-lfs '*.csv' --no-blob-protection
This command scans the entire Git history, replacing all files matching the *.csv pattern with LFS pointers. BFG's advantages include:
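- Efficiency: JVM-based implementation, significantly faster than git filter-branch on large repositories.
- Safety: By default, files in the latest commit (HEAD) are protected from modification, which prevents accidental deletion. The --no-blob-protection flag in the command above turns this protection off so that CSV files in the latest commit are converted as well.
- LFS-Specific Design: The --convert-to-git-lfs option handles LFS conversion directly, requiring no additional steps.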
Because BFG is optimized for history rewriting and directly supports LFS conversion, it is widely considered best practice in the community for solving such issues.
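A history rewrite leaves the old objects in the local repository until the reflog entries that reference them expire. BFG's own documentation suggests following a run with git reflog expire and git gc to drop them. The sketch below demonstrates the effect in a throwaway repository, using git commit --amend as a stand-in for a real rewrite; the file names and demo identity are assumptions.

```shell
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

printf 'keep\n' > README
printf 'id,value\n1,2\n' > big.csv    # stand-in for an oversized file
git add README big.csv
git commit -q -m "add readme and data"
blob=$(git rev-parse HEAD:big.csv)

# Stand-in for a history rewrite: replace the commit with one lacking big.csv.
git rm -q big.csv
git commit -q --amend -m "add readme"

# The old blob is no longer in the branch history, but the reflog keeps it alive.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Expire reflog entries and prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc --quiet --prune=now --aggressive

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

Only after this cleanup does the local repository actually shrink; the remote is unaffected until the rewritten history is pushed.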
Implementation Steps and Considerations
Regardless of the chosen solution, the following key steps must be observed before implementation:
- Backup Repository: History rewriting is irreversible; always create a complete repository backup first.
- Clean Working Directory: Ensure no uncommitted changes exist in the working directory to avoid conflicts.
- Verify LFS Configuration: Confirm that the .gitattributes file has correct LFS tracking rules.
- Execute Migration Command: Run the appropriate migration tool based on the selected solution.
- Force Push: After history rewriting, use git push --force to push changes to the remote repository, as history has changed.
- Notify Collaborators: Since history is rewritten, all collaborators need to re-clone the repository or reset local branches.
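The first two steps in the checklist above lend themselves to a small pre-flight script. The following is a minimal sketch under stated assumptions: the helper names require_clean_tree and backup_repo are hypothetical, the backup is taken with git clone --mirror, and everything runs in a temporary directory rather than a real project.

```shell
set -eu

export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Succeeds only if there are no staged, unstaged, or untracked changes.
require_clean_tree() {
  [ -z "$(git status --porcelain)" ]
}

# Create a mirror backup (all refs) of the repository at $1 in directory $2.
backup_repo() {
  git clone -q --mirror "$1" "$2"
}

# Demo repository with a single commit.
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
printf 'id,value\n' > data.csv
git add data.csv
git commit -q -m "add data"

if require_clean_tree; then clean_state=clean; else clean_state=dirty; fi

backup_repo "$work/repo" "$work/repo-backup.git"
original_head=$(git rev-parse HEAD)
backup_head=$(git --git-dir="$work/repo-backup.git" rev-parse HEAD)

# An uncommitted edit should make the check fail.
printf '1,2\n' >> data.csv
if require_clean_tree; then dirty_state=clean; else dirty_state=dirty; fi

echo "initial: $clean_state, after edit: $dirty_state"
```

For the force push itself, git push --force-with-lease is a more cautious alternative to --force, since it refuses to overwrite remote commits that the local clone has not seen.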
Summary and Best Practice Recommendations
Git LFS is a powerful tool for managing large files, but proper configuration requires understanding its workings. The key insight is that LFS tracking rules only apply to new files; historical files already committed must be migrated via history rewriting. For most scenarios, the BFG Repo-Cleaner tool is recommended due to its safe and efficient LFS conversion. If using newer Git LFS versions (2.2.0+), git lfs migrate is also a good option. Avoid git filter-branch unless specifically needed, as its complexity and performance issues may introduce additional risks.
Finally, prevention is better than cure: planning large file management strategies early in a project and configuring LFS promptly can avoid subsequent history rewriting operations. For new projects, it is advisable to execute git lfs track before the first commit of large files, ensuring all large files are correctly managed from the outset.