Ebrahim Ramadan

Managing Large Files with Git LFS

Git LFS (Large File Storage) hell

I recently faced a first-time-for-everything challenge while working on my portfolio (this site). I had some quite large .gif files that I decided to manage using Git LFS (Large File Storage). However, things didn't go as smoothly as I anticipated. Here's how I went through it and what I learned along the way.

"--distributed-even-if-your-workflow-isnt"

Git is a powerful version control system with many benefits, and it can store and manage large files. However, storing large files directly in Git can significantly slow down operations like pulling, pushing, and cloning the repository. This can frustrate collaborators who rely on these operations to work efficiently.

When a large file is added to a Git repository, every collaborator on the repository must download the entire file, including all versions of it. This process can be time-consuming, especially for collaborators with slower internet connections. Additionally, storing large files on Git can result in a large repository size, making collaboration difficult.

That is when Git LFS comes into play. It is a Git extension that stores large files on a separate LFS server and keeps only a small text pointer in the regular repository, pointing to the actual content on the remote server. This means collaborators download only the large files they actually need (typically those in the commit they check out), which reduces the size of the repository and improves collaboration.
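To make the "text pointer" idea concrete, here is a minimal sketch that prints the kind of pointer text Git LFS commits in place of a large file, following the published pointer format (a version line, a SHA-256 object ID, and a byte size). The helper name make_pointer is hypothetical and only for illustration; Git LFS generates these files for you.

```shell
#!/usr/bin/env bash
# Sketch: print the pointer text that stands in for a file stored in LFS.
# make_pointer is a hypothetical helper, not a Git LFS command.
make_pointer() {
  local file="$1" oid size
  # The SHA-256 of the real content identifies the object on the LFS server.
  if command -v sha256sum >/dev/null 2>&1; then
    oid=$(sha256sum "$file" | awk '{print $1}')
  else
    oid=$(shasum -a 256 "$file" | awk '{print $1}')   # macOS fallback
  fi
  # Size of the real content in bytes.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Demo on a throwaway file.
demo_file=$(mktemp)
printf 'hello' > "$demo_file"
make_pointer "$demo_file"
rm -f "$demo_file"
```

The three printed lines are all that the regular repository stores for the file, which is why clones stay small even when the real file is large.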

Installing

Refer to the Git LFS website for details. Note that it requires Git ≥ 1.8.2.
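If you are not sure whether your Git is new enough, a quick check along these lines can help. This is only a sketch: version_at_least is a hypothetical helper, and it assumes your sort supports -V (version sort).

```shell
#!/usr/bin/env bash
# version_at_least is a hypothetical helper: succeeds when $1 >= $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
  # After sorting ascending, the first line is the smaller version.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# `git --version` prints e.g. "git version 2.39.3"; the third field is the number.
installed=$(git --version | awk '{print $3}')
if version_at_least "$installed" 1.8.2; then
  echo "Git $installed is new enough for Git LFS"
else
  echo "Git $installed is older than 1.8.2; upgrade first"
fi
```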

Windows

Download the installer from https://git-lfs.github.com/ and run it, then:
>_ git lfs install

macOS

>_ brew install git-lfs
>_ git lfs install

Linux

>_ sudo apt-get install git-lfs
>_ git lfs install
This will print output like:
Updated Git Hooks
Git LFS initialized

Tracking Files

To track .gif files in my repo, I ran
>_ git lfs track "*.gif"
>_ git add .gitattributes
The first command tells Git LFS to track every .gif file in the repo. It also creates (or updates) the .gitattributes file in the directory where you run it, the repo root in my case, with a line like
*.gif filter=lfs diff=lfs merge=lfs -text
The second command stages that file so the rule is committed along with the repo. .gitattributes is the Git mechanism that binds special behaviors to file patterns, and Git LFS uses it to attach its filters to the tracked patterns. After that you can commit and push as usual:
>_ git add .
>_ git commit -m "gif files to lfs"
>_ git push origin main
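To double-check that a .gitattributes rule really applies to a given path, plain Git can report it with git check-attr. The sketch below uses a throwaway repository and made-up file names (demo.gif, notes.txt), so it does not touch a real project and does not need Git LFS installed.

```shell
#!/usr/bin/env bash
# Throwaway repository so the check does not touch a real project.
attr_repo=$(mktemp -d)
git init -q "$attr_repo"

# The same rule that `git lfs track "*.gif"` writes.
printf '*.gif filter=lfs diff=lfs merge=lfs -text\n' > "$attr_repo/.gitattributes"

# Ask Git which filter applies to each path (the files need not exist).
gif_attr=$(git -C "$attr_repo" check-attr filter -- demo.gif)
txt_attr=$(git -C "$attr_repo" check-attr filter -- notes.txt)
echo "$gif_attr"   # demo.gif: filter: lfs
echo "$txt_attr"   # notes.txt: filter: unspecified
```

In a real repo you would run git check-attr filter -- path/to/file.gif directly; anything that reports "lfs" will be stored as a pointer on commit.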
Now the .gif content no longer lives in the regular repo; what is committed is just a pointer. When someone clones or pulls, Git LFS tries to download the real content as well, but that does not happen in every environment (for example, where Git LFS is not installed). There are a few ways to ensure the LFS content is retrieved and available:

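  1. Before deploying, you can run git lfs fetch --all to download all LFS objects, followed by git lfs checkout to replace the pointers in the working tree (git lfs pull does both for the current ref).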

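  2. On-demand fetching: Some hosting platforms can fetch LFS content on demand when it is requested. Check your host's documentation first, because not all of them do; GitHub Pages, for example, does not support Git LFS.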

  3. Custom server logic: You could implement server-side logic to fetch LFS content when requested.
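Whichever option you pick, it helps to verify that the files on disk are real content and not leftover pointers. A minimal sketch, assuming the standard pointer format whose first line names the LFS spec; is_lfs_pointer and report_pointers are hypothetical helper names:

```shell
#!/usr/bin/env bash
# is_lfs_pointer is a hypothetical helper: succeeds when the file is still
# an LFS pointer, i.e. the real content was never downloaded.
is_lfs_pointer() {
  local prefix='version https://git-lfs.github.com/spec/v1'
  # Compare only the first bytes; strip NULs so binary files do not cause warnings.
  [ "$(head -c "${#prefix}" "$1" 2>/dev/null | tr -d '\0')" = "$prefix" ]
}

# report_pointers prints every .gif under a directory that is still a pointer.
report_pointers() {
  find "$1" -type f -name '*.gif' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "still a pointer: $f"
    fi
  done
}

report_pointers .
```

Running this in a build or deploy step gives an early signal that the LFS content was not fetched, instead of discovering broken images on the live site.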

Untracking Files

I could not get the content served in either development or production, so I decided to untrack the files and compress them to under 50 MB (the size at which GitHub starts warning about a file; it blocks files over 100 MB). The first step:
>_ git lfs untrack "*.gif"
Now you have to pull the file contents from the LFS remote server to your local machine by running:

>_ git lfs pull
This command downloads the actual file content for any LFS-tracked files referenced in your repository, so each file is present in your working directory as it would be for any other regular file in the repository.
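Since the goal is to get every file under the 50 MB mark, a quick scan like the following can list the files that are still too big. This is a sketch: list_large_files is a hypothetical helper, and the threshold is passed in KiB.

```shell
#!/usr/bin/env bash
# list_large_files is a hypothetical helper: prints files under DIR that are
# larger than LIMIT_KB kibibytes, skipping Git's own .git directory.
list_large_files() {
  local dir="$1" limit_kb="$2"
  find "$dir" -type f ! -path '*/.git/*' -size +"${limit_kb}"k
}

# 50 MB expressed in KiB (50 * 1024).
list_large_files . 51200
```

An empty result means nothing in the working directory exceeds the threshold.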

Problems I confronted

Source Code
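To make sure the files are fully removed from LFS handling, remove them from Git's index (the staging area) and then stage them again with git add so they are committed as regular files. The following command only touches the index and leaves the files in your working directory: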
>_ git rm --cached "*.gif"
Confirm that the files are no longer tracked by Git LFS and that the actual file content is present in your local working directory. You can list the files LFS still tracks with:

>_ git lfs ls-files
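If it is unclear what git rm --cached does to the files themselves, the sketch below walks through it in a throwaway repository using plain Git only (no LFS needed). The file name demo.gif and the commit identity are made up for the demo.

```shell
#!/usr/bin/env bash
# Throwaway repository with one committed .gif.
rm_repo=$(mktemp -d)
git init -q "$rm_repo"
cd "$rm_repo" || exit 1
printf 'GIF89a' > demo.gif
git add demo.gif
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add gif"

# Drop the file from the index only; the working copy is left alone.
git rm -q --cached "*.gif"
in_index=$(git ls-files)   # empty: the file is no longer in the index
on_disk=no
[ -f demo.gif ] && on_disk=yes

# Stage it again; with no LFS rule in .gitattributes it is a regular file.
git add demo.gif
restaged=$(git ls-files)
echo "index after rm --cached: '${in_index}', on disk: ${on_disk}, re-staged: ${restaged}"
```

The same pattern applies after git lfs untrack: once the LFS rule is gone from .gitattributes, re-adding the files stores their real content in the regular repository.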

Machine Learning reproducibility crisis

"The so-called crisis is because of the difficulty in replicating the work of co-workers or fellow scientists, threatening their ability to build on each others work or to share it with clients or to deploy production services. Since machine learning, and other forms of artificial intelligence software, are so widely used across both academic and corporate research, replicability or reproducibility is a critical problem."

David Herron

Posted on Jun 15, 2019

Read the full article by David, "Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis", to see how machine learning projects use Git LFS for models, datasets, and more. It is really helpful for ML devs.