Is LFS store garbage collected? #3380

Closed
opened 2025-11-02 05:10:46 -06:00 by GiteaMirror · 4 comments

Originally created by @yacoob on GitHub (May 25, 2019).

  • Gitea version (or commit ref): 1.8.1
  • Git version: 2.21.0
  • Operating system: Linux
  • Database (use [x]):
    • [ ] PostgreSQL
    • [ ] MySQL
    • [ ] MSSQL
    • [x] SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • [ ] Yes (provide example URL)
    • [ ] No
    • [x] Not relevant
  • Log gist:

Description

I'm trying to understand how Gitea's LFS store gets garbage collected. I can see some references to LFS object removal in the code, but I can't find a definite answer as to when exactly unreferenced blobs are removed from the LFS directory. As a test, I created a repository on Gitea, pushed some LFS objects to it, then removed the branch referencing them and forced a gc in the admin panel. The objects under git/lfs are still present.

Does this kind of gc happen at all? Or does it happen only after the whole repository is removed? If there's no automatic gc, please treat this bug as a feature request. If there is, consider this a documentation request.

Thanks!

GiteaMirror added the type/proposal and type/feature labels 2025-11-02 05:10:46 -06:00

@ghost commented on GitHub (May 26, 2019):

You have to delete the repository to remove the LFS objects from disk.


@yacoob commented on GitHub (Jun 1, 2019):

That's the only GC that happens for LFS? Okay - can we treat this topic as a request to implement something more granular, that would run during git gc?

Thanks!


@zeripath commented on GitHub (Jun 1, 2019):

There is one big store - files are stored by oid, not by repository. The repository information is kept separately. We don't have the putative filename - LFS never gives it to us. We never get the SHA of the pointer file that points to the oid, and although you can guess what the bland pointer should be, the spec allows for extensions, so you won't be able to guess them all. You can't even simply use a .gitattributes file stored within the repository - as it might not be stored - and they might not call the filter lfs!

Therefore in terms of a GC, what you would have to do is:

  • Get a list of the oids that are stored in the LFS for the repo. (Simple select on the database)

  • Walk the git repository, find all blobs <=1k, check if they look like a pointer file, if so get the oid, check if it's stored in the LFS and is associated with the repo.

  • Give a diff of the two (possibly three) states.

  • For any LFS objects that are unreachable from the repository, suggest deletion? I guess, but you don't know why they're there - you're assuming LFS is only being used by git-lfs. This might be useful to know about, and then you could prune these, but this can't be done automatically.

  • What about potential oids that are missing - either because they're not attached to the repo or they're not in the LFS? Well are you sure that they're actually pointers rather than just files that look like pointers? (You cannot tell the difference - you can't assume .gitattributes is present and you can't really assume that they're only placed there by filter.lfs.* commands either.)

  • Do you reveal that you have a file matching an oid but one that is not attached to the repo? It could be a security issue to do so - although if sha256 is a secure hash the only way you should have the hash is if you have the object.
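As a rough illustration of the steps above, here is a minimal Python sketch, not Gitea's actual code. The function names `parse_pointer` and `classify` are hypothetical, the set of oids is assumed to come from the database query mentioned in the first step, and the walk over the repository's blobs is left to the caller. The parser only recognises the plain `version` / `oid sha256:` / `size` lines and skips any extension lines, so it has exactly the limitation described above: it can tell that a blob looks like a pointer, not that it was meant to be one.

```python
import re

# Assumption: candidate pointer blobs are small, matching the "<=1k" rule above.
MAX_POINTER_SIZE = 1024
_OID_RE = re.compile(r"^oid sha256:([0-9a-f]{64})$")
_SIZE_RE = re.compile(r"^size ([0-9]+)$")


def parse_pointer(blob: bytes):
    """Return (oid, size) if the blob looks like an LFS pointer, else None."""
    if len(blob) > MAX_POINTER_SIZE:
        return None
    try:
        text = blob.decode("utf-8")
    except UnicodeDecodeError:
        return None
    lines = text.splitlines()
    if not lines or not lines[0].startswith("version "):
        return None
    oid = size = None
    for line in lines[1:]:
        # Unknown keys (for example extension lines) are skipped, not rejected.
        m = _OID_RE.match(line)
        if m:
            oid = m.group(1)
            continue
        m = _SIZE_RE.match(line)
        if m:
            size = int(m.group(1))
    if oid is None or size is None:
        return None
    return oid, size


def classify(db_oids, blobs):
    """Compare oids the database ties to a repo with oids found in its blobs.

    db_oids: set of oids associated with the repo (hypothetical DB result).
    blobs:   iterable of raw blob contents from walking the repository.
    """
    referenced = set()
    for blob in blobs:
        parsed = parse_pointer(blob)
        if parsed is not None:
            referenced.add(parsed[0])
    return {
        "in_use": db_oids & referenced,
        # Candidates to report to an admin, not to delete automatically.
        "unreferenced": db_oids - referenced,
        # Pointer-like blobs with no matching association for this repo.
        "missing": referenced - db_oids,
    }
```

The result is only a report: the "unreferenced" set may contain objects that were put in the store for reasons the walk cannot see, and the "missing" set may contain ordinary files that merely look like pointers.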

In #7082 I decided that the only sensible thing to do when merging a PR from one repository to another was to check if a blob could be a pointer file, check if its oid is in the LFS and if so associate it with the base repository. (I probably should only add it to the base repository if that oid is actually associated with the head repository for that possible security reason above.)

When we display files in the UI we tend to just check if the blob looks like a pointer and then if the oid is associated with the repository assume it's meant to be an LFS object.

It's only during uploads to repositories that we actually pay attention to .gitattributes as that's the only possible hint we have that an object should be in the LFS.

It's not simple at all. The spec for LFS is so extensible that you just don't know why an object has been placed in the LFS.

There is one final thing that might be useful - find all things in the store that are not associated with a repo - then you have to walk all the repos and try to find out if they could match a repo.
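To make the "one big store" model and this last idea concrete, here is a toy Python sketch. It is an illustration under assumptions, not Gitea's implementation: `ToyLFSStore` and its methods are hypothetical names, a dict stands in for the files on disk, and another dict stands in for the database table that associates oids with repositories.

```python
from collections import defaultdict


class ToyLFSStore:
    """Toy model: objects keyed by oid, repo associations kept separately."""

    def __init__(self):
        self.objects = {}                     # oid -> content (stands in for disk)
        self.repos_by_oid = defaultdict(set)  # oid -> repo ids (stands in for DB)

    def put(self, oid, content, repo_id=None):
        # The same oid is stored once, however many repos reference it.
        self.objects[oid] = content
        if repo_id is not None:
            self.repos_by_oid[oid].add(repo_id)

    def delete_repo(self, repo_id):
        # Deleting a repo only drops its associations in this model.
        for repos in self.repos_by_oid.values():
            repos.discard(repo_id)

    def orphans(self):
        # Objects in the store that no repo is associated with.
        return {oid for oid in self.objects if not self.repos_by_oid.get(oid)}
```

In this model `orphans()` gives the first half of the idea: everything in the store with no repository attached. The second half, walking every repository to see whether an orphan could still match a pointer-like blob somewhere, is the expensive part and is not shown.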


@stale[bot] commented on GitHub (Jul 31, 2019):

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 weeks. Thank you for your contributions.


Reference: github-starred/gitea#3380