Reject push if refs contains LFS object unknown to server #13064

Open
opened 2025-11-02 10:29:05 -06:00 by GiteaMirror · 5 comments
Owner

Originally created by @tdesveaux on GitHub (May 29, 2024).

Feature Description

We had an issue where a client hooks manager made the lfs pre-push not work correctly, thus resulting in commits being push without their LFS objects.

While reproducing and testing things around, I encountered the following fail on a push to a Github repository:

remote: error: GH008: Your push referenced at least 1 unknown Git LFS object:
remote:     cd0e6edf7a48b6b12ca737319acdc0ec1e4ff8918a264499d7e60bfc558c4341
remote: Try to push them with 'git lfs push --all'.
To https://github.com/tdesveaux/test_private.git
 ! [remote rejected]       HEAD -> test_lfs (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/tdesveaux/test_private.git'

It would be great if Gitea did the same sanity check preventing users from pushing commits without their LFS objects.

Screenshots

No response

Originally created by @tdesveaux on GitHub (May 29, 2024). ### Feature Description We had an issue where a client hooks manager made the `lfs pre-push` not work correctly, thus resulting in commits being push without their LFS objects. While reproducing and testing things around, I encountered the following fail on a push to a Github repository: ```bash remote: error: GH008: Your push referenced at least 1 unknown Git LFS object: remote: cd0e6edf7a48b6b12ca737319acdc0ec1e4ff8918a264499d7e60bfc558c4341 remote: Try to push them with 'git lfs push --all'. To https://github.com/tdesveaux/test_private.git ! [remote rejected] HEAD -> test_lfs (pre-receive hook declined) error: failed to push some refs to 'https://github.com/tdesveaux/test_private.git' ``` It would be great if Gitea did the same sanity check preventing users from pushing commits without their LFS objects. ### Screenshots _No response_
GiteaMirror added the type/proposaltopic/lfs labels 2025-11-02 10:29:05 -06:00
Author
Owner

@tdesveaux commented on GitHub (May 30, 2024):

Here is a few information:

This can be done in the pre-receive hook server-side.
In pre-receive, Git provide on Stdin a line per ref pushed with each line in the formatprev_rev new_rev ref.
From this, a simple git log --format=%H new_rev ^prev_rev list the new revs that are being pushed.
From this list of revs, running git lfs fsck [revs ...] should be fine as a sanity check.
An alternative to fsck is ls-files, which take only one ref at a time, but would allow to provide better feedback about which rev and which files have missing lfs objects.
git lfs ls-files -l rev will output one line per LFS ref in the commit in the format lfs_hash - filepath

Here is a toy python script I used to validate this:

#!/usr/bin/python3
import sys

import subprocess

for line in sys.stdin.readlines():
    prev_rev, new_rev, ref = line.strip().split()
    output = subprocess.check_output(['git', 'log', "--format=%H", new_rev, f"^{prev_rev}"], text=True)
    revs = [
        r.strip() for r in output.splitlines(False)
        if r.strip()
    ]
    print(revs)
    if revs:
        subprocess.check_call(['git', 'lfs', 'fsck', *revs])
@tdesveaux commented on GitHub (May 30, 2024): Here is a few information: This can be done in the `pre-receive` hook server-side. In pre-receive, Git provide on Stdin a line per ref pushed with each line in the format`prev_rev new_rev ref`. From this, a simple `git log --format=%H new_rev ^prev_rev` list the new revs that are being pushed. From this list of revs, running `git lfs fsck [revs ...]` should be fine as a sanity check. An alternative to `fsck` is `ls-files`, which take only one ref at a time, but would allow to provide better feedback about which rev and which files have missing lfs objects. `git lfs ls-files -l rev` will output one line per LFS ref in the commit in the format `lfs_hash - filepath` Here is a toy python script I used to validate this: ```python #!/usr/bin/python3 import sys import subprocess for line in sys.stdin.readlines(): prev_rev, new_rev, ref = line.strip().split() output = subprocess.check_output(['git', 'log', "--format=%H", new_rev, f"^{prev_rev}"], text=True) revs = [ r.strip() for r in output.splitlines(False) if r.strip() ] print(revs) if revs: subprocess.check_call(['git', 'lfs', 'fsck', *revs]) ```
Author
Owner

@tdesveaux commented on GitHub (Jun 3, 2024):

Welp, my toy script was FAR too naive and does not work at all.
I did a better implementation, but did not test it extensively yet.

I'm open to work on this as LFS is quite critical to the sanity of our projects.

Here is a quick patch on were I felt it should be added, can I get a confirmation that it's the correct place?
Also, would appreciate pointers on how to add tests for this.

diff --git a/cmd/hook.go b/cmd/hook.go
index 6e31710caf..37c23c97d0 100644
--- a/cmd/hook.go
+++ b/cmd/hook.go
@@ -242,7 +242,7 @@ Gitea or set your environment appropriately.`, "")
 		// If the ref is a branch or tag, check if it's protected
 		// if supportProcReceive all ref should be checked because
 		// permission check was delayed
-		if supportProcReceive || refFullName.IsBranch() || refFullName.IsTag() {
+		if supportProcReceive || refFullName.IsBranch() || refFullName.IsTag() || setting.LFS.StartServer {
 			oldCommitIDs[count] = oldCommitID
 			newCommitIDs[count] = newCommitID
 			refFullNames[count] = refFullName
diff --git a/routers/private/hook_pre_receive.go b/routers/private/hook_pre_receive.go
index 0a3c8e2559..c099979b10 100644
--- a/routers/private/hook_pre_receive.go
+++ b/routers/private/hook_pre_receive.go
@@ -4,9 +4,11 @@
 package private

 import (
+	"context"
 	"fmt"
 	"net/http"
 	"os"
+	"strings"

 	"code.gitea.io/gitea/models"
 	asymkey_model "code.gitea.io/gitea/models/asymkey"
@@ -19,6 +21,7 @@ import (
 	"code.gitea.io/gitea/modules/git"
 	"code.gitea.io/gitea/modules/log"
 	"code.gitea.io/gitea/modules/private"
+	"code.gitea.io/gitea/modules/setting"
 	"code.gitea.io/gitea/modules/web"
 	gitea_context "code.gitea.io/gitea/services/context"
 	pull_service "code.gitea.io/gitea/services/pull"
@@ -117,6 +120,10 @@ func HookPreReceive(ctx *gitea_context.PrivateContext) {
 		newCommitID := opts.NewCommitIDs[i]
 		refFullName := opts.RefFullNames[i]

+		if setting.LFS.StartServer {
+			preReceiveLFS(ourCtx, oldCommitID, newCommitID)
+		}
+
 		switch {
 		case refFullName.IsBranch():
 			preReceiveBranch(ourCtx, oldCommitID, newCommitID, refFullName)
@@ -135,6 +142,59 @@ func HookPreReceive(ctx *gitea_context.PrivateContext) {
 	ctx.PlainText(http.StatusOK, "ok")
 }

+func preReceiveLFS(ctx *preReceiveContext, oldCommitID, newCommitID string) {
+	if !ctx.AssertCanWriteCode() {
+		return
+	}
+
+	objectFormat := ctx.Repo.GetObjectFormat()
+
+	// Branch deletion, nothing to check
+	if newCommitID == objectFormat.EmptyObjectID().String() {
+		return
+	}
+
+	repo := ctx.Repo.Repository
+
+ // TODO: IMPLEMENT CHECKS HERE
+}
+
 func preReceiveBranch(ctx *preReceiveContext, oldCommitID, newCommitID string, refFullName git.RefName) {
 	branchName := refFullName.BranchName()
 	ctx.branchName = branchName

And here is a "better" (but not thoroughly tested) version of the python pre-receive hook:

#!/usr/bin/python3
from __future__ import annotations
import re
import sys
import subprocess
from pathlib import Path

# Install with
# sudo -u git ln -s /var/lib/gitea/dne_hooks/fail_lfs_no_objects.pre-receive /var/lib/gitea/repositories/{owner}/{repository}.git/hooks/pre-receive.d/fail_lfs_no_objects

GITEA_LFS_STORAGE_DIR = Path('/var/lib/gitea/data/lfs')
LFS_LSFILES_RE = re.compile("^([a-zA-Z0-9]+) ([*|-]) (.*)$")

def _lfs_object_exists(oid: str) -> bool:
    # gitea store LFS files with CAS in subdirs
    object_path = (GITEA_LFS_STORAGE_DIR / oid[0:2] / oid[2:4] / oid[4:]).absolute()
    return object_path.is_file()

def _is_default_sha(rev: str) -> bool:
    return all(c == '0' for c in rev)


def _find_first_unique_rev(rev: str) -> str:
    all_refs = set(subprocess.check_output(['git', 'show-ref', '--hash'], text=True).splitlines(False))
    # remove self
    if rev in all_refs:
        all_refs.remove(rev)
    if not all_refs:
        # gen default_sha
        return len(rev) * '0'

    return subprocess.check_output(['git', 'merge-base', rev, *all_refs] , text=True).strip()


def check_missing_lfs_objects(rev: str, parent_rev: str | None) -> bool:
    ls_files_cmd = ['git', 'lfs', 'ls-files', '--long']
    if parent_rev:
        ls_files_cmd.append(parent_rev)
    ls_files_cmd.append(rev)

    lfs_oids = [
        LFS_LSFILES_RE.match(line).groups()[0]
        for line in subprocess.check_output(ls_files_cmd, text=True).splitlines(False)
    ]

    first_missing_lfs_object = next(
        (
            oid
            for oid in lfs_oids
            if not _lfs_object_exists(oid)
        ),
        None,
    )

    if first_missing_lfs_object is not None:
        print(
            (
                f"Error: Your push referenced at least 1 unknown Git LFS object:\n"
                f"\t{first_missing_lfs_object}\n"
                "\n"
                "Try to push them with 'git lfs push --all'."
            ),
            file=sys.stderr,
        )
        sys.exit(1)


for line in sys.stdin.readlines():
    prev_rev, new_rev, ref = line.strip().split()
    if _is_default_sha(new_rev):
        # branch deletion
        continue

    if _is_default_sha(prev_rev):
        prev_rev = _find_first_unique_rev(new_rev)

    is_new_ref = _is_default_sha(prev_rev)

    rev_list_cmd = ['git', 'rev-list', new_rev]
    if not is_new_ref:
        rev_list_cmd.append(f"^{prev_rev}")
    output = subprocess.check_output(rev_list_cmd, text=True)

    revs = [
        r.strip() for r in output.splitlines(False)
        if r.strip()
    ]

    for rev in revs[:-1]:
        check_missing_lfs_objects(rev, f"{rev}~")

    check_missing_lfs_objects(
        revs[-1],
        # New ref will list ALL history, first commit does not have a parent
        f"{revs[-1]}~" if not is_new_ref else None,
    )
@tdesveaux commented on GitHub (Jun 3, 2024): Welp, my toy script was FAR too naive and does not work at all. I did a better implementation, but did not test it extensively yet. I'm open to work on this as LFS is quite critical to the sanity of our projects. Here is a quick patch on were I felt it should be added, can I get a confirmation that it's the correct place? Also, would appreciate pointers on how to add tests for this. ```diff diff --git a/cmd/hook.go b/cmd/hook.go index 6e31710caf..37c23c97d0 100644 --- a/cmd/hook.go +++ b/cmd/hook.go @@ -242,7 +242,7 @@ Gitea or set your environment appropriately.`, "") // If the ref is a branch or tag, check if it's protected // if supportProcReceive all ref should be checked because // permission check was delayed - if supportProcReceive || refFullName.IsBranch() || refFullName.IsTag() { + if supportProcReceive || refFullName.IsBranch() || refFullName.IsTag() || setting.LFS.StartServer { oldCommitIDs[count] = oldCommitID newCommitIDs[count] = newCommitID refFullNames[count] = refFullName diff --git a/routers/private/hook_pre_receive.go b/routers/private/hook_pre_receive.go index 0a3c8e2559..c099979b10 100644 --- a/routers/private/hook_pre_receive.go +++ b/routers/private/hook_pre_receive.go @@ -4,9 +4,11 @@ package private import ( + "context" "fmt" "net/http" "os" + "strings" "code.gitea.io/gitea/models" asymkey_model "code.gitea.io/gitea/models/asymkey" @@ -19,6 +21,7 @@ import ( "code.gitea.io/gitea/modules/git" "code.gitea.io/gitea/modules/log" "code.gitea.io/gitea/modules/private" + "code.gitea.io/gitea/modules/setting" "code.gitea.io/gitea/modules/web" gitea_context "code.gitea.io/gitea/services/context" pull_service "code.gitea.io/gitea/services/pull" @@ -117,6 +120,10 @@ func HookPreReceive(ctx *gitea_context.PrivateContext) { newCommitID := opts.NewCommitIDs[i] refFullName := opts.RefFullNames[i] + if setting.LFS.StartServer { + preReceiveLFS(ourCtx, oldCommitID, newCommitID) + } + switch { case refFullName.IsBranch(): preReceiveBranch(ourCtx, oldCommitID, newCommitID, refFullName) @@ -135,6 +142,59 @@ func HookPreReceive(ctx *gitea_context.PrivateContext) { ctx.PlainText(http.StatusOK, "ok") } +func preReceiveLFS(ctx *preReceiveContext, oldCommitID, newCommitID string) { + if !ctx.AssertCanWriteCode() { + return + } + + objectFormat := ctx.Repo.GetObjectFormat() + + // Branch deletion, nothing to check + if newCommitID == objectFormat.EmptyObjectID().String() { + return + } + + repo := ctx.Repo.Repository + + // TODO: IMPLEMENT CHECKS HERE +} + func preReceiveBranch(ctx *preReceiveContext, oldCommitID, newCommitID string, refFullName git.RefName) { branchName := refFullName.BranchName() ctx.branchName = branchName ``` And here is a "better" (but not thoroughly tested) version of the python pre-receive hook: ```python #!/usr/bin/python3 from __future__ import annotations import re import sys import subprocess from pathlib import Path # Install with # sudo -u git ln -s /var/lib/gitea/dne_hooks/fail_lfs_no_objects.pre-receive /var/lib/gitea/repositories/{owner}/{repository}.git/hooks/pre-receive.d/fail_lfs_no_objects GITEA_LFS_STORAGE_DIR = Path('/var/lib/gitea/data/lfs') LFS_LSFILES_RE = re.compile("^([a-zA-Z0-9]+) ([*|-]) (.*)$") def _lfs_object_exists(oid: str) -> bool: # gitea store LFS files with CAS in subdirs object_path = (GITEA_LFS_STORAGE_DIR / oid[0:2] / oid[2:4] / oid[4:]).absolute() return object_path.is_file() def _is_default_sha(rev: str) -> bool: return all(c == '0' for c in rev) def _find_first_unique_rev(rev: str) -> str: all_refs = set(subprocess.check_output(['git', 'show-ref', '--hash'], text=True).splitlines(False)) # remove self if rev in all_refs: all_refs.remove(rev) if not all_refs: # gen default_sha return len(rev) * '0' return subprocess.check_output(['git', 'merge-base', rev, *all_refs] , text=True).strip() def check_missing_lfs_objects(rev: str, parent_rev: str | None) -> bool: ls_files_cmd = ['git', 'lfs', 'ls-files', '--long'] if parent_rev: ls_files_cmd.append(parent_rev) ls_files_cmd.append(rev) lfs_oids = [ LFS_LSFILES_RE.match(line).groups()[0] for line in subprocess.check_output(ls_files_cmd, text=True).splitlines(False) ] first_missing_lfs_object = next( ( oid for oid in lfs_oids if not _lfs_object_exists(oid) ), None, ) if first_missing_lfs_object is not None: print( ( f"Error: Your push referenced at least 1 unknown Git LFS object:\n" f"\t{first_missing_lfs_object}\n" "\n" "Try to push them with 'git lfs push --all'." ), file=sys.stderr, ) sys.exit(1) for line in sys.stdin.readlines(): prev_rev, new_rev, ref = line.strip().split() if _is_default_sha(new_rev): # branch deletion continue if _is_default_sha(prev_rev): prev_rev = _find_first_unique_rev(new_rev) is_new_ref = _is_default_sha(prev_rev) rev_list_cmd = ['git', 'rev-list', new_rev] if not is_new_ref: rev_list_cmd.append(f"^{prev_rev}") output = subprocess.check_output(rev_list_cmd, text=True) revs = [ r.strip() for r in output.splitlines(False) if r.strip() ] for rev in revs[:-1]: check_missing_lfs_objects(rev, f"{rev}~") check_missing_lfs_objects( revs[-1], # New ref will list ALL history, first commit does not have a parent f"{revs[-1]}~" if not is_new_ref else None, ) ```
Author
Owner

@lunny commented on GitHub (Jun 3, 2024):

Looks like there is no easy way to get all LFS objects from a commit. We have to read all changed objects on this commit which is too heavy. Maybe we need to have a new database table to track repo_id, commit_id, path, oid for LFS.

@lunny commented on GitHub (Jun 3, 2024): Looks like there is no easy way to get all LFS objects from a commit. We have to read all changed objects on this commit which is too heavy. Maybe we need to have a new database table to track `repo_id`, `commit_id`, `path`, `oid` for LFS.
Author
Owner

@tdesveaux commented on GitHub (Jun 3, 2024):

Do you mean for caching it and avoid checking multiple times?

I'm not sure I understand correctly We have to read all changed objects on this commit which is too heavy.
git lfs ls-files {SHA}~ {SHA} will only list for the commit SHA. Or is that already too heavy?

note: git lfs ls-files {SHA} will list all LFS objects accessible at this commit. as in if you checkout this commit, all these LFS objects need to be checked out too.

@tdesveaux commented on GitHub (Jun 3, 2024): Do you mean for caching it and avoid checking multiple times? I'm not sure I understand correctly `We have to read all changed objects on this commit which is too heavy`. `git lfs ls-files {SHA}~ {SHA}` will only list for the commit `SHA`. Or is that already too heavy? note: `git lfs ls-files {SHA}` will list all LFS objects accessible at this commit. as in if you checkout this commit, all these LFS objects need to be checked out too.
Author
Owner

@tdesveaux commented on GitHub (Aug 14, 2024):

I took a stab at an implementation 80415af209

Disclaimer: I have basically no experience with Go or the Gitea codebase.
This is not polished at all, Didn't even test this code yet.

When pushing multiple refs at the same time, all the new commits will be checked, even if the new refs have commits in common.
I think have a somewhat short lived cache should be enough to tackle this.

@tdesveaux commented on GitHub (Aug 14, 2024): I took a stab at an implementation https://github.com/tdesveaux/gitea/commit/80415af2095115fa3e68114a70cad323419e1705 Disclaimer: I have basically no experience with Go or the Gitea codebase. This is not polished at all, Didn't even test this code yet. When pushing multiple refs at the same time, all the new commits will be checked, even if the new refs have commits in common. I think have a somewhat short lived cache should be enough to tackle this.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/gitea#13064