Proposal: An abstract layer for managed git repositories #12427

Open
opened 2025-11-02 10:09:24 -06:00 by GiteaMirror · 1 comment

Originally created by @lunny on GitHub (Feb 3, 2024).

Background

As more and more large Gitea instances appear, the current implementation has three drawbacks.

  • Scalability

Git repositories are stored on disk under a single root directory. This is hard to scale for big Gitea instances, because the absolute repository path is already used everywhere in the code.

  • Fork disk optimization

Git itself supports shared repositories, but Gitea doesn't use this feature yet to reduce the disk usage of forked repositories. Some design questions need to be considered. Which repository should be the root of the base and forked repositories? Should we have a hidden repository as the root? This is also related to the layer.

  • Risk of failure when renaming

When renaming a repository or a user, some folders need to be renamed, and these disk operations are mixed with database transactions. There is a high risk of inconsistency between the names on disk and the database records.

Purpose

So I propose an abstract layer for managed repositories.

What are managed repositories? We now have the git package, which can handle all git repositories. Some repositories are created temporarily for pushing, editing and various other reasons. Others, like the code repository, wiki repository, profile repository and package repositories, are not created and destroyed for a single operation. We call these managed repositories.

All operations on managed repositories will depend on a new package named gitrepo rather than directly on the git package.

I think this has some benefits.

  • It will be easier to introduce distributed git storage based on the gitrepo package. After all the abstractions are complete, we can have a proxy mode inside the gitrepo package, i.e.

OriginalGitStorageService could keep the original logic with a repositories root path.

HTTPGitStorageService could store the managed git repositories on a server separate from the Gitea server and provide an HTTP service to read/write managed git repositories.

  • Convert to a different storage directory structure. Currently, renaming a user or repository requires renaming directories on disk. This makes it difficult to stay consistent when an operation fails. The best method is to use fixed repository information as directory names: we can use the user/repository ID or something similar, so that renaming a user/repository needs no disk operation.

Concepts

I previously sent some PRs that tried to introduce a layer in modules/git, but I found that it's not the right direction. The modules/git package should be a base package that always focuses on handling disk operations, whether the repository is a managed one, a wiki one, a temporary one or a hidden one. So I think some concepts need to be introduced for clarity.

  • Managed Git Repositories: All repositories recorded in Gitea's database, including wiki repositories and other repository types added in the future, can be considered managed git repositories. Only these git repositories should be managed by the distributed system.
  • Temporary Git Repositories: Repositories that are created and deleted while Gitea performs some internal operation. They are stored on the system's temporary file system and cleaned up after the related operation finishes.
  • modules/git: This package should be a low-level package that can handle any git repository on disk. For managed git repositories, a new package should be introduced.
  • modules/gitrepo: This is the new package, introduced as an abstract layer to handle managed git repositories. It may include different storage strategies, but the interface exposed to other packages stays almost the same as before, to hide the implementation details. This package will depend on modules/git and should not depend on any models packages. Other modules and services layer packages can depend on it.

Refactoring

To achieve this, we need to do some refactoring.

Move managed git operations and setting.RepoRootPath to modules/gitrepo package.

All operations related to managed git repositories should be moved to the gitrepo package and should no longer depend on modules/git directly. modules/git is still useful: it can handle temporary repositories and is depended on by modules/gitrepo.

An abstract storage repository interface like

type Repository interface {
	RelativePath() string
}

So we need CodeStorageRepository, WikiStorageRepository, ProfileStorageRepository and PackageRepository, which implement this interface.

The interface should only focus on the storage of managed git repositories.

All functions under modules/gitrepo should take this interface as the second parameter; the first one is context.Context.

Storage strategies

The relative path is currently generated dynamically from the owner name and repository name. It should be stored in the database; we can add a new column to the database table repository, i.e.

type Repository struct {
	...
	StorageRelativePath string `xorm:"VARCHAR(2048)"`
	...
}

For generating the storage path, we can introduce different storage strategies, i.e.


func GenerateTraditionalRelativePath(repo *repo_model.Repository) string {
	return repo.OwnerName + "/" + repo.Name
}

func GenerateHashedRelativePath(repo *repo_model.Repository) string {
	return hashfunc(repo.ID)
}

The strategy should be applied only to newly created repositories; existing repositories will rely on the database column for their storage relative path.
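A minimal sketch of that rule, assuming the path is generated once at creation time and then only read back from the stored column. The trimmed Repository struct, PathStrategy, AssignStoragePath and GenerateIDRelativePath are hypothetical names used only for this example.

```go
package main

import "fmt"

// Repository is a trimmed stand-in for repo_model.Repository.
type Repository struct {
	ID                  int64
	OwnerName           string
	Name                string
	StorageRelativePath string // value of the proposed database column
}

// PathStrategy generates a relative path for a newly created repository.
type PathStrategy func(repo *Repository) string

// GenerateTraditionalRelativePath follows the owner/name layout.
func GenerateTraditionalRelativePath(repo *Repository) string {
	return repo.OwnerName + "/" + repo.Name
}

// GenerateIDRelativePath is a placeholder for an ID-based strategy
// (assumption: any function of the ID would work here).
func GenerateIDRelativePath(repo *Repository) string {
	return fmt.Sprintf("by-id/%d", repo.ID)
}

// AssignStoragePath runs the strategy only when no path has been stored yet.
// Repositories that already have a stored path are left untouched.
func AssignStoragePath(repo *Repository, strategy PathStrategy) {
	if repo.StorageRelativePath == "" {
		repo.StorageRelativePath = strategy(repo)
	}
}

func main() {
	// An old repository whose column was filled with the traditional path.
	old := &Repository{ID: 1, OwnerName: "alice", Name: "demo", StorageRelativePath: "alice/demo"}
	// A new repository created after the instance switched strategies.
	fresh := &Repository{ID: 2, OwnerName: "alice", Name: "tools"}

	for _, r := range []*Repository{old, fresh} {
		AssignStoragePath(r, GenerateIDRelativePath)
		fmt.Println(r.ID, r.StorageRelativePath)
	}
}
```

Switching the configured strategy therefore only affects repositories created afterwards; the old repository keeps alice/demo until a conversion tool rewrites its column and moves it on disk.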

Some strategies will require disk operations when renaming, which should be part of the strategy.
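One possible way to express this, shown as a sketch only: the strategy computes the path, and a helper derives from it whether a rename has to move anything on disk. Strategy, RenamePlan, PlanRename and both strategy types are hypothetical names, and the sketch only plans the move rather than performing it.

```go
package main

import "fmt"

// Strategy maps repository information to a relative storage path.
type Strategy interface {
	RelativePath(id int64, ownerName, repoName string) string
}

// TraditionalStrategy uses owner/name, so the path changes on rename.
type TraditionalStrategy struct{}

func (TraditionalStrategy) RelativePath(_ int64, ownerName, repoName string) string {
	return ownerName + "/" + repoName
}

// IDStrategy uses only the repository ID, so the path never changes.
type IDStrategy struct{}

func (IDStrategy) RelativePath(id int64, _, _ string) string {
	return fmt.Sprintf("by-id/%d", id)
}

// RenamePlan describes what a rename means for the storage layer.
type RenamePlan struct {
	OldPath    string
	NewPath    string
	MoveOnDisk bool
}

// PlanRename asks the strategy for the path before and after the rename.
// A disk move is needed only when the two paths differ.
func PlanRename(s Strategy, id int64, ownerName, oldName, newName string) RenamePlan {
	oldPath := s.RelativePath(id, ownerName, oldName)
	newPath := s.RelativePath(id, ownerName, newName)
	return RenamePlan{OldPath: oldPath, NewPath: newPath, MoveOnDisk: oldPath != newPath}
}

func main() {
	for _, s := range []Strategy{TraditionalStrategy{}, IDStrategy{}} {
		plan := PlanRename(s, 42, "alice", "demo", "demo-v2")
		fmt.Printf("%T: %+v\n", s, plan)
	}
}
```

With an ID-based strategy the rename becomes a pure database update, which removes the mixed disk and transaction step described in the Background section.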

We can have a conversion tool to convert from the traditional relative path strategy to the hashed relative path. The hashed relative path will use the repository’s ID, which is a 64-bit integer.
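The proposal leaves hashfunc open. As one possible sketch, the function below hashes the decimal repository ID with SHA-256 and uses the first two byte pairs as directory levels so that no single directory grows too large. The choice of SHA-256, the two-level sharding and the function signature (taking the ID directly) are assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// GenerateHashedRelativePath derives a stable relative path from the
// repository ID alone. The layout is <h[0:2]>/<h[2:4]>/<h>, where h is the
// hex-encoded SHA-256 of the decimal ID (an assumed scheme, not Gitea's).
func GenerateHashedRelativePath(repoID int64) string {
	sum := sha256.Sum256([]byte(strconv.FormatInt(repoID, 10)))
	h := hex.EncodeToString(sum[:])
	return h[:2] + "/" + h[2:4] + "/" + h
}

func main() {
	for _, id := range []int64{1, 2, 12427} {
		fmt.Println(id, GenerateHashedRelativePath(id))
	}
}
```

Since the path depends only on the ID, renaming the owner or the repository never changes it.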

Multiple storage services

After the first two steps, we have enough abstraction to introduce GitStorageService. A GitStorageService could have an interface such as

type GitStorageService interface {
	Init(ctx context.Context) error
	OpenRepository(ctx context.Context) (GitRepository, error)
	RunCommand(ctx context.Context, repo Repository, c *git.Command, opts *git.RunOpts) error
}

A repository interface

Since for difference

type GitRepository interface {
    GetCommit(ctx context.Context, commitID string) (*git.Commit, error)
}

Git Objects rewrite

Many git objects contain a reference to git.Repository, which prevents the above abstraction, so a preparation step is to remove the reference inside git objects like git.Commit, git.Tag, etc.

Related PRs

#28937
#28940
#28966

GiteaMirror added the type/proposal label 2025-11-02 10:09:24 -06:00

@silverwind commented on GitHub (Mar 9, 2024):

Reduce fork repositories size

That will be a massive benefit for big hosters with many forks per repo and this is also how GitHub works under the hood. A repo and all of its forks use a shared git repo on the server, so if a repo has 1000 forks, you are only storing their changed branches.

Care needs to be taken to prevent cross-repo influences. GitHub also had a number of issues related to this in the past (this comes to mind: https://news.ycombinator.com/item?id=24882921).

Reference: github-starred/gitea#12427