Prune hook_task Table #5069

Closed
opened 2025-11-02 06:13:10 -06:00 by GiteaMirror · 19 comments
Owner

Originally created by @jag3773 on GitHub (Mar 16, 2020).

  • Gitea version (or commit ref): NA
  • Git version: NA
  • Operating system: NA
  • Database (use [x]): NA
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

The hook_task table doesn't appear to be pruned at any point, so on an active site this table can grow to be very large and it makes loading the edit webhook page quite slow. I'm not sure how the action table relates but that may need to be pruned also?

Possible Solution

One solution I thought of is to allow administrators to set a max number of deliveries to retain per webhook.

Screenshots

MySQL [(none)]> SELECT 
    ->      table_schema as `Database`, 
    ->      table_name AS `Table`, 
    ->      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` 
    -> FROM information_schema.TABLES 
    -> ORDER BY (data_length + index_length) DESC;
+--------------------+------------------------------------------------------+------------+
| Database           | Table                                                | Size in MB |
+--------------------+------------------------------------------------------+------------+
| gogs               | hook_task                                            |    1723.52 |
| gogs               | action                                               |     908.06 |
Originally created by @jag3773 on GitHub (Mar 16, 2020). - Gitea version (or commit ref): NA - Git version: NA - Operating system: NA - Database (use `[x]`): NA - [ ] PostgreSQL - [x] MySQL - [ ] MSSQL - [ ] SQLite - Can you reproduce the bug at https://try.gitea.io: - [ ] Yes (provide example URL) - [ ] No - [x] Not relevant - Log gist: ## Description The hook_task table doesn't appear to be pruned at any point, so on an active site this table can grow to be very large and it makes loading the edit webhook page quite slow. I'm not sure how the `action` table relates but that may need to be pruned also? ## Possible Solution One solution I thought of is to allow administrators to set a max number of deliveries to retain *per webhook*. ## Screenshots ``` MySQL [(none)]> SELECT -> table_schema as `Database`, -> table_name AS `Table`, -> round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` -> FROM information_schema.TABLES -> ORDER BY (data_length + index_length) DESC; +--------------------+------------------------------------------------------+------------+ | Database | Table | Size in MB | +--------------------+------------------------------------------------------+------------+ | gogs | hook_task | 1723.52 | | gogs | action | 908.06 | ```
GiteaMirror added the issue/confirmedtype/enhancement labels 2025-11-02 06:13:10 -06:00
Author
Owner

@lafriks commented on GitHub (Mar 16, 2020):

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

@lafriks commented on GitHub (Mar 16, 2020): Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy
Author
Owner

@lunny commented on GitHub (Mar 17, 2020):

At first, we can add button on repository webhook management UI to delete old webhooks.

@lunny commented on GitHub (Mar 17, 2020): At first, we can add button on repository webhook management UI to delete old webhooks.
Author
Owner

@jag3773 commented on GitHub (Mar 17, 2020):

@lunny If I understand your suggestion, I don't think that solves the problem. That's only helpful if you have a handful of repositories, in which case you would never have this problem in the first place. Imagine having to click that button on 25,000 repositories!

The only realistic solution in that case is to have a global setting that the admin can control.

@jag3773 commented on GitHub (Mar 17, 2020): @lunny If I understand your suggestion, I don't think that solves the problem. That's only helpful if you have a handful of repositories, in which case you would never have this problem in the first place. Imagine having to click that button on 25,000 repositories! The only realistic solution in that case is to have a global setting that the admin can control.
Author
Owner

@lunny commented on GitHub (Mar 17, 2020):

@jag3773 I also think there should be a button on admin panel but it's not conflicted with my idea. The button on admin panel will delete all the repositories in the gitea instance and for a public instance, we should let user chose whether to delete them.

@lunny commented on GitHub (Mar 17, 2020): @jag3773 I also think there should be a button on admin panel but it's not conflicted with my idea. The button on admin panel will delete all the repositories in the gitea instance and for a public instance, we should let user chose whether to delete them.
Author
Owner

@bhalbright commented on GitHub (Apr 14, 2020):

Just posting a note I am looking into this issue, thanks!

@bhalbright commented on GitHub (Apr 14, 2020): Just posting a note I am looking into this issue, thanks!
Author
Owner

@stale[bot] commented on GitHub (Jun 13, 2020):

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

@stale[bot] commented on GitHub (Jun 13, 2020): This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.
Author
Owner

@lunny commented on GitHub (Jun 13, 2020):

A configuration item could be added to keep recent 1 year(or longer) webhooks. A background go routine could clean them every day.

@lunny commented on GitHub (Jun 13, 2020): A configuration item could be added to keep recent 1 year(or longer) webhooks. A background go routine could clean them every day.
Author
Owner

@bhalbright commented on GitHub (Jun 15, 2020):

@lunny thanks, I had submitted a PR for an implementation based on deleting all but x webhooks (can be set per repo). Do you have any thoughts there...that was the implementation that worked best for our use case but I can understand if you'd rather have something a little different for general usage.

@bhalbright commented on GitHub (Jun 15, 2020): @lunny thanks, I had submitted a PR for an implementation based on deleting all but x webhooks (can be set per repo). Do you have any thoughts there...that was the implementation that worked best for our use case but I can understand if you'd rather have something a little different for general usage.
Author
Owner

@lunny commented on GitHub (Jun 18, 2020):

Then we could have two choices. One is to delete old finished records, another is to keep some recent unfinished records.

@lunny commented on GitHub (Jun 18, 2020): Then we could have two choices. One is to delete old finished records, another is to keep some recent unfinished records.
Author
Owner

@jag3773 commented on GitHub (Jun 18, 2020):

Sounds like you are both saying similar things. My request is to limit the number of items kept in the history, whether it is date based or count based is not so important.

The advantage of "count based" is that every repo will retain a certain number of recent events. In a date based system, it's likely you would not see any history for a repo that hasn't been used recently.

@jag3773 commented on GitHub (Jun 18, 2020): Sounds like you are both saying similar things. My request is to limit the number of items kept in the history, whether it is date based or count based is not so important. The advantage of "count based" is that every repo will retain a certain number of recent events. In a date based system, it's likely you would not see any history for a repo that hasn't been used recently.
Author
Owner

@bhalbright commented on GitHub (Jun 21, 2020):

I guess @lunny you are saying we should give the user the option to either purge by "older than x days" or an option like we had suggested "delete all but most recent x entries"? Which should be the default option?

@bhalbright commented on GitHub (Jun 21, 2020): I guess @lunny you are saying we should give the user the option to either purge by "older than x days" or an option like we had suggested "delete all but most recent x entries"? Which should be the default option?
Author
Owner

@lunny commented on GitHub (Jun 21, 2020):

@bhalbright Yes. That's what I meant.

@lunny commented on GitHub (Jun 21, 2020): @bhalbright Yes. That's what I meant.
Author
Owner

@jgkirschbaum commented on GitHub (Jul 16, 2020):

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

@lunny @lafriks If I see this correctly, the action table could also be pruned. In my opinion the only function of the action table is to provide data for the dashboard. The timespan for the dashboard is currently 1 year, so data older than one year could be pruned. Our action table is about 1 GB in size after 1 year usage of gitea, which makes up 90% of the total database size.

So it would be perfect if someone could please implement the following features:

  • The timespan displayed of the dashboard should be configurable, so not always a full year has to be displayed (dashboard_display_period).
  • The timespan for historization of the action table should be configurable (action_historization_period (default is dashboard_display_period)).
  • Pruning of the action table should be implemented by a gitea cron task in dependence of the configured value for action_historization_period.

I'm sorry I can't support coding, but I don't speak go.

@jgkirschbaum commented on GitHub (Jul 16, 2020): > > > Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy @lunny @lafriks If I see this correctly, the action table could also be pruned. In my opinion the only function of the action table is to provide data for the dashboard. The timespan for the dashboard is currently 1 year, so data older than one year could be pruned. Our action table is about 1 GB in size after 1 year usage of gitea, which makes up 90% of the total database size. So it would be perfect if someone could please implement the following features: * The timespan displayed of the dashboard should be configurable, so not always a full year has to be displayed (`dashboard_display_period`). * The timespan for historization of the action table should be configurable (`action_historization_period` (default is `dashboard_display_period`)). * Pruning of the action table should be implemented by a gitea cron task in dependence of the configured value for `action_historization_period`. I'm sorry I can't support coding, but I don't speak go.
Author
Owner

@lunny commented on GitHub (Jul 16, 2020):

I think we could also have a button on admin panel to clean the two tables.

@lunny commented on GitHub (Jul 16, 2020): I think we could also have a button on admin panel to clean the two tables.
Author
Owner

@jgkirschbaum commented on GitHub (Jul 16, 2020):

Yes, would be a first step.

@jgkirschbaum commented on GitHub (Jul 16, 2020): Yes, would be a first step.
Author
Owner

@zeripath commented on GitHub (Jul 28, 2020):

So partially this is a problem of database management. When you get into large enough systems those running Gitea are going to need to actually do some DB management themselves and not rely on the ORM creating a perfect DB.

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

Throwing away action data is a decision for server managers - I'm not sure that we at Gitea should be running anything that deletes data by default.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

@zeripath commented on GitHub (Jul 28, 2020): So partially this is a problem of database management. When you get into large enough systems those running Gitea are going to need to actually do some DB management themselves and not rely on the ORM creating a perfect DB. For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems. Throwing away action data is a decision for server managers - I'm not sure that we at Gitea should be running anything that deletes data by default. However, we can and should do a few things here. * I agree putting in some configurabilty about dashboard length/feed length is required. * We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work. One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.
Author
Owner

@bhalbright commented on GitHub (Jul 29, 2020):

@zeripath regarding the cron to delete old actions being OFF by default, would you expect the same for a cron job to delete from hook_task? In the PR I had submitted it was turned on by default globally and then you could turn on/off by repo in the UI.

@bhalbright commented on GitHub (Jul 29, 2020): @zeripath regarding the cron to delete old actions being OFF by default, would you expect the same for a cron job to delete from hook_task? In the PR I had submitted it was turned on by default globally and then you could turn on/off by repo in the UI.
Author
Owner

@jgkirschbaum commented on GitHub (Jul 29, 2020):

@zeripath

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

You are right, that would be simple and effective, but I would not recommend this approach to Gitea admins, because then you have another DDL as the one delivered with Gitea.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

That's great and I appreciate that.

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

IMHO I would advice you not to use plain text files and create a hybrid db of your own. Plain text files are a mess and in a container environment they are a mess and pain. I think the preasure on this topic isn't so high, so making the things configurable paired with a few simple db jobs inside Gitea as you recommended would be sufficient.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

Which data is updated? As far as I know, the data in both the action table and the hook_task table are immutable and are not updated. But as I mentioned before I would not go with an hybrid approach, this introduces unneeded complexity.

Thank you for your efforts.

@jgkirschbaum commented on GitHub (Jul 29, 2020): @zeripath > For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems. You are right, that would be simple and effective, but I would not recommend this approach to Gitea admins, because then you have another DDL as the one delivered with Gitea. > However, we can and should do a few things here. > * I agree putting in some configurabilty about dashboard length/feed length is required. > * We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default That's great and I appreciate that. > Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work. IMHO I would advice you not to use plain text files and create a hybrid db of your own. Plain text files are a mess and in a container environment they are a mess and pain. I think the preasure on this topic isn't so high, so making the things configurable paired with a few simple db jobs inside Gitea as you recommended would be sufficient. > One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense. Which data is updated? As far as I know, the data in both the action table and the hook_task table are immutable and are not updated. But as I mentioned before I would not go with an hybrid approach, this introduces unneeded complexity. Thank you for your efforts.
Author
Owner

@zeripath commented on GitHub (Jul 29, 2020):

@bhalbright in regards to the hook_task table - in some ways that could be argued as being just user visible logging. Actions though are the unique behaviours of the users - it's a bit more than logging.

@zeripath commented on GitHub (Jul 29, 2020): @bhalbright in regards to the hook_task table - in some ways that could be argued as being just user visible logging. Actions though are the unique behaviours of the users - it's a bit more than logging.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/gitea#5069