Mirror of https://github.com/go-gitea/gitea.git, synced 2026-03-12 02:24:21 -05:00
Queues not terminating gracefully #10315
Closed
Opened 2025-11-02 09:04:02 -06:00 by GiteaMirror · 19 comments
Labels: type/bug
Originally created by @brechtvl on GitHub (Feb 21, 2023).
Description
We are running into an issue where, if a Gitea restart happens during merge conflict checking, PRs remain stuck in the conflict-checking state forever. I think there are two distinct issues here; one is that these don't get unstuck on new pushes, and perhaps that's best left for another report.
However, the reason things get into this state in the first place seems to be a problem where shutdown terminates all the queues immediately instead of respecting the hammer time.
When starting `./gitea web` and then doing `killall gitea`, I see the following in the log. (This log is from my local test instance with configuration set to defaults as much as possible.)
On our production instance, that `Terminated before completed flushing` is leading to a lot of different errors, as workers get terminated in the middle of what they're doing. I can provide more detailed logs if needed, but maybe this is easy to redo on any instance.
Gitea Version
main (43405c3)
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
No response
Screenshots
No response
Git Version
No response
Operating System
No response
How are you running Gitea?
Own build from main.
Database
None
@techknowlogick commented on GitHub (Feb 21, 2023):
For debugging purposes:
`[W] [63f40381] cannot get cache context when getting data: &{context.Background.WithCancel.WithCancel.WithValue(type pprof.labelContextKey, val {"graceful-lifecycle":"with-hammer"})` was also found in the logs, which traces back to the following function: https://sourcegraph.com/github.com/go-gitea/gitea/-/blob/modules/cache/context.go?L58
Edit: this is also showing up on gitea.com, although not during shutdowns.
@brechtvl commented on GitHub (Feb 21, 2023):
I'm also not seeing `"graceful-lifecycle":"with-hammer"` messages when simply starting and stopping Gitea. But when I, for example, do a push that causes conflict-checking workers to be run for PRs, then they start showing up. So I guess there is some problem with the cache context in the workers?
@zeripath commented on GitHub (Feb 21, 2023):
If I had to guess, this is going to be a misreport: there's going to be a non-deterministic hierarchy of cancelled contexts, the wrong one of which is being picked and reported. I'd need to spend a good few hours looking through the code thinking about the concurrency again, which I won't have time to do until the weekend.
@lunny commented on GitHub (Feb 21, 2023):
The context cache should currently be used at the HTTP request level. Do you have a Redis cache server?
@techknowlogick commented on GitHub (Feb 21, 2023):
@lunny they are using memcached for cache
@lunny commented on GitHub (Feb 21, 2023):
This may occur when Gitea starts up.
@brechtvl commented on GitHub (Feb 21, 2023):
I see these `Terminated before completed flushing` warning messages also in v1.16.9 (I had some problems testing earlier). So I'm not sure if this is a problem that has existed for a while, or if these messages are not as serious as they sound and the problem is elsewhere. Here is the part of the log from the production server where it got the shutdown signal and then ran into errors, if it helps:
https://gist.github.com/brechtvl/e4778293c1c919a4ec896a31e418677c
@techknowlogick commented on GitHub (Feb 21, 2023):
@lunny the same ctx error can be seen in Gitea.com logs and occurs at times other than startup and shutdown
@lunny commented on GitHub (Feb 22, 2023):
I have sent PR #23054, but I don't think it will actually fix this problem; it just removes the unnecessary warning logs.
@lunny commented on GitHub (Feb 22, 2023):
From the logs, it looks like a normal termination process.
@brechtvl commented on GitHub (Feb 22, 2023):
Everything seems to shut down within 1s, while `GRACEFUL_HAMMER_TIME` was left at the default 60s. I was expecting it to wait up to 60 seconds for any ongoing work to complete. Or is that only meant for requests, not queues?
If it's not meant for queues, is the mechanism that work gets added back to the queue on such errors, to be redone after restart? It wasn't doing that as far as I could tell, but I may well have missed something.
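For reference, the setting in question lives in the `[server]` section of `app.ini` (per the Gitea configuration cheat sheet), shown here with its documented default:

```ini
[server]
; How long to wait after a shutdown signal before force-closing
; ("hammering") any connections and workers that are still running.
GRACEFUL_HAMMER_TIME = 60s
```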
@lunny commented on GitHub (Feb 22, 2023):
I think the logic should be: if all workers shut down in less than 60s, then it will exit immediately. If some workers' shutdown times out (exceeds 60s), then the hammer shutdown will be executed and the program will exit at that time.
@brechtvl commented on GitHub (Feb 22, 2023):
I found more details (and a very hacky workaround) for pull requests getting stuck in conflict checking.
- On shutdown during `pr_patch_checker` work, a persistent queue of remaining PRs to be checked is saved to disk.
- On startup, this persistent queue is read back into `prPatchCheckerQueue`, but this seemingly does not cause any work to start.
- `InitializePullRequests` will put any PRs in need of checking in the channel queue of `prPatchCheckerQueue` and start working on them.
- When such a check completes, `checkAndUpdateStatus` never updates the PR status because `prPatchCheckerQueue.Has` returns true, due to this same PR still being in the persistent queue of `prPatchCheckerQueue`.

The persistent queue in this case seems unnecessary, as `InitializePullRequests` can already restart conflict checking for any PRs that need it.
@brechtvl commented on GitHub (Feb 22, 2023):
The steps to reproduce this are: run `killall gitea` quickly while conflict checking is going on.
@wxiaoguang commented on GitHub (Feb 23, 2023):
This is a test for this problem
The results are quite surprising.
@zeripath commented on GitHub (Feb 25, 2023):
@wxiaoguang Thanks for your testcase. It's not quite right but it's pointed me in the correct direction.
Queues are extremely concurrent, and the handler in q1 is going to drop a lot of data. You cannot expect that no more data will be handled as soon as you tell a queue to shut down; in fact, you can almost guarantee that at least a few more items will be handled.
I'm working through it, trying to make it properly safe for concurrency, and I'll place it in modules/queue once I've slightly rewritten it. (I'm not sure what q3 is supposed to reliably do.)
Now to address the warning: if you switch on trace logging for modules/queue, you would see that the internal channel queue is explicitly drained at Terminate. There should be no loss of data; the warning is not the cause of the issue you're investigating.
I assume you're actually investigating things not being emptied on startup - and that's a real bug.
If you look at:
8540fc45b1/modules/queue/queue_disk_channel.go (L175-L197)
and compare with:
8540fc45b1/modules/queue/unique_queue_disk_channel.go (L212-L226)
I think you'll see the real bug.
@wxiaoguang commented on GitHub (Feb 25, 2023):
Actually, I do not care about whether the queue is reliable or not. And in most cases, the queue doesn't need to be reliable.
But `has3=19` seems to cause the bug described by this issue: these tasks are just stuck in the queue.
@zeripath commented on GitHub (Feb 25, 2023):
As I said, your testcase is somewhat incorrect in form. The inconsistent behaviour you're seeing is partially because of this, but also because of the bug I mentioned above.
Thank you for the testcase, though: whilst it's not quite right, it made me realise that the bug is the one I've mentioned above, which I've fixed in the attached PR. The testcase has also been fixed.
@wxiaoguang commented on GitHub (Feb 26, 2023):
After applying #23154, I can still see some strange results (just strange; I do not know whether they are right or wrong):
- `q2` always outputs `executed2=0, has2=20`. Does that mean that if there is no `Push` after startup, the tasks in the queue will never be executed?
- `q3` uses `Has` to limit tasks to being executed only once (the code is in the details below; maybe like `checkAndUpdateStatus` in the real code?), yet the output is `executed3=107, has3=0`. That means many tasks are still re-executed even with a `Has` check. I know that there is concurrency, but is this clear to future developers? Is existing code like `checkAndUpdateStatus`, which relies on `Has`, working as expected in all cases?

Are these two behaviors by design? If yes, I think some comments would be very useful. If no, should they be fixed or not?
Feel free to ignore my comment if you think they are unrelated.
PS: I didn't get what you mean by "incorrect" or "not quite right". That test code is just used to test the queue's behavior; there is no assertion in it, and I didn't say anything about "correct" or "right" before ...