Mirror of https://github.com/go-gitea/gitea.git, synced 2026-03-12 02:24:21 -05:00
Queues not terminating gracefully #10315
Closed
Opened 2025-11-02 09:04:02 -06:00 by GiteaMirror · 19 comments
Labels: type/bug
Originally created by @brechtvl on GitHub (Feb 21, 2023).
Description
We are running into an issue where, if a Gitea restart happens during merge conflict checking, PRs remain stuck in the conflict-checking state forever. I think there are two distinct issues here; one is that these don't get unstuck on new pushes, and perhaps that's best left for another report.
However, the reason things get into this state in the first place seems to be a problem where shutdown terminates all the queues immediately instead of respecting the hammer time.
When starting `./gitea web` and then doing `killall gitea`, I see the following in the log. (This log is from my local test instance with configuration set to defaults as much as possible.)
On our production instance, that `Terminated before completed flushing` is leading to a lot of different errors, as workers get terminated in the middle of what they're doing. I can provide more detailed logs if needed, but maybe this is easy to redo on any instance.
Gitea Version
main (43405c3)
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
No response
Screenshots
No response
Git Version
No response
Operating System
No response
How are you running Gitea?
Own build from main.
Database
None
@techknowlogick commented on GitHub (Feb 21, 2023):
For debugging purposes:
`[W] [63f40381] cannot get cache context when getting data: &{context.Background.WithCancel.WithCancel.WithValue(type pprof.labelContextKey, val {"graceful-lifecycle":"with-hammer"})` was also found in the logs, which traces back to the following function: https://sourcegraph.com/github.com/go-gitea/gitea/-/blob/modules/cache/context.go?L58
Edit: this is also showing up on gitea.com, although not during shutdowns.
@brechtvl commented on GitHub (Feb 21, 2023):
I'm also not seeing `"graceful-lifecycle":"with-hammer"` messages when simply starting and stopping Gitea. But when I, for example, do a push that causes conflict-checking workers to be run for PRs, then they start showing up. So I guess there is some problem with the cache context in the workers?
@zeripath commented on GitHub (Feb 21, 2023):
If I had to guess, this is going to be a misreport: there's going to be a non-deterministic hierarchy of cancelled contexts, the wrong one of which is being picked and reported. I'd need to spend a good few hours looking through the code thinking about the concurrency again, which I won't have time to do until the weekend.
@lunny commented on GitHub (Feb 21, 2023):
The context cache should currently be used at the HTTP request level. Do you have a Redis cache server?
@techknowlogick commented on GitHub (Feb 21, 2023):
@lunny they are using memcached for cache
@lunny commented on GitHub (Feb 21, 2023):
This may occur when Gitea starts up.
@brechtvl commented on GitHub (Feb 21, 2023):
I see these `Terminated before completed flushing` warning messages also in v1.16.9 (I had some problems testing earlier). So I'm not sure if this is a problem that has existed for a while, or if these messages are not as serious as they sound and the problem is elsewhere. Here is the part of the log from the production server where it got the shutdown signal and then ran into errors, if it helps:
https://gist.github.com/brechtvl/e4778293c1c919a4ec896a31e418677c
@techknowlogick commented on GitHub (Feb 21, 2023):
@lunny the same ctx error can be seen in Gitea.com logs and occurs at times other than startup and shutdown
@lunny commented on GitHub (Feb 22, 2023):
I have sent PR #23054, but I don't think it will actually fix this problem; it just removes the unnecessary warning logs.
@lunny commented on GitHub (Feb 22, 2023):
From the logs, it looks like a normal termination process.
@brechtvl commented on GitHub (Feb 22, 2023):
Everything seems to shut down within 1s, while `GRACEFUL_HAMMER_TIME` was left at the default 60s. I was expecting it to wait up to 60 seconds for any ongoing work to complete. Or is that only meant for requests, not queues?
If it's not meant for queues, is the mechanism that work gets added back to the queue on such errors, to be redone after restart? It wasn't doing that as far as I could tell, but I may well have missed something.
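For reference, the setting in question lives in the `[server]` section of `app.ini` (per the Gitea configuration cheat sheet), shown here with its documented default:

```ini
[server]
; How long to wait after a shutdown signal before force-closing
; ("hammering") any connections and workers that are still running.
GRACEFUL_HAMMER_TIME = 60s
```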
@lunny commented on GitHub (Feb 22, 2023):
I think the logic should be: if all workers shut down in less than 60s, then it will exit immediately. If some workers' shutdown times out (exceeds 60s), then the hammer shutdown will be executed and the program will exit at that time.
@brechtvl commented on GitHub (Feb 22, 2023):
I found more details (and a very hacky workaround) for pull requests getting stuck in conflict checking.
- On shutdown during `pr_patch_checker` work, a persistent queue of remaining PRs to be checked is saved to disk.
- On startup, this persistent queue is read back into `prPatchCheckerQueue`, but this seemingly does not cause any work to start.
- `InitializePullRequests` will put any PRs in need of checking in the channel queue of `prPatchCheckerQueue` and start working on them.
- When such a check completes, `checkAndUpdateStatus` never updates the PR status because `prPatchCheckerQueue.Has` returns true, due to this same PR still being in the persistent queue of `prPatchCheckerQueue`.

The persistent queue in this case seems unnecessary, as `InitializePullRequests` can already restart conflict checking for any PRs that need it.
@brechtvl commented on GitHub (Feb 22, 2023):
The steps to reproduce this are: run `killall gitea` quickly while conflict checking is going on.
@wxiaoguang commented on GitHub (Feb 23, 2023):
This is a test for this problem
The results are quite surprising.
@zeripath commented on GitHub (Feb 25, 2023):
@wxiaoguang Thanks for your testcase. It's not quite right but it's pointed me in the correct direction.
Queues are extremely concurrent, and the handler in q1 is going to drop a lot of data. You cannot expect that no more data will be handled as soon as you tell a queue to shut down; in fact, you can almost guarantee that at least a few more items will be handled.
I'm working through it, trying to make it properly safe for concurrency, and I'll place it in modules/queue once I've slightly rewritten it. (I'm not sure what q3 is supposed to reliably do.)
Now to address the warning: if you switch on trace logging for modules/queue, you would see that the internal channel queue is explicitly drained at Terminate. There should be no loss of data; the warning is not the cause of the issue you're investigating.
I assume you're actually investigating things not being emptied on startup - and that's a real bug.
If you look at:
8540fc45b1/modules/queue/queue_disk_channel.go (L175-L197)
and compare with:
8540fc45b1/modules/queue/unique_queue_disk_channel.go (L212-L226)
I think you'll see the real bug.
@wxiaoguang commented on GitHub (Feb 25, 2023):
Actually, I do not care about whether the queue is reliable or not. And in most cases, the queue doesn't need to be reliable.
But `has3=19` seems to cause the bug described by this issue: these tasks are just stuck in the queue.
@zeripath commented on GitHub (Feb 25, 2023):
As I said, your testcase is somewhat incorrect in form. The inconsistent behaviour you're seeing is partially because of this, but also because of the bug I mentioned above.
Thank you for the testcase, though: whilst it's not quite right, it made me realise that the bug is the one I've mentioned above, which I've fixed in the attached PR. The testcase has also been fixed.
@wxiaoguang commented on GitHub (Feb 26, 2023):
After applying #23154, I can still see some strange results (just strange; I do not know whether they are right or wrong):
- `q2` always outputs `executed2=0, has2=20`. Does that mean that if there is no `Push` after startup, the tasks in the queue will never be executed?
- `q3` uses `Has` to limit tasks to being executed only once (the code is in the details below; maybe like `checkAndUpdateStatus` in the real code?), yet the output is `executed3=107, has3=0`. That means many tasks are still re-executed even with a `Has` check. I know that there is concurrency, but is this clear to future developers? Is existing code like `checkAndUpdateStatus`, which relies on `Has`, working as expected in all cases?

Are these two behaviors by design? If yes, I think some comments would be very useful. If no, should they be fixed or not?
Feel free to ignore my comment if you think they are unrelated.
PS: I didn't get what you mean by "incorrect" or "not quite right". That test code is just used to test the queue's behavior; there is no assertion in it, and I didn't say anything about "correct" or "right" before ...