Actions Runner processes just one task, then waits forever. (Possible bug in ActionTasksVersion) #13025

Closed
opened 2025-11-02 10:28:00 -06:00 by GiteaMirror · 5 comments
Owner

Originally created by @honx on GitHub (May 24, 2024).

Description

This will be a bit longer. I'll describe the behaviour first, then what I found by debugging:

I am running a private instance of gitea 1.22.0+rc1-126-gec771fdfcd and act_runner release 0.2.10.
The runner is registered as a repository runner.

When act_runner connects and finds a (matching) task waiting, it will take the task and run it correctly and without problems. It will then wait for more tasks.

However, if I then trigger another task (of the same workflow), it will not take that task and just sits idle forever.

If it connects and there is no task waiting, it keeps waiting. If a workflow is then triggered and a task is waiting, the runner never receives or handles it.

So the runner config and operation itself is fine but it will only ever process one job.

This is what I found after some debugging:

The runner will continuously call /api/actions/runner.v1.RunnerService/FetchTask but never receives more than one task, so the problem seemed to be on the gitea side.

I looked into FetchTask() in ./routers/api/actions/runner/runner.go

After some debugging, I found that the comparison

https://github.com/go-gitea/gitea/blob/main/routers/api/actions/runner/runner.go#L155

is why no second task is delivered.

At the first (working) call, the runner sends tasksVersion as 0 and that is compared with latestVersion which is 1. So pickTask gets called and a task is sent back.

From the next call on, the runner sends 1, it gets compared to latestVersion 1 and pickTask is not called.
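
The exchange described above can be modelled in a few lines of Go. This is a simplified sketch based only on the behaviour observed here, not Gitea's actual code: the names `fetchTask` and `simulatePolls` are hypothetical, and the server's version is pinned at 1 to mirror what the runner saw.

```go
package main

import "fmt"

// fetchTask models the server-side check described above (hypothetical name):
// it reports whether a task lookup would happen, and returns the version
// the runner should send on its next poll.
func fetchTask(runnerVersion, latestVersion int64) (picked bool, nextVersion int64) {
	picked = runnerVersion != latestVersion
	return picked, latestVersion
}

// simulatePolls replays n polls against a server whose version never changes.
func simulatePolls(n int, latestVersion int64) []bool {
	results := make([]bool, 0, n)
	var runnerVersion int64 // a fresh runner starts at 0
	for i := 0; i < n; i++ {
		picked, next := fetchTask(runnerVersion, latestVersion)
		results = append(results, picked)
		runnerVersion = next
	}
	return results
}

func main() {
	// With the server stuck at version 1, only the first poll looks for a task.
	fmt.Println(simulatePolls(3, 1)) // [true false false]
}
```

Once the runner has echoed back the version it was given, every later poll is skipped for as long as the server keeps reporting the same number.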

The comparison is part of

https://github.com/go-gitea/gitea/issues/24544

I dug deeper into GetTasksVersionByScope and IncreaseTaskVersion.

IncreaseTaskVersion is also called each time a job changes state.

IncreaseTaskVersion calls increaseTasksVersionByScope for 3 scopes: once with the ownerID set to 0, once with the repoID set to 0, and once with both set to 0.

Also, increaseTasksVersionByScope will implicitly insert a record if no existing one is found.

So for my runner with ownerID 3 and repoID 1, I get three records:

mysql> select * from action_tasks_version;
+----+----------+---------+---------+--------------+--------------+
| id | owner_id | repo_id | version | created_unix | updated_unix |
+----+----------+---------+---------+--------------+--------------+
|  1 |        0 |       0 |   15990 |   1716545692 |   1716545692 |
|  2 |        3 |       0 |   24066 |   1716545692 |   1716545692 |
|  3 |        0 |       1 |   15990 |   1716545692 |   1716545692 |
+----+----------+---------+---------+--------------+--------------+
3 rows in set (0.00 sec)

However, querying the version is done differently.

GetTasksVersionByScope just does:

has, err := db.GetEngine(ctx).Where("owner_id = ? AND repo_id = ?", ownerID, repoID).Get(&tasksVersion)

For ownerID 3 and repoID 1, this never returns a matching row, since the insert always sets at least one of the two values to 0.
The function therefore returns 0 because no row was found (IncreaseTaskVersion then gets called, but that again won't create a matching entry).

The value of 0 gets increased to 1 by latestVersion++ and it gets compared to the 1 the runner sent. It matches, so pickTask is not called.
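
The mismatch between the write path and the read path can be reproduced with an in-memory stand-in for the table. This is a minimal sketch under the assumptions stated in this report; the `store` type and its methods are hypothetical and only imitate the behaviour described for IncreaseTaskVersion, GetTasksVersionByScope, and the latestVersion++ step.

```go
package main

import "fmt"

// scope identifies a row in a table like action_tasks_version.
type scope struct{ ownerID, repoID int64 }

// store is an in-memory stand-in for the table (hypothetical type).
type store map[scope]int64

// increase models the write path as described: three scopes are bumped,
// and a missing row is created implicitly.
func (s store) increase(ownerID, repoID int64) {
	s[scope{0, 0}]++
	s[scope{ownerID, 0}]++
	s[scope{0, repoID}]++
}

// get models the read path: an exact match on both columns, 0 if absent.
func (s store) get(ownerID, repoID int64) int64 {
	return s[scope{ownerID, repoID}]
}

// latestFor models the handling of a zero version as described:
// trigger an increase, then treat the version as 1.
func (s store) latestFor(ownerID, repoID int64) int64 {
	v := s.get(ownerID, repoID)
	if v == 0 {
		s.increase(ownerID, repoID)
		v++
	}
	return v
}

func main() {
	s := store{}
	for i := 0; i < 5; i++ {
		s.increase(3, 1) // five job state changes
	}
	// A runner with both IDs non-zero always sees version 1.
	fmt.Println(s.latestFor(3, 1), s.latestFor(3, 1)) // 1 1
	// A runner keyed only by repoID sees a version that actually moves.
	fmt.Println(s.latestFor(0, 1)) // 7
}
```

No call to `increase` ever writes the key (3, 1), so the lookup for that pair stays at 0 no matter how many jobs change state.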

I may completely misunderstand how this is supposed to work (and I have never programmed in Go), but it seems this whole mechanism will not work for repository scope runners where both ownerID and repoID are non-zero.

A friend tried this out with an org-scope runner and there it works fine.

Since the whole mechanism is just an optimization, I could simply disable the comparison, and everything works again.

But I'm unsure how to fix this properly, so I did not include a PR.

Using the maximum of all scopes could work for this case

SELECT MAX(version) FROM action_tasks_version WHERE (owner_id = 3) or (repo_id = 1) or (owner_id = 0 and repo_id = 0);

but I'm not sure if this breaks other things or how to express this in xorm.
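
To make the intended semantics of that query concrete, here is the same selection written as plain Go over in-memory rows. It is only an illustration of the idea, not xorm code and not a proposed patch; the `row` type and `maxVersion` function are hypothetical, and the data is taken from the table output above.

```go
package main

import "fmt"

// row mirrors the columns of action_tasks_version that matter here.
type row struct{ ownerID, repoID, version int64 }

// maxVersion returns the highest version among the global row, the owner's
// rows, and the repository's rows, mirroring the WHERE clause above.
func maxVersion(rows []row, ownerID, repoID int64) int64 {
	var best int64
	for _, r := range rows {
		global := r.ownerID == 0 && r.repoID == 0
		if global || r.ownerID == ownerID || r.repoID == repoID {
			if r.version > best {
				best = r.version
			}
		}
	}
	return best
}

func main() {
	rows := []row{{0, 0, 15990}, {3, 0, 24066}, {0, 1, 15990}}
	fmt.Println(maxVersion(rows, 3, 1)) // 24066
}
```

With this selection, a runner with ownerID 3 and repoID 1 would see a version that changes whenever any of the three related rows is bumped.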

I didn't try to reproduce it on the demo site yet.

I set both gitea and act_runner to debug logging, but the runner is silent and gitea just produces lots of

2024-05-24T21:17:48.988208+00:00 localhost gitea[93710]: 2024/05/24 21:17:48 ...s/process/manager.go:188:Add() [T] Start 665103fc: POST: /api/actions/runner.v1.RunnerService/FetchTask (request)
2024-05-24T21:17:48.988286+00:00 localhost gitea[93710]: 2024/05/24 21:17:48 ...eb/routing/logger.go:47:func1() [T] router: started   POST /api/actions/runner.v1.RunnerService/FetchTask for 4.180.7.35:37748
2024-05-24T21:17:49.020301+00:00 localhost gitea[93710]: 2024/05/24 21:17:49 ...eb/routing/logger.go:102:func1() [I] router: completed POST /api/actions/runner.v1.RunnerService/FetchTask for 4.180.7.35:37748, 200 OK in 32.2ms @ <autogenerated>:1(http.Handler.ServeHTTP-fm)
2024-05-24T21:17:49.020413+00:00 localhost gitea[93710]: 2024/05/24 21:17:49 ...s/process/manager.go:231:remove() [T] Done 665103fc: POST: /api/actions/runner.v1.RunnerService/FetchTask

calls.

I have a lot of logs where I manually and very crudely added debug output. I can provide these, if they help.

Gitea Version

1.22.0+rc1-126-gec771fdfcd

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.43.0

Operating System

Ubuntu 24.04

How are you running Gitea?

Bug shows up both with the rc1 binary as well as self-built from main.

I run gitea from systemd without docker.

act_runner is running directly on the system, started from the command line. It processes tasks in docker.

Database

MySQL/MariaDB

GiteaMirror added the topic/gitea-actions, type/bug labels 2025-11-02 10:28:00 -06:00
Author
Owner

@honx commented on GitHub (May 24, 2024):

I cleaned up my debug logging a bit. Here is a gist:

https://gist.github.com/honx/525e27b360576f3c5fa11ca26bffa918

Author
Owner

@honx commented on GitHub (May 25, 2024):

I found a possible one-line fix and created the above PR. This fixes my problem and should also leave the behaviour of global and org scope runners unchanged.

However, to keep the database clean, some extra code for transfer of ownership of a repository and on installation may be good. (See comment in the PR.)

I'll keep testing this PR in my installation

Author
Owner

@honx commented on GitHub (May 27, 2024):

I did some more testing and repo runners still don't behave correctly. I'll do more debugging and report back. The fix I put in the PR does not seem sufficient.

Author
Owner

@honx commented on GitHub (May 27, 2024):

I closed the PR since I cannot say that the fix is actually correct.

I debugged a bit more, but I need further information before I can continue.

I am unsure how repo scope runners are properly recorded in the action_runner table.

The cause of the hanging runner was a repo scope runner that had both owner and repo ID non-zero:

select id,owner_id,repo_id from action_runner;
+----+----------+---------+
| id | owner_id | repo_id |
+----+----------+---------+
|  1 |        1 |       2 |
+----+----------+---------+

However, when I now add further repo scope runners, they only have repo_id set and the owner ID is 0.

So my question is:

Which is the correct format for a repo scope runner in the action_runner table:

ownerID = 0 and repoID = <repoid>

or

ownerID = <id of the repo owner> and repoID = <repoid>

?
(My original PR would handle the second alternative).

If the first alternative is correct, I would have to find out where the runner entry in action_runner came from.

Either way, there seems to be something wrong with the way entries in action_runner are created, since I see both formats in two of my instances and I made no manual DB changes on either of them.

But I currently have no idea where the entries with two non-zero fields come from, and I cannot reproduce their creation.

Author
Owner

@honx commented on GitHub (May 27, 2024):

Seems like the DB entries were wrong and "ownerID = 0 and repoID = <repoid>" is the correct format for repo scope runners.

Since I currently cannot reproduce the behaviour (i.e. the incorrect DB entries), I will close this ticket.

I have SQL query logging and debug logging turned on on my instance. If it shows up again, I will open a new issue.
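
For anyone checking their own action_runner table for the same symptom, the row formats discussed in this thread can be told apart with a small helper. This is a sketch only: `runnerScope` is a hypothetical function, and the mapping of (0, 0) to a global runner and (owner, 0) to an owner-level runner is my assumption, inferred from the three scopes that the version table uses.

```go
package main

import "fmt"

// runnerScope classifies an action_runner row by its owner_id and repo_id.
// Rows with both values non-zero are flagged, since they matched the
// hanging runner described in this issue.
func runnerScope(ownerID, repoID int64) string {
	switch {
	case ownerID == 0 && repoID == 0:
		return "global" // assumption: instance-wide runner
	case repoID == 0:
		return "owner" // assumption: org- or user-level runner
	case ownerID == 0:
		return "repo"
	default:
		return "inconsistent"
	}
}

func main() {
	fmt.Println(runnerScope(0, 2)) // repo
	fmt.Println(runnerScope(1, 2)) // inconsistent
}
```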


Reference: github-starred/gitea#13025