Mirror of https://github.com/go-gitea/gitea.git, synced 2026-03-18 06:03:09 -05:00
High Idle CPU usage (fresh install) #7208
Closed · opened 2025-11-02 07:19:25 -06:00 by GiteaMirror · 46 comments
Labels: Mirrored from GitHub Pull Request, performance/cpu
Originally created by @elmodor on GitHub (Apr 19, 2021).
Description
I have a high idle CPU usage with my upgraded gitea server, so I tried a fresh deployment and have the same issue.
After a fresh install (even 1 hour after) the CPU% usage is about 5% of my system. One user, no repository.
In my testing VM (also Debian 10, 1 CPU thread only) the idle CPU% usage is about 0.1% (with the exact same docker-compose.yml and configuration).
This happens with the 1.14.1 root or rootless version (I haven't tried others).
I know 5% is not that high, but should it really be that much with a fresh install and while idle?
According to htop, the CPU is used by /usr/local/bin/gitea -c /etc/gitea/app.ini web. I've set the logging to debug but there are no logs while idle.
Any help is appreciated :)
Screenshots
@tinxx commented on GitHub (Apr 26, 2021):
I'm seeing the same behaviour on my machine.
I have activated the access log to make sure no one is accessing my Gitea instance and triggering something.
It is definitely idle load.
On Log Level "Info" I can see a log like this every 10 seconds (:05, :15, :25, ...):
Can you confirm that?
Cheers,
Tinxx
@systemdarena commented on GitHub (Apr 26, 2021):
I am also noticing CPU usage that seems high for gitea being idle with no usage at all. On my server, which also hosts samba shares that are used every day, and other standard services like chrony, rsyslogd, crond, and firewalld (a Python daemon!), gitea had the highest TIME value after 24 days, by a large amount.
I also see the SQL query lines every 10 seconds, but I don't think that is the issue, since that only runs every 10 seconds, and the CPU usage spikes seem to be pretty constant, though they jump between threads. The parent process will be at 1% or 2% CPU usage according to htop with the refresh set to 1 second. With htop's refresh set much shorter, like 0.1 seconds, it jumps from 0% to 8% back and forth constantly.
What may be more telling is the strace output, which shows a constant polling of epoll_pwait() on several threads:
[pid 944513] 14:39:28.794308 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.794348 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.794389 <... epoll_pwait resumed>[], 128, 0, NULL, 824674766514) = 0
[pid 944547] 14:39:28.794424 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.794460 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.794498 futex(0xc00280e150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944514] 14:39:28.797018 <... nanosleep resumed>NULL) = 0
[pid 944514] 14:39:28.797112 futex(0x595ee58, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=16761648} <unfinished ...>
[pid 944513] 14:39:28.813747 <... epoll_pwait resumed>[], 128, 19, NULL, 19) = 0
[pid 944513] 14:39:28.813841 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.813953 epoll_pwait(3, <unfinished ...>
[pid 944514] 14:39:28.814087 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 944514] 14:39:28.814158 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 944513] 14:39:28.815166 <... epoll_pwait resumed>[], 128, 1, NULL, 19) = 0
[pid 944513] 14:39:28.815263 futex(0xc00280e150, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944547] 14:39:28.815395 <... futex resumed>) = 0
[pid 944513] 14:39:28.815425 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.815470 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.815510 <... epoll_pwait resumed>[], 128, 0, NULL, 824765049992) = 0
[pid 944547] 14:39:28.815546 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.815583 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.815622 futex(0xc00280e150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944514] 14:39:28.824461 <... nanosleep resumed>NULL) = 0
[pid 944514] 14:39:28.824553 futex(0x595ee58, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=2534876} <unfinished ...>
[pid 944513] 14:39:28.826860 <... epoll_pwait resumed>[], 128, 11, NULL, 19) = 0
[pid 944513] 14:39:28.826973 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.827089 epoll_pwait(3, <unfinished ...>
[pid 944514] 14:39:28.827298 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 944514] 14:39:28.827369 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 944513] 14:39:28.828289 <... epoll_pwait resumed>[], 128, 1, NULL, 19) = 0
[pid 944513] 14:39:28.828379 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.828495 futex(0xc00280e150, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944547] 14:39:28.828617 <... futex resumed>) = 0
[pid 944513] 14:39:28.828646 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.828686 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.828727 <... epoll_pwait resumed>[], 128, 0, NULL, 824707085640) = 0
[pid 944547] 14:39:28.828761 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.828797 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.828836 futex(0xc00280e150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944514] 14:39:28.837662 <... nanosleep resumed>NULL) = 0
[pid 944514] 14:39:28.837754 futex(0x595ee58, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=28499131} <unfinished ...>
[pid 944513] 14:39:28.866087 <... epoll_pwait resumed>[], 128, 37, NULL, 19) = 0
[pid 944513] 14:39:28.866181 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.866293 epoll_pwait(3, <unfinished ...>
[pid 944514] 14:39:28.866464 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 944514] 14:39:28.866536 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 944513] 14:39:28.867492 <... epoll_pwait resumed>[], 128, 1, NULL, 19) = 0
[pid 944513] 14:39:28.867585 futex(0xc00280e150, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944547] 14:39:28.867717 <... futex resumed>) = 0
[pid 944513] 14:39:28.867747 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.867794 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.867827 <... epoll_pwait resumed>[], 128, 0, NULL, 824697945778) = 0
[pid 944547] 14:39:28.867862 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.867917 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.867952 futex(0xc00280e150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944514] 14:39:28.876822 <... nanosleep resumed>NULL) = 0
[pid 944514] 14:39:28.876919 futex(0x595ee58, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=10181233} <unfinished ...>
[pid 944513] 14:39:28.887213 <... epoll_pwait resumed>[], 128, 19, NULL, 19) = 0
[pid 944513] 14:39:28.887311 epoll_pwait(3, <unfinished ...>
[pid 944514] 14:39:28.887354 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 944513] 14:39:28.887400 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944514] 14:39:28.887437 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 944513] 14:39:28.887495 futex(0xc00280e150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 944547] 14:39:28.887566 <... futex resumed>) = 0
[pid 944513] 14:39:28.887595 <... futex resumed>) = 1
[pid 944547] 14:39:28.887628 futex(0xc000680950, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944530] 14:39:28.887740 <... futex resumed>) = 0
[pid 944547] 14:39:28.887768 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.887802 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.887848 <... epoll_pwait resumed>[], 128, 0, NULL, 824672440224) = 0
[pid 944530] 14:39:28.887877 epoll_pwait(3, <unfinished ...>
[pid 944513] 14:39:28.887924 <... epoll_pwait resumed>[], 128, 0, NULL, 824701657488) = 0
[pid 944547] 14:39:28.887963 epoll_pwait(3, <unfinished ...>
[pid 944530] 14:39:28.887993 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944513] 14:39:28.888028 futex(0x595f6d0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944530] 14:39:28.888070 futex(0xc000680950, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944547] 14:39:28.894220 <... epoll_pwait resumed>[], 128, 6, NULL, 19) = 0
[pid 944547] 14:39:28.894315 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944547] 14:39:28.894437 futex(0xc000680950, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944530] 14:39:28.894563 <... futex resumed>) = 0
[pid 944547] 14:39:28.894599 epoll_pwait(3, <unfinished ...>
[pid 944530] 14:39:28.894644 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.894677 <... epoll_pwait resumed>[], 128, 0, NULL, 824677068466) = 0
[pid 944530] 14:39:28.894714 <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 944547] 14:39:28.894749 epoll_pwait(3, <unfinished ...>
[pid 944530] 14:39:28.894789 futex(0xc000680950, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 944514] 14:39:28.897652 <... nanosleep resumed>NULL) = 0
[pid 944514] 14:39:28.897739 futex(0x595ee58, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=17650309} <unfinished ...>
[pid 944547] 14:39:28.915023 <... epoll_pwait resumed>[], 128, 20, NULL, 19) = 0
[pid 944547] 14:39:28.915115 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944547] 14:39:28.915219 epoll_pwait(3, <unfinished ...>
[pid 944514] 14:39:28.915536 <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 944514] 14:39:28.915603 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 944547] 14:39:28.916414 <... epoll_pwait resumed>[], 128, 1, NULL, 19) = 0
[pid 944547] 14:39:28.916505 epoll_pwait(3, [], 128, 0, NULL, 0) = 0
[pid 944547] 14:39:28.916609 futex(0xc000680950, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 944547] 14:39:28.916735 epoll_pwait(3, <unfinished ...>
[pid 944530] 14:39:28.916770 <... futex resumed>) = 0
[pid 944547] 14:39:28.916805 <... epoll_pwait resumed>[], 128, 0, NULL, 824707085960) = 0
[pid 944530] 14:39:28.916841 epoll_pwait(3, <unfinished ...>
[pid 944547] 14:39:28.916880 epoll_pwait(3, <unfinished ...>
@techknowlogick commented on GitHub (Apr 27, 2021):
There are various scheduled tasks that happen periodically. What would be most helpful is information from pprof (specifically the diagram it provides), as then CPU usage could be traced throughout the codebase.
@ddan39 commented on GitHub (Apr 27, 2021):
big surprise...
pprof001.svg.pdf
i guess strace doesn't lie. i just wasted over an hour of my time for that stupid pprof that strace already clearly showed us.
@zeripath commented on GitHub (Apr 27, 2021):
@ddan39 if you had given us the out file instead of the svg we could have looked at what was causing the poll waits.
You may find that changing your queue configurations changes the amount of polling.
In terms of the GetUIDsAndNotificationCounts, this is simply the eventsource - if you do not want it to run, turn it off (see https://docs.gitea.io/en-us/config-cheat-sheet/#ui---notification-uinotification).
@ddan39 commented on GitHub (Apr 27, 2021):
Ah, yeah, it was getting late and I was getting a bit frustrated trying to get pprof to even work when I have zero Go knowledge. Sorry about responding a bit harshly. I was going to upload more files (well, to be honest, I was trying to upload the svg first, which was denied) but github was only letting me attach certain file types to the comments. I probably should've just zipped them both together... it looks like .zip files are allowed. When I get off work I will attach some more files.
I will look into the queue settings, thanks.
I was surprised to see gitea seemingly polling the epoll_wait function so fast like that to be honest. With go being all about concurrency I figured it could just do a long blocking call... but again, I'm a newb here.
@elmodor commented on GitHub (Apr 27, 2021):
I got pprof and strace running inside the docker of gitea:
But whenever I try to kill the running gitea process to try to start it with pprof or strace I get booted out of the container. How do I run this inside the docker container?
@ddan39 commented on GitHub (Apr 27, 2021):
edit: removed bad info about pprof, easy to use it below
to get a profile, simply add to your app.ini [server] section:
ENABLE_PPROF = true
and then, after gitea has been running for a while, run the command:
go tool pprof -top ./path/to/gitea-bin http://127.0.0.1:6060/
@jolheiser commented on GitHub (Apr 27, 2021):
Gitea has pprof http endpoints built in.
See https://docs.gitea.io/en-us/config-cheat-sheet/#server-server
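As a sketch of what the above looks like in practice (the key and port are the ones quoted in this thread; check the cheat sheet for your version's exact options), the built-in profiler is enabled from app.ini:

```ini
; app.ini - enable Gitea's built-in pprof HTTP endpoints
; (by default they are served on 127.0.0.1:6060)
[server]
ENABLE_PPROF = true
```

After a restart, the `go tool pprof` command from the comment above can then be pointed at that address.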
@ddan39 commented on GitHub (Apr 27, 2021):
well, shit.
@elmodor commented on GitHub (Apr 27, 2021):
So thanks to Etzelia in Discord I got pprof to run.
Here's a short guide on how to do it:
https://github.com/go-gitea/gitea/issues/14772
Here is my output of 10min, this is a fresh rootless docker install. No repo, one user. Docker stats showed around 4-5% CPU usage of my host system (with a 1sec refresh rate of docker stats):
and second run
I attached the pprof output. Loadable with:
go tool pprof pprof.out
pprof1.pdf
pprof2.pdf
pprofs.zip
@elmodor commented on GitHub (Apr 30, 2021):
@jolheiser Can you do anything with those pprofs?
@zeripath commented on GitHub (Apr 30, 2021):
I mean, these are pretty low-level go runtime issues to do with selects and polling. I've put up a PR that might(?) help, if it's the case that having too many go-routines waiting on selects is to blame.
Could you try building #15686 to see if this helps?
@elmodor commented on GitHub (May 1, 2021):
Yeah, I fully understand that this might not be an easy fix, and not one that has a high priority.
I've built the rootless docker image based on your PR changes @zeripath, but sadly I did not see any change in the idle CPU usage on a fresh install (apk add takes ages inside the docker build container...). Still around 5% after booting it up, configuring it and creating one user.
If it helps I attached another pprof (2x10min) of the docker container running #15686
pprof.gitea.samples.cpu.pb.zip
@zeripath commented on GitHub (May 1, 2021):
OK,
So I'm really not certain what to do here - we could be chasing a ghost that is fundamentally not fixable but I think it's likely that the above pprofs and straces aren't really capturing what the real problem is - simply because by measuring they're causing wakeups.
My suspicion is that the wakeups are caused by the level queues work loop checking if there is work for it to do - it will currently do this every 100ms and there are more than a few queues that all wake up and check.
So, just to prove that, set in your app.ini:
(Remove/change any other [queue.*] sections as appropriate - they should all be TYPE=channel.)
Then just check with top as well as pprof or strace. (There are a few other potential things that could be causing frequent wake-ups too, like the DB connector. Does changing the DB type reduce this?)
Now - the presumption when this code was written was that this was a minimal and insignificant issue. As this is the third report, clearly that presumption is incorrect.
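As a minimal sketch of that test configuration (the section and key follow what is discussed in this thread; any extra [queue.*] sections in your install would need the same treatment):

```ini
; app.ini - force in-memory channel queues instead of the default
; persistable-channel, to test whether queue polling is the idle load
[queue]
TYPE = channel
```

Note the trade-off raised later in the thread: channel queues are not persisted, so unflushed items can be lost on shutdown.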
So what can be done?
Well... the majority of the queues actually use the persistable-channel type - which means that once the level queue is empty it should never get any more work. Any further polling in that case is unnecessary - but there are significant concurrency edge cases that make it hard to assert when that further polling can stop.
However, a leveldb can only be opened by one process at a time, so... we could realistically check the length of the queue and, if it is 0, block waiting for a push that gives us some work to do. The trouble is getting the concurrency correct here and handling stopping properly.
For redis queues, though, I'm not certain what we can do about polling.
In both cases, 100ms was chosen as a "reasonable" default fallback polling time, rather than doing some more complicated exponential backoff, as a balance between responsiveness and backend load.
@tinxx commented on GitHub (May 1, 2021):
Hi @zeripath,
I don't have time to look into pprof (this is a whole new topic for me), but setting the queue type to channel has a significant impact on the CPU usage on my small server. Where the idle CPU usage of Gitea was around 6% before the change, it is now down to around 3%.
Should I reset the queue type after testing, or what is the implication of setting it to channel? If the default is persistable-channel, that sounds like queues would survive Gitea restarts.
Cheers,
Tinxx
@zeripath commented on GitHub (May 1, 2021):
Here's another PR that will cause level db queues to simply wait on empty instead of polling at the cost of another select
@zeripath commented on GitHub (May 1, 2021):
@tinxx as I say above I am suspicious that pprof may be a red-herring. Could you try the #15693 PR to see if that reduces the default baseline load to similar to that for type=channel?
@tinxx commented on GitHub (May 1, 2021):
@zeripath I have managed to build your branch wait-on-empty and ran it on my server. After initial setup with one user, the load oscillates around 2.5%.
I ran a fresh release version and created a new user, etc., to find the CPU load oscillating around 6% again.
@elmodor commented on GitHub (May 1, 2021):
Hey @zeripath I appreciate the help!
The new PR seems to have reduced the CPU usage from 5% to maybe 3%-4%.
However, the queue change got it from 5% to 1%-2%.
I will be running pprofs and be posting them later.
@zeripath commented on GitHub (May 1, 2021):
OK so I think now it's worth checking if the combination: https://github.com/zeripath/gitea/tree/prs-15693-15686-15696 helps
@elmodor commented on GitHub (May 1, 2021):
So https://github.com/zeripath/gitea/tree/prs-15693-15686-15696 has a CPU usage of around 2%-3% in my case (without queue set to channel). This is better than the previous PRs :)
However, with that PR and queue's set to channel it's 1%-2%.
Do you still need any pprofs from the previous or this PR / queue channels?
@elmodor commented on GitHub (May 1, 2021):
top from #15693 (pprof.gitea.samples.cpu.002.pb.gz)
top from queue set to channel + #15693 (pprof.gitea.samples.cpu.001.pb.gz)
top from prs-15693-15686-15696 (pprof.gitea.samples.cpu.003.pb.gz):
top from prs-15693-15686-15696 + queue set to channel (pprof.gitea.samples.cpu.004.pb.gz):
pprof.gitea.samples.cpu.pbs.zip
@zeripath commented on GitHub (May 1, 2021):
OK I've just pushed another little change up to that branch and #15696 that would allow:
(Which could still be changed to TYPE=channel as necessary.)
The point is that this will not have worker go-routines running unless they're actually needed.
BTW, this [queue.*] in your app.ini - where does it come from? It doesn't mean or do anything.
@ddan39 commented on GitHub (May 1, 2021):
sorry for the delay, i saw someone else had already posted their profile so figured there was no rush to get mine, but i've attached it anyways, and the pprof top output is further down. this is actually the first time i've gotten back to my home PC since my last comment.
gitea.prof.zip
i have tried the code from prs-15693-15686-15696, the profile is attached and pprof top looks like:
pprof.gitea.samples.cpu.005.pb.gz
File: gitea
Build ID: f765ff3bcf5b25bdfd8831c602c0de66f0ef4b57
Type: cpu
Time: May 1, 2021 at 5:20pm (EDT)
Duration: 30s, Total samples = 110ms ( 0.37%)
Showing nodes accounting for 110ms, 100% of 110ms total
flat flat% sum% cum cum%
40ms 36.36% 36.36% 40ms 36.36% runtime.epollwait
20ms 18.18% 54.55% 20ms 18.18% context.(*cancelCtx).Done
20ms 18.18% 72.73% 20ms 18.18% runtime.futex
10ms 9.09% 81.82% 10ms 9.09% runtime.mallocgc
10ms 9.09% 90.91% 30ms 27.27% runtime.notesleep
10ms 9.09% 100% 10ms 9.09% runtime.wirep
and in case it helps, here is the head of my original profile top output:
File: gitea
Build ID: 950109434956ed68d7c7e0741a6ff3e8586f7990
Type: cpu
Time: Apr 27, 2021 at 12:40am (EDT)
Duration: 13.91mins, Total samples = 5.08s ( 0.61%)
Showing nodes accounting for 4.45s, 87.60% of 5.08s total
Dropped 173 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
1.15s 22.64% 22.64% 1.15s 22.64% runtime.epollwait
0.87s 17.13% 39.76% 0.87s 17.13% runtime.futex
0.20s 3.94% 43.70% 2.97s 58.46% runtime.findrunnable
0.12s 2.36% 46.06% 0.25s 4.92% runtime.selectgo
0.11s 2.17% 48.23% 0.18s 3.54% runtime.scanobject
0.10s 1.97% 50.20% 0.39s 7.68% runtime.checkTimers
0.09s 1.77% 51.97% 0.31s 6.10% runtime.mallocgc
0.09s 1.77% 53.74% 0.09s 1.77% runtime.nextFreeFast (inline)
0.07s 1.38% 55.12% 0.07s 1.38% runtime.(*mcache).prepareForSweep
0.07s 1.38% 56.50% 0.11s 2.17% runtime.nanotime (partial-inline)
0.07s 1.38% 57.87% 1.22s 24.02% runtime.netpoll
0.07s 1.38% 59.25% 0.62s 12.20% runtime.notesleep
0.07s 1.38% 60.63% 3.41s 67.13% runtime.schedule
0.06s 1.18% 61.81% 0.07s 1.38% runtime.lock2
@ddan39 commented on GitHub (May 1, 2021):
Running the code from b1f6a0cfd3 with the two queue settings set to WORKERS=0 and BOOST_WORKERS=1:
File: gitea
Build ID: dc8e3f5cdde5399ead75d4998c4c670eeef426bd
Type: cpu
Time: May 1, 2021 at 5:39pm (EDT)
Duration: 30s, Total samples = 130ms ( 0.43%)
Showing nodes accounting for 130ms, 100% of 130ms total
flat flat% sum% cum cum%
30ms 23.08% 23.08% 30ms 23.08% runtime.epollwait
10ms 7.69% 30.77% 10ms 7.69% github.com/syndtr/goleveldb/leveldb.(*version).walkOverlapping
10ms 7.69% 38.46% 10ms 7.69% runtime.(*randomOrder).start (inline)
10ms 7.69% 46.15% 10ms 7.69% runtime.findObject
10ms 7.69% 53.85% 90ms 69.23% runtime.findrunnable
10ms 7.69% 61.54% 10ms 7.69% runtime.futex
10ms 7.69% 69.23% 10ms 7.69% runtime.ifaceeq
10ms 7.69% 76.92% 10ms 7.69% runtime.lock2
10ms 7.69% 84.62% 10ms 7.69% runtime.nanotime1
10ms 7.69% 92.31% 10ms 7.69% runtime.nobarrierWakeTime (inline)
10ms 7.69% 100% 20ms 15.38% runtime.scanobject
0 0% 100% 20ms 15.38% code.gitea.io/gitea/modules/queue.(*ByteFIFOQueue).readToChan
0 0% 100% 10ms 7.69% code.gitea.io/gitea/modules/queue.(*LevelQueueByteFIFO).Pop
0 0% 100% 10ms 7.69% code.gitea.io/gitea/modules/queue.(*LevelUniqueQueueByteFIFO).Pop
0 0% 100% 20ms 15.38% gitea.com/lunny/levelqueue.(*Queue).RPop
0 0% 100% 10ms 7.69% gitea.com/lunny/levelqueue.(*UniqueQueue).RPop
0 0% 100% 20ms 15.38% github.com/syndtr/goleveldb/leveldb.(*DB).Get
0 0% 100% 20ms 15.38% github.com/syndtr/goleveldb/leveldb.(*DB).get
0 0% 100% 10ms 7.69% github.com/syndtr/goleveldb/leveldb.(*version).get
0 0% 100% 10ms 7.69% github.com/syndtr/goleveldb/leveldb.memGet
0 0% 100% 10ms 7.69% runtime.futexsleep
0 0% 100% 20ms 15.38% runtime.gcBgMarkWorker
0 0% 100% 20ms 15.38% runtime.gcBgMarkWorker.func2
0 0% 100% 20ms 15.38% runtime.gcDrain
0 0% 100% 10ms 7.69% runtime.lock (inline)
0 0% 100% 10ms 7.69% runtime.lockWithRank (inline)
0 0% 100% 10ms 7.69% runtime.mPark
0 0% 100% 90ms 69.23% runtime.mcall
0 0% 100% 10ms 7.69% runtime.nanotime (inline)
0 0% 100% 30ms 23.08% runtime.netpoll
0 0% 100% 10ms 7.69% runtime.notesleep
0 0% 100% 90ms 69.23% runtime.park_m
0 0% 100% 90ms 69.23% runtime.schedule
0 0% 100% 10ms 7.69% runtime.stopm
0 0% 100% 20ms 15.38% runtime.systemstack
with default config:
File: gitea
Build ID: dc8e3f5cdde5399ead75d4998c4c670eeef426bd
Type: cpu
Time: May 1, 2021 at 5:34pm (EDT)
Duration: 30s, Total samples = 410ms ( 1.37%)
Showing nodes accounting for 410ms, 100% of 410ms total
flat flat% sum% cum cum%
150ms 36.59% 36.59% 150ms 36.59% runtime.epollwait
60ms 14.63% 51.22% 60ms 14.63% runtime.futex
20ms 4.88% 56.10% 290ms 70.73% runtime.findrunnable
20ms 4.88% 60.98% 20ms 4.88% runtime.nobarrierWakeTime (inline)
20ms 4.88% 65.85% 40ms 9.76% runtime.scanobject
10ms 2.44% 68.29% 20ms 4.88% github.com/syndtr/goleveldb/leveldb/memdb.(*DB).Find
10ms 2.44% 70.73% 10ms 2.44% github.com/syndtr/goleveldb/leveldb/memdb.(*DB).findGE
10ms 2.44% 73.17% 10ms 2.44% runtime.(*mheap).allocSpan
10ms 2.44% 75.61% 10ms 2.44% runtime.(*mspan).markBitsForIndex (inline)
10ms 2.44% 78.05% 10ms 2.44% runtime.acquirep
10ms 2.44% 80.49% 10ms 2.44% runtime.checkTimers
10ms 2.44% 82.93% 10ms 2.44% runtime.findObject
10ms 2.44% 85.37% 10ms 2.44% runtime.lock2
10ms 2.44% 87.80% 70ms 17.07% runtime.mPark
10ms 2.44% 90.24% 10ms 2.44% runtime.madvise
10ms 2.44% 92.68% 20ms 4.88% runtime.mallocgc
10ms 2.44% 95.12% 310ms 75.61% runtime.mcall
10ms 2.44% 97.56% 10ms 2.44% runtime.selectgo
10ms 2.44% 100% 90ms 21.95% runtime.stopm
0 0% 100% 30ms 7.32% code.gitea.io/gitea/modules/queue.(*ByteFIFOQueue).readToChan
0 0% 100% 30ms 7.32% code.gitea.io/gitea/modules/queue.(*LevelQueueByteFIFO).Pop
0 0% 100% 20ms 4.88% code.gitea.io/gitea/modules/queue.(*WorkerPool).addWorkers.func1
0 0% 100% 20ms 4.88% code.gitea.io/gitea/modules/queue.(*WorkerPool).doWork
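For reference, a sketch of what the two queue settings at the top of this comment plausibly look like in app.ini (the section name is my assumption; the thread only names the keys WORKERS and BOOST_WORKERS):

```ini
; app.ini - keep no standing queue workers; boost one only when there is work
[queue]
WORKERS = 0
BOOST_WORKERS = 1
```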
@elmodor commented on GitHub (May 1, 2021):
I was hoping that [queue.*] would set this for all queue.* settings as well. Are there any other queue settings? queue.task?
I think we (well, mostly @zeripath ;) ) are making some good progress here.
Latest prs-15693-15686-15696 with:
it is around 2%-3%
it is around 0.5%-1%
pprof from this:
and with
it is a steady 0%!
pprof from this:
@ddan39 commented on GitHub (May 1, 2021):
i see that too when setting
i get near constant 0% cpu usage at idle. i can also see with strace that the constant rapid calling of epoll_pwait() and futex() no longer happen. i only see a group of calls like every 10 seconds that are pretty minimal.
are there any possible side-effects of using these settings?
@zeripath commented on GitHub (May 1, 2021):
[queue] is doing the work you think [queue.*] does. It's just that the issue indexer is annoying and has its own forced default.
So I think we can stop with the pprofs just now.
It looks like there are two fundamental issues behind your idle CPU usage:
#15696 & #15693 reduce around half of the work of point 1, but if that is still not enough then you will have to set TYPE=channel.
With the three PRs together, using channel queues - which does come at the cost of a potential loss of data on shutdown at times of load, so you should flush the queues before you shut down - and the below config, you should be able to get gitea down to 30 goroutines when absolutely idle.
I think there may be a way of getting the persistable channel queue to shutdown its runner once it's empty so I will continue to see if that can be improved. Another option is to see if during shutdown we can flush the channel queues to reduce the risk of leaving something in them.
I'm not otherwise certain if I can reduce the number of basal goroutines further but it looks like that's enough.
There's likely some Linux resource setting you can increase that would allow go to avoid futex cycling with more goroutines but I'm no expert here and don't know.
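Pulling the settings discussed so far together, the low-idle configuration being tested at this point is plausibly something like the following sketch (assembled from keys named in this thread, not a verbatim copy of anyone's app.ini):

```ini
[queue]
TYPE = channel       ; in-memory only - flush queues before shutdown,
                     ; or keep persistable-channel with the PRs applied
WORKERS = 0
BOOST_WORKERS = 1
```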
@elmodor commented on GitHub (May 1, 2021):
I've let it run for a bit longer and computed the avg. CPU usage of the docker container in idle which was around 0.14% CPU usage average (with all three queue settings and all three PRs). That's already lots better than the 5% we started with before.
Loss of data would obviously not be good.
However, when running the flush command I get this:
This sounds like an error? I've run this as gitea and root inside the rootless container - both the same.
Thanks for your patience and awesome work so far!
@zeripath commented on GitHub (May 1, 2021):
I suspect your manager call needs you to set your config option correctly.
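For context, the flush in question is Gitea's manager subcommand; "set your config option correctly" refers to pointing it at the same app.ini the server uses. A sketch (paths are examples from earlier in this thread):

```
gitea -c /etc/gitea/app.ini manager flush-queues
```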
@elmodor commented on GitHub (May 2, 2021):
Indeed, that fixed it. It was quite late yesterday...
Anyway, is there any more testing/feedback that you require, or is this the maximum reduction in goroutines for now? (not complaining, I'm already quite happy that you were able to reduce it this much)
Not sure if the flush could be done automatically on a docker shutdown/stop
@zeripath commented on GitHub (May 2, 2021):
I've changed the persistable-channel queues to shut down their level dbs once they're empty, or not run them at all if they're empty, which brings the number of baseline goroutines on the three-pr branch with default configuration to 80, and 74 with WORKERS=0, BOOST_WORKERS=1 as above.
I'll have a think about what we can do about reducing these further.
With the current state of prs-15693-15686-15696 76e34f05fbcc3d0069e708a440f882cf4da77a01
and app.ini
The starting number of go-routines is down to 64-65 ish.
I can think of potentially one more small improvement in the queues infrastructure (using contexts instead of channels), but it may not be worth it, and it may not be possible to reduce the number of goroutines in the persistable channel further. I still need to figure out some way of improving polling to reduce the baseline load of the RedisQueue - likely this will be with Redis Signals, but I'm not sure.
In terms of other baseline goroutines - there may be a few other places where the CancelFunc trick may apply.
@zeripath commented on GitHub (May 3, 2021):
Down to 59 now - we can stop looking at prs-15693-15686-15696 and just look at #15693, as I've moved #15686 into it; the channel-only variants are down to 23.
The next way to reduce the number of goroutines for persistable channels is to use a common queue instead.
@elmodor commented on GitHub (May 3, 2021):
Awesome work! I tested the prs one earlier today but I will test the new one tomorrow.
I also think that there are some background DB calls, because the only times Gitea (with channel queues) shows >0% CPU now coincide with CPU spikes in the DB (postgres).
@zeripath commented on GitHub (May 3, 2021):
So I think if you set:
Then the persistable-channel queues will all share one LevelDB instead of each opening their own. That should significantly reduce the number of goroutines and, in the current state of that PR, probably drop it to around 29 goroutines at rest.
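For illustration, the idea is to give every queue the same LevelDB connection string so they all share one database. The path and the exact connection-string syntax below are assumptions, not the elided snippet from this comment:

```ini
; Hypothetical sketch: point all queues at one shared LevelDB.
; The leveldb:// connection-string form and the path are
; illustrative assumptions - adjust to your installation.
[queue]
TYPE = persistable-channel
CONN_STR = leveldb:///var/lib/gitea/data/queues/common
```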
@zeripath commented on GitHub (May 3, 2021):
That's probably the event source. We currently poll the db even if there's no one connected but I think we can get it to stop when no one is connected.
@dicksonleong commented on GitHub (May 4, 2021):
On 1.14.1, using the below settings does reduce Gitea's idle CPU to almost 0%. However, when I create a new repository and push (via SSH), the web UI does not show the repository code (it still shows the "Quick Guide"), even though the push itself succeeds. Changing the workers to 1 fixes this.
@zeripath commented on GitHub (May 4, 2021):
The above config is not meant to work on 1.14.1; it requires changes made in 1.15, in particular the PRs I have linked.
These changes will almost certainly not be backported to 1.14.x.
@elmodor commented on GitHub (May 4, 2021):
I have tested it with 1.14.1 as well, but the CPU spikes are a bit higher. However, if I understood correctly the changes made in #15693 (which are not present in 1.14.x), running this with 1.14.1 risks data loss and other undefined behaviour.
So with #15693, using:
I had ~26 goroutines and a CPU usage of 0-0.5%, with spikes usually coinciding with the DB spikes.
And using:
I had ~57 goroutines and a CPU usage of 0.2-0.7%.
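(The two app.ini snippets compared above were elided in this mirror. Purely as a hypothetical sketch of the kind of variants under discussion - these values are illustrative, not the originals:)

```ini
; Hypothetical sketch only - values are illustrative assumptions.

; Channel-only variant (in-memory; flush queues before shutdown):
[queue]
TYPE = channel
WORKERS = 0
BOOST_WORKERS = 1

; Persistable-channel variant (the default type, backed by LevelDB):
; [queue]
; TYPE = persistable-channel
```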
@zeripath commented on GitHub (May 4, 2021):
OK, it looks like CONN_STR isn't doing the trick of forcing the use of a common LevelDB, otherwise the goroutine count should have been around half that. I'll take a look - but it might be that we have to explicitly tell the queues to use a common DB.
@zeripath commented on GitHub (May 4, 2021):
The below is the current configuration for #15693 using common queues:
That has a baseline goroutine count of 29.
Those on 1.14.1 who want reductions should find that:
should cause significant reductions in baseline goroutine numbers - although nowhere near as low as #15693 allows.
I had intended to make it possible to force a common LevelDB connection by setting CONN_STR appropriately, but I suspect handling the people who insist on copying app.example.ini as their app.ini has caused this to break. I suspect just adding a new option to [queue] could be used to enforce a common LevelDB connection.
@elmodor commented on GitHub (May 4, 2021):
Yeah, with that configuration for #15693 in docker I got ~32 goroutines and a CPU usage of around 0.05%-0.8%.
Since you haven't used channel queues in the recent settings, I assume using channels should be avoided?
@zeripath commented on GitHub (May 4, 2021):
Whether the channel-only queue is appropriate for you is a decision about your data safety and how you start and shut down your instance. With #15693 the channel queue will at least attempt to flush itself at shutdown, but... there's still no guarantee that every piece of data in the queue is dealt with.
If you have an absolutely low load - and if you flush before shutting down - then it should be fine to use channel-only queues. But, as I've shown above, it's possible to get the baseline goroutine count down to 29 using a LevelDB backend, which doesn't have this gotcha.
It's up to you really - persistable channels probably have to be the default, which is why I've been working to reduce the baseline load there. Other queue implementations are possible too.
@Gnlfz007 commented on GitHub (Jul 14, 2021):
Isn't there a way to reduce the idle load to 0%, possibly giving up some functionality?
Background: I'm running it on my very slow NAS (an N40L), where I get an idle load of ~3%. That's not acceptable to me; I kick out any service that raises my electricity bill by permanently consuming CPU. My current workaround is systemctl { stop | start } gitea.
Btw, the performance of Gitea is incredible on this slow NAS! Probably because it's tight compiled code instead of lame web scripts or Java bloat.
@zeripath commented on GitHub (Jul 14, 2021):
I am guessing that you're running 1.14.x and not 1.15.x/main.
Please try main.
I think that since all the PRs mentioned above (and now explicitly linked to this issue) have been merged, we can close this issue, as the baseline load of an idle Gitea is now back to almost zero.