OOM caused by numerous crawls #14106

Closed
opened 2025-11-02 11:02:59 -06:00 by GiteaMirror · 24 comments

Originally created by @H0llyW00dzZ on GitHub (Feb 6, 2025).

Description

~~In the latest versions, 1.23.2 and 1.23.3, memory leaks occur.~~ (Update: see below. This is neither a memory leak nor a regression.)

These OOMs are caused by numerous crawls, such as those used by Facebook Inc. (Meta), Amazon (AWS), and other entities that fetch data excessively for AI training.

My Gitea self-hosted configuration:

  • Sessions using files
  • Cache using Redis with a TTL of 5 hours, and the last commit cache is 10K
  • No SSH

Screenshots

Image
Image

The logs exemplify how these companies use crawls for their AI.

Image

Essentially, memory leaks occur when there are many fetch requests, leading to crashes due to excessive memory consumption (the pod is OOM-killed by Kubernetes).

GiteaMirror added the issue/needs-feedback label 2025-11-02 11:02:59 -06:00

@wxiaoguang commented on GitHub (Feb 6, 2025):

Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?

The report contains heap dump (no sensitive data) and could help to locate the problem.

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
>
> The report contains heap dump (no sensitive data) and could help to locate the problem.

Here is the system notice:

Image

This is the system status page, which shows inconsistent values, as I mentioned earlier in #33311.

Image

@wxiaoguang commented on GitHub (Feb 6, 2025):

Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?

The report contains heap dump (no sensitive data) and could help to locate the problem.

@wxiaoguang commented on GitHub (Feb 6, 2025):

If the memory is not used by the Gitea process, then you may need to figure out which process is consuming it: for example, a git process or some other command.

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

> Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
>
> The report contains heap dump (no sensitive data) and could help to locate the problem.

I can't capture the memory usage via the trace admin panel when it spikes, because every time memory consumption goes high (e.g., 7 GiB), the pod crashes after being OOM-killed by Kubernetes.

@wxiaoguang commented on GitHub (Feb 6, 2025):

> > Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> > The report contains heap dump (no sensitive data) and could help to locate the problem.
>
> I can't capture the memory usage when it spikes via the trace admin panel because every time memory consumption goes high (e.g., 7 GiB), it crashes due to OOM Kubernetes.

Is it clear which process consumes that much memory? The Gitea web server process itself, or other processes like "ssh", "git", or "gitea serve/hook"?

@wxiaoguang commented on GitHub (Feb 6, 2025):

> The logs exemplify how these companies use crawls for their AI.
>
> Essentially, memory leaks occur when there are many fetch requests, leading to crashes due to excessive memory consumption (thanks to OOM Kubernetes).

If the OOM is caused by crawls, then it isn't a regression: each request consumes memory, and some large repositories or files consume more. If there are a lot of requests, they do consume a lot of memory and can lead to OOM. Maybe you could try to stop the crawls and/or require sign-in for your instance.

So I think we need to make the problem clearer.
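
The per-request reasoning in this comment can be sketched as a back-of-envelope calculation. This is a rough illustration rather than a measurement: the 7 GiB figure is the level reported in this thread, while the idle footprint and the per-request cost are hypothetical placeholders, and `max_concurrent_requests` is an illustrative helper name, not anything from Gitea.

```python
# Back-of-envelope estimate. Every number except the 7 GiB figure from the
# thread is a hypothetical assumption, not a measured value.
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_concurrent_requests(limit_bytes: int, baseline_bytes: int, per_request_bytes: int) -> int:
    """Return how many in-flight requests fit before the memory limit is reached."""
    headroom = limit_bytes - baseline_bytes
    if headroom <= 0:
        return 0
    return headroom // per_request_bytes


limit = 7 * GIB          # level at which the pod was reported to be OOM-killed
baseline = 1 * GIB       # hypothetical idle footprint of the instance
per_request = 48 * MIB   # hypothetical cost of one expensive page view

print(max_concurrent_requests(limit, baseline, per_request))  # prints 128
```

Under these assumed numbers, roughly a hundred simultaneous expensive page views would be enough to exhaust the limit, with no leak involved.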

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

> > > Could you download a diagnosis report from "admin panel -> monitor -> trace" when the memory goes high?
> > > The report contains heap dump (no sensitive data) and could help to locate the problem.
> >
> > I can't capture the memory usage when it spikes via the trace admin panel because every time memory consumption goes high (e.g., 7 GiB), it crashes due to OOM Kubernetes.
>
> Is it clear that which process consumes that much memory? The Gitea web server process itself, or other processes like "ssh" or "git" or "gitea serve/hook"?

Most likely, it's from Git because the stack trace shows this:

Image

Image

Image

When there are many requests, such as GET requests to view repositories from crawls, memory consumption goes high, and it crashes due to being OOM killed by Kubernetes.

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

Also right now, I've rolled back to version 1.23.1 and reduced the cache for last commit messages from 10K to 5K in the app.ini configuration. Let's see if it still crashes.

@wxiaoguang commented on GitHub (Feb 6, 2025):

TBH, I do not see any related change between 1.23.1 and 1.23.3:

https://github.com/go-gitea/gitea/compare/v1.23.1...v1.23.3

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

> TBH, I do not see related change between 1.23.1 ~ 1.23.3
>
> v1.23.1...v1.23.3

Well, it worked fine for me previously, with an uptime of over a month without crashing due to high memory consumption.

And now, after rolling back, it still crashes.

h0llyw00dzz@ubuntu-pro:~$ kubectl get pods -n gitea
NAME                     READY   STATUS    RESTARTS      AGE
gitea-5cb7dff998-xwb5r   1/1     Running   1 (40s ago)   10m

h0llyw00dzz@ubuntu-pro:~$ kubectl describe pods -n gitea
Containers:
  gitea:
    Container ID:   containerd://866d173132606a07e7937e7dfb430533cf1e5a8ad515044e496486416f6a485c
    Image:          gitea/gitea:1.23.1
    Image ID:       docker.io/gitea/gitea@sha256:c3be67d5c31694f8c27e5f3ab87630cceadf05abb795ab0ed70ba14b5edfc29c
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 06 Feb 2025 18:30:03 +0700
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 06 Feb 2025 18:20:10 +0700
      Finished:     Thu, 06 Feb 2025 18:30:01 +0700
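
The `Exit Code: 137` in the output above follows the common convention that a container killed by a signal exits with 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. A minimal Python sketch of that arithmetic, assuming a POSIX system (the helper name `describe_exit_code` is hypothetical and only for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code using the 128 + signal-number convention."""
    if code > 128:
        signal_number = code - 128
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signal_number}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # prints "killed by SIGKILL"
```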

@wxiaoguang commented on GitHub (Feb 6, 2025):

Well, as I said above: it can't be a regression, it can't be related to the new version.

There are just more crawls now. If you do not have enough resources to support the crawls, maybe you need to block them.

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

> Well, as I said above: it can't be a regression, it can't be related to the new version.
>
> There are just more crawls now. If you do not have that much resource support the crawls, maybe you need to block the crawls.

For now, I've enabled REQUIRE_SIGNIN_VIEW to block the crawls used by companies like Facebook (Meta) and Amazon (AWS) for training their AI. It seems they are likely abusing crawls for AI purposes.

Blocking these crawls by IP is ineffective because their IPs change frequently.

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

@wxiaoguang

The problem was solved by blocking their ASN, likely used for abusive AI training (e.g., Facebook Inc. (Meta), Amazon (AWS)). Now, only crawls from Google, used for indexing in their search engine, are allowed via Kubernetes Ingress Nginx. However, I believe it would be beneficial to expand the admin panel with additional features to block crawls based on IPs, User-Agent, and ASN. This would help prevent high memory consumption, likely due to memory leaks, which can cause crashes.
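
As a rough illustration of the kind of filter described in this comment, the sketch below decides whether to block a request from its client IP and User-Agent. It is not how Gitea or ingress-nginx implements blocking. The network ranges are documentation placeholders standing in for whatever prefixes an ASN lookup would return, the User-Agent tokens are examples that should be checked against your own access logs, and `should_block` is a hypothetical name. User-Agent strings can be spoofed, so this is only a first line of defence.

```python
import ipaddress

# Hypothetical placeholders: reserved documentation ranges standing in for
# the prefixes announced by an ASN you want to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

# Example crawler tokens (lowercase); verify against your own logs.
BLOCKED_UA_TOKENS = ("meta-externalagent", "facebookexternalhit", "amazonbot")


def should_block(ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked network or User-Agent token."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Unparsable client address: no network match is possible.
        return False
    return any(address in network for network in BLOCKED_NETWORKS)


print(should_block("203.0.113.7", "Mozilla/5.0 (compatible; Amazonbot/0.1)"))
```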

@H0llyW00dzZ commented on GitHub (Feb 6, 2025):

Here is proof that blocking the bad crawls used by Facebook Inc. (Meta) and Amazon (AWS) for AI training has effectively solved the memory usage issue. The instance was previously being crawled excessively for profit.

Image

Note

Memory usage has returned to normal, even with legitimate crawls like Google Search and others used for SEO, unlike the abusive AI training crawls from large companies such as Facebook Inc. (Meta) and Amazon (AWS).

@H0llyW00dzZ commented on GitHub (Feb 15, 2025):

@wxiaoguang I've resolved this problem by increasing the Redis cache pool size to 500 and switching the session storage from files to Redis, using the same pool size of 500. This results in a total pool size of 1000.

The Stats:

Redis:
Image

Pods:
Image
Image
Image

However, this solution is only temporary because, without Redis, memory consumption becomes excessive.

@wxiaoguang commented on GitHub (Apr 9, 2025):

In 1.23.7, we have this:

Add a config option to block "expensive" pages for anonymous users (#34024) (#34071)

@H0llyW00dzZ commented on GitHub (Apr 9, 2025):

> In 1.23.7, we have this:
>
> Add a config option to block "expensive" pages for anonymous users (#34024) (#34071)

@wxiaoguang, I've been trying that configuration option, but it seems similar to REQUIRE_SIGNIN_VIEW = true, which may not be ideal for open-source repositories. I think it would be more effective to implement a rate limiter based on IP addresses or user agents, or both, for areas that consume a lot of memory (e.g., example.com/repo/commit/sha1commit). This could reduce resource usage, such as memory, especially since many AI crawlers use the same IPs and user agents when crawling a site.
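
The rate limiter suggested here could look something like the token-bucket sketch below, keyed by the pair of client IP and User-Agent. It is a minimal illustration under stated assumptions, not Gitea's implementation: the class name `RateLimiter`, its parameters, and the default limits are all hypothetical, and a production version would also need to evict idle buckets and guard the dictionary with a lock.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class _Bucket:
    tokens: float      # tokens currently available
    updated_at: float  # clock reading at the last refill


class RateLimiter:
    """Token-bucket limiter keyed by (ip, user_agent). Hypothetical sketch."""

    def __init__(self, capacity: float = 10, refill_per_second: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self.buckets: Dict[Tuple[str, str], _Bucket] = {}

    def allow(self, ip: str, user_agent: str) -> bool:
        """Consume one token for this client; return False if none are left."""
        now = self.clock()
        key = (ip, user_agent)
        bucket = self.buckets.get(key)
        if bucket is None:
            # New clients start with a full bucket.
            bucket = _Bucket(tokens=self.capacity, updated_at=now)
            self.buckets[key] = bucket
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_per_second)
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False


limiter = RateLimiter(capacity=2, refill_per_second=0.5)
print([limiter.allow("192.0.2.1", "ExampleBot/1.0") for _ in range(3)])
```

A request handler for an expensive route would call `allow(...)` first and answer with HTTP 429 when it returns `False`; cheap routes could skip the check entirely.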

@wxiaoguang commented on GitHub (Apr 9, 2025):

> which may not be ideal for open-source repositories.

For "open source public site", my proposal is https://github.com/go-gitea/gitea/pull/33951#discussion_r2032324964

I don't run a public site, so I can't comment much on this problem.

@H0llyW00dzZ commented on GitHub (Apr 9, 2025):

> > which may not be ideal for open-source repositories.
>
> For "open source public site", my proposal is https://github.com/go-gitea/gitea/pull/33951#discussion_r2032324964
>
> I don't run a public site, so I can't comment too much for this problem.

@wxiaoguang, I run a public site primarily for mirroring repositories. Also, the implementation of #33951 could indeed help reduce resource usage. It's quite similar to a rate limiter, which would be beneficial in managing resource consumption effectively.

@wxiaoguang commented on GitHub (Apr 20, 2025):

#33951 has been merged. Does it work for your case?

@H0llyW00dzZ commented on GitHub (Apr 20, 2025):

> #33951 has been merged, does it work for your case?

@wxiaoguang I haven't tried it yet. My git site is using Gitea 1.23.7, not the nightly build, as I prefer long-term stability because it runs on k8s.

@H0llyW00dzZ commented on GitHub (May 1, 2025):

@wxiaoguang I've been using version 1.24.0-rc0. The performance is better now, unlike previously when memory usage increased a lot.

Image

However, I'm not sure yet if it's fixed, as my Gitea self-hosted site currently shows no crawling detected. I might update you later if crawling is detected.

@GiteaBot commented on GitHub (Jun 2, 2025):

We close issues that need feedback from the author if there were no new comments for a month. 🍵

Reference: github-starred/gitea#14106