Make the default /robots.txt reject all crawlers #255

Closed
opened 2025-11-02 03:16:06 -06:00 by GiteaMirror · 18 comments
Owner

Originally created by @sztanpet on GitHub (Jan 20, 2017).

I am wondering whether this is going too far or not. In my mind, privately set-up Gitea instances should be *private by default*, and that entails rejecting crawlers too, as a way to reduce surprise to the user.

GiteaMirror added the type/docs label 2025-11-02 03:16:06 -06:00

@sztanpet commented on GitHub (Jan 20, 2017):

Not to mention being as secure as possible by default while still keeping it easy to use. That should entail hiding version numbers, disabling Gravatar and other information-leaking features, making repositories private by default, etc., but that is a separate discussion.


@bkcsoft commented on GitHub (Jan 20, 2017):

Maybe not make it the default, but if `REQUIRE_SIGNIN_VIEW` is set to true and `/robots.txt` isn't found, Gitea could provide a default "block all robots" robots.txt 🙂


@bkcsoft commented on GitHub (Jan 20, 2017):

(Since if `REQUIRE_SIGNIN_VIEW` is set, it seems moot for a crawler to crawl it 😛 )


@sztanpet commented on GitHub (Jan 20, 2017):

Well yes, but at that point it doesn't really matter, so I think it doesn't go far enough.


@tboerger commented on GitHub (Jan 20, 2017):

IMHO we should not block everything by default. There are surely enough instances that don't want to block everything. Private repositories are blocked anyway because they're not accessible. If somebody really wants to enforce that, they can add a robots.txt to the custom folder.


@strk commented on GitHub (Jan 26, 2017):

I've just had a problem with robots, but in my case the service is running from a suburl, so serving a robots.txt from Gitea would not have helped. Unless I'm missing a specification allowing for that. What I've been reading (not much) came from http://www.robotstxt.org/robotstxt.html

For top-level installs, generating a robots.txt would indeed be good, as it would allow, for example, preventing bots from downloading archives for each committish, which in turn fills up disk space (see #769). According to the page above (robotstxt.org), you cannot use globs in a robots.txt file, so having it automatically generated helps with instances where everyone can create new repos...

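A robots.txt addressing strk's archive concern might look like the following. The owner/repo names are illustrative, and the paths assume Gitea's usual `/<owner>/<repo>/archive/...` download URLs; since the original robots.txt standard has no wildcard support, each repository needs its own entry, which is exactly why automatic generation would help:

```text
User-agent: *
# Deny per-committish archive downloads; without wildcard support
# this must be repeated for every repository on the instance.
Disallow: /alice/myrepo/archive/
Disallow: /bob/otherrepo/archive/
```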

@lunny commented on GitHub (Oct 14, 2019):

We could have two examples, one for private sites and another for public sites.


@guillep2k commented on GitHub (Oct 14, 2019):

I agree with some comments I've read: Gitea should come with a sensible default robots.txt for public sites, not as a sample but installed as the default. Users will of course be able to replace it as they see fit.

BTW: what are robots.txt used for in private sites?

EDIT: I thought it meant intranet sites, sorry!


@zeripath commented on GitHub (Oct 14, 2019):

I agree it is probably reasonable to provide a sensible example robots.txt for a basic public site - that's specific knowledge that's appropriate for Gitea. For private sites, we could put something in the website documentation, but it's basically:

```
User-agent: *
Disallow: /
```

I guess we have to decide what level of basic support we think we should give - but our documentation is supposed to cover specific Gitea information. This would probably class as basic hardening and therefore be just about appropriate.


@8ctopus commented on GitHub (Dec 31, 2019):

I just experienced the negative surprise of seeing my private Gitea repository indexed. I naively thought search engines would not find the subfolder on my website, but they did.
Based on my experience, I would suggest:

  1. create a default robots.txt that rejects all crawlers
  2. make it clear in the installation documentation that, by default, the Gitea installation will be indexed by search engines.

@tboerger commented on GitHub (Dec 31, 2019):

Private repositories won't get indexed. You simply have public repos, which obviously get indexed if they are found by Google or other search engines.


@8ctopus commented on GitHub (Jan 4, 2020):

@tboerger what I mean is that I have a repo I need to share with fellow team members that I don't want indexed by search engines. For ease of use, I also opted to have the URL publicly accessible, provided you know the address.


@tboerger commented on GitHub (Jan 4, 2020):

Then you should add a custom robots.txt. Not everybody wants to hide all their repos. If something is private, make it private. Everything else should generally be fine to index.

The few exceptions that want to avoid it can add a robots.txt to the customization folder.


@lunny commented on GitHub (Oct 15, 2020):

A config item and an option on the installation page should let users choose whether to allow crawlers.


@techknowlogick commented on GitHub (Dec 9, 2020):

Documentation has been added for those who want to change their install.


@alexanderadam commented on GitHub (Dec 9, 2020):

In case someone is looking for it, you can find it [here](https://docs.gitea.io/en-us/search-engines-indexation/).


@Mikaela commented on GitHub (Dec 9, 2020):

I was hoping it would be something like https://git.nixnet.services/robots.txt and would advise which addresses to block so the same page doesn't appear in multiple languages, and how to allow indexing of everything except specific commits and similar pages that most likely aren't useful to a casual search-engine user.

I think some sort of explanation of the X-Robots-Tag header, or including it in Gitea and explaining its relationship to robots.txt, could also be useful, but I guess that is a separate issue.

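For the X-Robots-Tag header Mikaela mentions, one option independent of Gitea itself is to have a reverse proxy add it. A minimal nginx sketch, assuming nginx proxies to a Gitea instance on port 3000:

```nginx
location / {
    proxy_pass http://127.0.0.1:3000;
    # Ask compliant crawlers not to index or follow anything served here.
    # Unlike robots.txt, this also covers pages a crawler has already fetched
    # via an external link; "always" adds the header on error responses too.
    add_header X-Robots-Tag "noindex, nofollow" always;
}
```

Note that both robots.txt and X-Robots-Tag are advisory: they only affect well-behaved crawlers, so neither is a substitute for `REQUIRE_SIGNIN_VIEW` or private repositories.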

@techknowlogick commented on GitHub (Dec 9, 2020):

@Mikaela in the latest stable version I contributed a PR that removes the language-switching links from the footer, in favor of an alternative that provides the same functionality without exposing the links to crawlers.

Reference: github-starred/gitea#255