Make the default /robots.txt reject all crawlers #255

Closed
opened 2025-11-02 03:16:06 -06:00 by GiteaMirror · 18 comments
Owner

Originally created by @sztanpet on GitHub (Jan 20, 2017).

I am wondering whether this is going too far or not. In my mind, privately set-up Gitea instances should be *private by default*, and that entails rejecting crawlers too, as a way to reduce surprise to the user.

GiteaMirror added the type/docs label 2025-11-02 03:16:06 -06:00

@sztanpet commented on GitHub (Jan 20, 2017):

Not to mention being as secure as possible by default while still keeping it easy to use. That should entail hiding version numbers, disabling Gravatar and other information-leaking features, making repositories private by default, etc., but that is a separate discussion.


@bkcsoft commented on GitHub (Jan 20, 2017):

Maybe not make it the default, but if `REQUIRE_SIGNIN_VIEW` is set to true and `/robots.txt` isn't found, Gitea could provide a default "block all robots" robots.txt 🙂


@bkcsoft commented on GitHub (Jan 20, 2017):

(Since if `REQUIRE_SIGNIN_VIEW` is set, it seems moot for a crawler to crawl it 😛 )


@sztanpet commented on GitHub (Jan 20, 2017):

Well yes, but at that point it doesn't really matter, so I think it doesn't go far enough.


@tboerger commented on GitHub (Jan 20, 2017):

IMHO we should not block everything by default. There are surely enough instances that don't want to block everything. Private repositories are blocked anyway because they're not accessible. If somebody really wants to enforce that, they can add a robots.txt to the custom folder.


@strk commented on GitHub (Jan 26, 2017):

I've just had a problem with robots, but in my case the service is running from a suburl, so serving a robots.txt from Gitea would not have helped. Unless I'm missing a specification allowing for that. What I've been reading (not much) came from http://www.robotstxt.org/robotstxt.html

For top-level installs, generating a robots.txt would indeed be good, as it would allow, for example, preventing bots from downloading archives for each committish, which in turn fills up disk space (see #769). According to the page above (robotstxt.org), you cannot use globs in a robots.txt file, so having it automatically generated helps with instances where everyone can create new repos...

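A robots.txt addressing strk's archive concern might look like the following. The owner/repo names are illustrative, and the paths assume Gitea's usual `/<owner>/<repo>/archive/...` download URLs; since the original robots.txt standard has no wildcard support, each repository needs its own entry, which is exactly why automatic generation would help:

```text
User-agent: *
# Deny per-committish archive downloads; without wildcard support
# this must be repeated for every repository on the instance.
Disallow: /alice/myrepo/archive/
Disallow: /bob/otherrepo/archive/
```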

@lunny commented on GitHub (Oct 14, 2019):

We could have two examples, one for private sites and another for public sites.


@guillep2k commented on GitHub (Oct 14, 2019):

I agree with some comments I've read: Gitea should come with a sensible default robots.txt for public sites, not as a sample but installed as the default. Users will of course be able to replace it as they see fit.

BTW: what are robots.txt used for in private sites?

EDIT: I thought it meant intranet sites, sorry!


@zeripath commented on GitHub (Oct 14, 2019):

I agree it is probably reasonable to provide a sensible example robots.txt for a basic public site - that's specific knowledge that's appropriate for Gitea. For private sites, we could put something in the website documentation, but it's basically:

```
User-agent: *
Disallow: /
```

I guess we have to decide what level of basic support we think we should give - but our documentation is supposed to cover specific Gitea information. This would probably class as basic hardening and therefore be just about appropriate.


@8ctopus commented on GitHub (Dec 31, 2019):

I just experienced the negative surprise of seeing my private Gitea repository indexed. I naively thought search engines would not find the subfolder on my website, but they did.
Based on my experience, I would suggest:

  1. create a default robots.txt that rejects all crawlers
  2. make it clear in the installation documentation that, by default, the Gitea installation will be indexed by search engines.

@tboerger commented on GitHub (Dec 31, 2019):

Private repositories won't get indexed. You simply have public repos, which obviously get indexed if they are found by Google or other search engines.


@8ctopus commented on GitHub (Jan 4, 2020):

@tboerger what I mean is that I have a repo I need to share with fellow team members that I don't want indexed by search engines. For ease of use, I also opted to have the URL publicly accessible, provided you know the address.


@tboerger commented on GitHub (Jan 4, 2020):

Then you should add a custom robots.txt. Not everybody wants to hide all their repos. If something is private, make it private. Everything else should generally be fine to index.

The few exceptions that want to avoid it can add a robots.txt to the customization folder.


@lunny commented on GitHub (Oct 15, 2020):

A config item and an option on the installation page should let users choose whether to allow crawlers.


@techknowlogick commented on GitHub (Dec 9, 2020):

Documentation has been added for those who want to change their install.


@alexanderadam commented on GitHub (Dec 9, 2020):

In case someone is looking for it, you can find it [here](https://docs.gitea.io/en-us/search-engines-indexation/).


@Mikaela commented on GitHub (Dec 9, 2020):

I was hoping it would be something like https://git.nixnet.services/robots.txt and would advise which addresses to block so the same page doesn't appear in multiple languages, and how to allow indexing of everything except specific commits and similar pages that most likely aren't useful to a casual search-engine user.

I think some sort of explanation of the X-Robots-Tag header, or including it in Gitea and explaining its relationship to robots.txt, could also be useful, but I guess that is a separate issue.

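For the X-Robots-Tag header Mikaela mentions, one option independent of Gitea itself is to have a reverse proxy add it. A minimal nginx sketch, assuming nginx proxies to a Gitea instance on port 3000:

```nginx
location / {
    proxy_pass http://127.0.0.1:3000;
    # Ask compliant crawlers not to index or follow anything served here.
    # Unlike robots.txt, this also covers pages a crawler has already fetched
    # via an external link; "always" adds the header on error responses too.
    add_header X-Robots-Tag "noindex, nofollow" always;
}
```

Note that both robots.txt and X-Robots-Tag are advisory: they only affect well-behaved crawlers, so neither is a substitute for `REQUIRE_SIGNIN_VIEW` or private repositories.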

@techknowlogick commented on GitHub (Dec 9, 2020):

@Mikaela in the latest stable version I contributed a PR that removes the language-switching links from the footer, in favor of an alternative that provides the same functionality without exposing the links to crawlers.

Reference: github-starred/gitea#255