Robots.txt Detected

Summary

Invicti detected a Robots.txt file with potentially sensitive content.

Impact

Depending on the content of the file, an attacker might discover hidden directories and files.

Remediation

Ensure that nothing sensitive is exposed in this file, such as the path of an administration panel. If the disallowed paths are sensitive and you want to keep them from unauthorized access, do not list them in Robots.txt; instead, ensure they are properly protected by authentication.
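As a starting point for reviewing the file, the check described above can be sketched as a small script that flags Allow and Disallow entries whose paths look sensitive. The keyword list and the function name `find_sensitive_rules` are assumptions made for illustration; they are not part of Invicti's detection logic and should be adapted to your own application.

```python
# Hypothetical keyword list; tune it to the paths your application uses.
SENSITIVE_HINTS = ("admin", "backup", "config", "login", "private", "secret")


def find_sensitive_rules(robots_txt: str) -> list[str]:
    """Return Allow/Disallow paths that contain a sensitive-looking keyword."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() not in ("allow", "disallow"):
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged


sample = """User-agent: *
Disallow: /admin-panel/
Disallow: /tmp/
Disallow: /db-backup/
"""
print(find_sensitive_rules(sample))  # ['/admin-panel/', '/db-backup/']
```

Any path this reports should be removed from Robots.txt and placed behind authentication rather than merely hidden.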

Robots.txt only tells search robots which resources they may crawl and which they should skip. It is not an access control mechanism: anyone can read the file and request the paths listed in it.

The following block can be used to tell the crawler to index files under /web/ and ignore the rest:

User-Agent: *
Allow: /web/
Disallow: /

Please note that when you use the instructions above, search engines will not index your website except for the specified directories.
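To confirm how a well-behaved crawler would interpret these rules, they can be evaluated with Python's standard `urllib.robotparser` module. This is a minimal sketch: the helper `is_crawlable` and the sample paths are made up for illustration, and real search engines may resolve conflicting rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules from the block above.
ROBOTS_RULES = """User-Agent: *
Allow: /web/
Disallow: /
"""


def is_crawlable(path: str, user_agent: str = "*") -> bool:
    """Report whether the rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_RULES.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical paths used only to exercise the rules.
for path in ("/web/index.html", "/admin/", "/"):
    print(path, is_crawlable(path))
```

Only the path under /web/ is reported as crawlable. The rules are advisory, so this shows what compliant crawlers do, not what an attacker can reach.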

If you want to hide certain sections of the website from search engines, the X-Robots-Tag response header can be set to tell crawlers whether a file should be indexed:

X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow

By using X-Robots-Tag, you don't have to list these files in your Robots.txt.
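When the header is set by the application rather than the web server, the idea can be sketched as a small WSGI application that adds X-Robots-Tag to responses under selected path prefixes. The prefixes, the `app` callable, and the `headers_for` helper are assumptions for illustration only; a real application would add the header in its own framework's response handling.

```python
# Hypothetical path prefixes that should stay out of search indexes.
NOINDEX_PREFIXES = ("/reports/", "/internal-docs/")


def app(environ, start_response):
    """Minimal WSGI app that adds X-Robots-Tag for selected paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex, nofollow"))
    start_response("200 OK", headers)
    return [b"ok\n"]


def headers_for(path):
    """Call the app directly and return the response headers as a dict."""
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["headers"]


print(headers_for("/reports/q3.pdf").get("X-Robots-Tag"))  # noindex, nofollow
print(headers_for("/index.html").get("X-Robots-Tag"))      # None
```

The paths never appear in Robots.txt, so the file reveals nothing about them. The header only affects indexing, so sensitive paths still need authentication.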

It is also not possible to prevent media files from being indexed by using robots meta tags, because meta tags can only be placed in HTML documents. X-Robots-Tag resolves this issue as well.

For Apache, the following snippet can be placed in httpd.conf or an .htaccess file to prevent crawlers from indexing PDF and image files without listing them in Robots.txt (the Header directive requires mod_headers):

<Files ~ "\.pdf$">
 # Don't index PDF files.
 Header set X-Robots-Tag "noindex, nofollow"
</Files>

<Files ~ "\.(png|jpe?g|gif)$">
 # Don't index image files.
 Header set X-Robots-Tag "noindex"
</Files>
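The `<Files ~ ...>` patterns are regular expressions, so it can help to check which file names they select before deploying the configuration. The sketch below approximates the matching with Python's `re` module, assuming the dot before each extension is escaped as `\.` so that it matches a literal dot. Apache uses its own regex engine, and the helper name `x_robots_tag_for` is hypothetical, so treat this as a rough check rather than an exact model of Apache's behavior.

```python
import re

# Patterns and header values mirroring the Apache snippet above.
HEADER_RULES = [
    (re.compile(r"\.pdf$"), "noindex, nofollow"),
    (re.compile(r"\.(png|jpe?g|gif)$"), "noindex"),
]


def x_robots_tag_for(filename):
    """Return the X-Robots-Tag value the rules would set, or None."""
    for pattern, value in HEADER_RULES:
        if pattern.search(filename):
            return value
    return None


# Hypothetical file names used only to exercise the patterns.
for name in ("report.pdf", "photo.jpeg", "index.html", "notes-pdf"):
    print(name, "->", x_robots_tag_for(name))
```

With the escaped dot, a name such as notes-pdf is not matched; an unescaped `.` would match any character and tag that file as well.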

Classifications

ISO27001-A.18.1.3
