What is Robots.txt?
A robots.txt file restricts web crawlers, such as search engine bots, from accessing specific URLs on a website. It can also be used to adjust the crawling speed for some web crawlers.
All “good” web crawlers adhere to the rules specified in the robots.txt file. However, “bad” crawlers, often used for scraping, disregard the file entirely.
The robots.txt file should be used to manage and optimize crawler traffic to a website, not to control the indexing of web pages. Even if a URL is disallowed in robots.txt, Google can still index it if it is discovered via an external link.
Syntax of Robots.txt
The robots.txt syntax includes the following fields:
- User-agent: the crawler the rules apply to
- Disallow: a path that must not be crawled
- Allow: a path that can be crawled (optional)
- Sitemap: the location of the sitemap file (optional)
- Crawl-delay: controls the crawling speed (optional and not supported by Googlebot)
Here’s an example:
User-agent: RanktrackerSiteAudit
Disallow: /resources/
Allow: /resources/images/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
This robots.txt file instructs the RanktrackerSiteAudit crawler not to crawl URLs in the “/resources/” directory, except for those in “/resources/images/”, and sets a delay of 2 seconds between requests.
Why is the Robots.txt File Important?
The robots.txt file is important because it enables webmasters to control the behavior of crawlers on their websites, optimizing the crawl budget and restricting the crawling of website sections that are not intended for public access.
Many website owners choose to keep certain pages, such as author pages, login pages, or pages within a membership site, out of search results. They may also block the crawling of gated resources, like PDFs or videos, that require an email opt-in to access.
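For instance, rules like the following would ask compliant crawlers to skip PDF files in a hypothetical “/downloads/” directory (the * and $ wildcards are supported by Google and most major search engines):

User-agent: *
Disallow: /downloads/*.pdf$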
It’s worth noting that if you use a CMS like WordPress, the /wp-admin/ area is automatically disallowed for crawlers through the virtual robots.txt file that WordPress generates.
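On a default installation, that virtual file typically looks something like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php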
However, Google does not recommend relying solely on the robots.txt file to control the indexing of pages. And if you’re making changes to a page, such as adding a “noindex” tag, make sure the page is not disallowed in robots.txt. Otherwise, Googlebot won’t be able to see the tag and update its index in a timely manner.
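For reference, a “noindex” rule is usually added as a meta tag in the page’s HTML head (or sent as an X-Robots-Tag HTTP header), and Googlebot can only act on it if the page itself is crawlable:

<meta name="robots" content="noindex">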
FAQs
What happens if I don’t have a robots.txt file?
Most sites don’t strictly need a robots.txt file. Its purpose is to communicate specific instructions to search bots, which may not be necessary for a smaller website or one without many pages that need to be blocked from search crawlers.
With that said, there’s also no downside to creating a robots.txt file and having it live on your website. This will make it easy to add directives if you need to do so in the future.
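A minimal placeholder file that blocks nothing looks like this (an empty Disallow value means no URL is off-limits):

User-agent: *
Disallow: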
Can I hide a page from search engines using robots.txt?
Yes. Hiding pages from search engine crawlers is one of the primary functions of a robots.txt file. You can do this with the Disallow directive and the path of the URL you want to block.
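For example, to ask all crawlers to stay away from a hypothetical “/private-page/” URL:

User-agent: *
Disallow: /private-page/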
However, it’s important to note that simply hiding a URL from Googlebot using the robots.txt file does not guarantee that it won’t be indexed. In some cases, a URL may still be indexed based on factors such as the text of the URL itself, the anchor text used in external links, and the context of the external page where the URL was discovered.
How do I test my robots.txt file?
You can validate your robots.txt file and test how the rules apply to specific URLs using the robots.txt tester in Google Search Console or an external validator, like the one from Merkle.
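If you prefer a quick local check, Python’s standard library also ships with a basic robots.txt parser. The sketch below replays the example rules from earlier against a couple of made-up URLs; note that urllib.robotparser applies rules in the order they appear (first match wins) rather than Google’s longest-match logic, so the more specific Allow line is listed first here.

from urllib import robotparser

# Example rules from earlier in this article. The Allow line is listed first
# because Python's parser returns the first rule that matches a path.
rules = """\
User-agent: RanktrackerSiteAudit
Allow: /resources/images/
Disallow: /resources/
Crawl-delay: 2
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Hypothetical URLs used purely for illustration.
print(rp.can_fetch("RanktrackerSiteAudit", "https://example.com/resources/report.pdf"))      # False
print(rp.can_fetch("RanktrackerSiteAudit", "https://example.com/resources/images/logo.png")) # True
print(rp.crawl_delay("RanktrackerSiteAudit"))  # 2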