Intro
The Robots Exclusion Protocol (REP) is a standard webmasters use to instruct robots. The instructions tell robots how to crawl and index a website's pages. REP is often referred to simply as robots.txt, after the file that carries the directives. To be honored, the file must be placed in the top level of the web server directory. For example: https://www.123abc.com/robots.txt
REP serves as a de facto web standard that regulates bot behavior and search engine indexing. The original REP, which defined bot behavior through robots.txt, was developed between 1994 and 1997. In 1996, search engines added support for indexer directives in the form of REP meta tags, later complemented by the X-Robots-Tag HTTP header. Links whose rel value contains "nofollow" are handled through the rel-nofollow microformat.
Robots.txt Cheat Sheet
To Block all web crawlers from all content
User-agent: *
Disallow: /
To Block a specific web crawler from a target folder
User-agent: Googlebot
Disallow: /no-google/
To Block a specific web crawler from a target web page
User-agent: Googlebot
Disallow: /no-google/blocked-page.html
To Allow all web crawlers access to all content
User-agent: *
Disallow:
To Declare a sitemap located at a non-standard location
Sitemap: https://www.123abc.com/none-standard-location/sitemap.xml
Specific Robots Exclusion Protocol Tags
REP tags are applied to a URI through META elements or X-Robots-Tag HTTP headers and steer particular indexer tasks; in some cases, such as nosnippet, noarchive, and noodp, they also apply when the engine answers a search query. Beyond the basic crawler directives, specific search engines interpret these REP tags differently. For example, Bing will sometimes list external references to such forbidden URLs on its SERPs, while Google wipes out the URL-only and ODP references from its listings. The assumption is that X-Robots-Tag directives would overrule conflicting directives in META elements.
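As a brief sketch, the same REP tags can be delivered either in the page markup or as an HTTP response header; the values shown here are only illustrative:
In the page's head: <meta name="robots" content="noarchive, nosnippet">
In the HTTP response: X-Robots-Tag: noarchive, nosnippet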
Microformats
Micro-formatted index directives on particular HTML elements will overrule conflicting page-level settings. For example, if a page's X-Robots-Tag says "follow," a rel-nofollow directive on an individual link still applies to that link. This method of programming requires skill and a keen grasp of web servers and the HTTP protocol. Robots.txt itself lacks indexer directives, but it is possible to set indexer directives for groups of URIs with server-side scripts acting at the site level.
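One common way to do this is to have the web server attach an X-Robots-Tag header to a whole group of URIs. The snippet below is a minimal sketch assuming an Apache server with mod_headers enabled; the file pattern and values are only examples:
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>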
Pattern Matching
Webmasters can use two wildcard characters to express pattern matching in exclusion rules: the asterisk and the dollar sign. The asterisk (*) matches any sequence of characters, and the dollar sign ($) marks the end of a URL.
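For example, the following illustrative rules use both wildcards; the paths are assumptions, not part of the cheat sheet above:
User-agent: *
# Block any URL that ends in .pdf
Disallow: /*.pdf$
# Block any URL whose path begins with /private
Disallow: /private*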
Unrestricted Information
Robots.txt files are always public, so it is important to be aware that anyone can view the robots file attached to a website and see exactly which areas of the server the webmaster has blocked engines from. Listing those areas in a public file can point visitors toward locations holding private data, including personal information. To keep visitors from viewing confidential pages that should not be indexed, add password protection rather than relying on the robots file alone.
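As a rough sketch of that password protection, assuming an Apache server and a hypothetical .htpasswd file, a directory can be protected with HTTP Basic authentication:
AuthType Basic
AuthName "Restricted content"
AuthUserFile /path/to/.htpasswd
Require valid-user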
Additional Rules
- Simple meta robots parameters such as noindex and nofollow should be used when the goal is to prevent page indexing and crawling; the default index and follow values do not need to be declared.
- Malicious bots will almost certainly ignore these commands, so they are useless as a security measure.
- Only one path is allowed per "Disallow:" line (see the example after this list).
- Each subdomain requires its own separate robots.txt file.
- The robots.txt filename and the paths it lists are case-sensitive.
- Spacing is not an accepted way to separate query parameters in a directive.
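As an illustrative sketch of the one-path-per-line rule, the paths below are assumptions used only for this example:
User-agent: *
# Correct: one path per Disallow line
Disallow: /private/
Disallow: /tmp/
# Not honored: multiple paths on a single line
# Disallow: /private/ /tmp/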
Top SEO Tactics: Robots.txt
Page Blocking – there are several ways to prevent a search engine from accessing and indexing a web page or domain.
Using Robots.txt to Block Pages
This exclusion tells the search engine not to crawl the page, but the engine may still index the page and show it in SERP listings, for example when other pages link to it.
Noindex Page Blocking
This method of exclusion tells search engines they are allowed to visit the page, but they are not allowed to display the URL in results or keep the page in their index. This is the preferred method of exclusion.
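As a brief illustration, the noindex directive is typically placed in the page's head; the value shown is only an example:
<meta name="robots" content="noindex">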
Using Nofollow Links to Block Pages
This is not a supported tactic. Search engines can still reach pages despite this attribute: even if an engine does not follow a particular nofollow link, it can still reach the content through browser analytics or through links on other pages.
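For reference, a nofollow link looks like the following; the target URL reuses the example site above and the anchor text is illustrative:
<a href="https://www.123abc.com/no-google/blocked-page.html" rel="nofollow">Blocked page</a>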
Meta Robots vs. Robots.txt
An example of a website's robots.txt file can help clarify the difference. In the example, the robots.txt file blocks a directory, yet when that directory is searched for in Google, the engine still reports 2,760 pages from the disallowed directory. Because the engine has not crawled those URLs, they appear as bare listings rather than traditional ones. Once other pages link to them, these URLs accumulate link juice, and they also begin to gain popularity and trust from appearing in searches, but none of that ranking power benefits the site because the pages are not being crawled. The best way to fix the problem, and avoid wasting ranking power on those pages, is to use another method of exclusion to remove the individual pages: the meta robots tag described above, which performs better than blocking through robots.txt.
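A minimal sketch of that fix, with a hypothetical directory name: the directory-wide rule would be removed from robots.txt, and each page inside the directory would instead carry a meta robots tag in its head. The follow value is a common choice here so link equity can still flow, though noindex alone also prevents the listing:
Removed from robots.txt:
# Disallow: /directory/
Added to each page in the directory:
<meta name="robots" content="noindex, follow">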