What the robots.txt file does

Posted by Carnildo on 24 June 2019 in English.

Disclaimer: I am not an OSM website developer. All information here was obtained by looking at the OSM GitHub repository and poking at the OSM website.

There’s been some controversy recently over the contents of the OpenStreetMap robots.txt file. I think it might be informative to look at what the file actually does.

Allow: /user/

This does nothing. “Allow” lines in a robots.txt file permit the crawling of URLs that would otherwise be denied, but there’s nothing in the file that would deny the /user hierarchy.
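As a rough check of that claim, here is a minimal sketch using Python's standard urllib.robotparser module. It compares a file containing only the Allow line against one with no rules at all. The helper name can_crawl and the example paths are mine, not anything from the OSM code, and a real crawler may parse things differently.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, path, agent="*"):
    """Parse the given robots.txt lines and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, path)

# One file with the Allow line, one without (both otherwise empty).
with_allow = ["User-agent: *", "Allow: /user/"]
without_allow = ["User-agent: *"]

for path in ["/user/Carnildo", "/about"]:
    # Both files give the same answer for every path: crawling is permitted.
    print(path, can_crawl(with_allow, path), can_crawl(without_allow, path))
```

With nothing denying /user/ in the first place, the answers are identical with or without the Allow line.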

Disallow: /traces/tag/
Disallow: /traces/page/

These are various alternate ways of searching the GPS traces that have been uploaded to the site. The main trace listing is still accessible.

Disallow: /trace/

This is the API endpoint for accessing GPS traces. It is not intended to be displayed in a web browser, and contains nothing useful for a search engine.
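The trace rules are plain prefix matches, which is why the main listing at /traces stays crawlable while everything under /trace/ is blocked. A small sketch with Python's standard urllib.robotparser illustrates this. The paths after the prefixes (a tag name, a page number, a trace ID) are made up for illustration, and can_crawl is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# The three trace-related rules quoted above, wrapped in a catch-all group.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /traces/tag/",
    "Disallow: /traces/page/",
    "Disallow: /trace/",
]

def can_crawl(path):
    """Return True if a generic crawler may fetch `path` under ROBOTS_LINES."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch("*", path)

# Illustrative paths only; the tag, page number and trace ID are invented.
for path in ["/traces", "/traces/tag/hiking", "/traces/page/2", "/trace/123/data"]:
    print(path, "allowed" if can_crawl(path) else "blocked")
```

Note that the trailing slash matters: "/traces" does not start with "/trace/", so the listing is not caught by the API rule.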

Disallow: /api/

This is the API endpoint for editing the map. It is not intended to be displayed in a web browser, and contains nothing useful for a search engine.

Disallow: /edit

This is the URL for the in-browser editor. Everything under this URL is behind a login barrier, and it contains nothing useful for a search engine.

Disallow: /message

This is the URL hierarchy for the on-site PM system. Everything under this URL is behind a login barrier, and it contains nothing useful for a search engine.

Disallow: /login

This is the above-mentioned login barrier. It contains nothing useful for a search engine.

Disallow: /history

This is the visual history browser. The contents change far too rapidly to meaningfully index on a search engine.

Disallow: /geocoder

This is the on-site search system. Search engines searching search engines never ends well.

Disallow: /browse

Disallow: /*lat=
Disallow: /*node=
Disallow: /*way=
Disallow: /*relation=

These are obsolete URL hierarchies for browsing individual map elements. The current URL hierarchy, with URLs of the form way/238241022, can be indexed by search engines.
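The * in these rules is a wildcard that stands for any run of characters, as described in RFC 9309 and supported by the major crawlers. Python's built-in robots.txt parser does not interpret it, so the sketch below uses a small hand-written matcher instead. The function name rule_matches and the query-string path are my own illustrations, not taken from any crawler's source.

```python
import re

def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Matching always starts at the first character of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# An old-style query-string URL (made up) is caught by the wildcard rule...
print(rule_matches("/*node=", "/?node=123"))
# ...and so is anything under the old /browse prefix...
print(rule_matches("/browse", "/browse/way/238241022"))
# ...but the current-style URL matches neither rule.
print(rule_matches("/*way=", "/way/238241022"))
print(rule_matches("/browse", "/way/238241022"))
```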

Disallow: /user/*/traces/
Disallow: /user/*/diary
Disallow: /diary

These are the only entries that block pieces of the site that might be of interest to a search engine. /user/*/traces/ covers the description pages for individual GPS traces, /user/*/diary covers individual diary entries, and /diary is the main diary listing.
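To see how these entries interact with the Allow: /user/ line, here is a sketch that assumes a crawler following the RFC 9309 precedence rule: the longest matching pattern wins, and an Allow wins a tie. Crawlers using other precedence rules may behave differently. The function names, the trace and diary IDs, and the rule list layout are all illustrative, and the trailing-$ anchor is not handled here.

```python
import re

# The /user/ and diary rules quoted in this post, as (directive, pattern) pairs.
RULES = [
    ("allow", "/user/"),
    ("disallow", "/user/*/traces/"),
    ("disallow", "/user/*/diary"),
    ("disallow", "/diary"),
]

def pattern_to_regex(pattern):
    """Turn a robots.txt pattern into a regex, treating '*' as 'any characters'."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))

def is_allowed(path, rules=RULES):
    """Apply longest-match precedence; paths matching no rule are allowed."""
    best = None  # (pattern length, is_allow) of the most specific match so far
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            # Longer patterns win; on equal length, allow (True) beats disallow.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Illustrative paths; the numeric IDs under /user/ are invented.
for path in ["/user/Carnildo", "/user/Carnildo/diary/123",
             "/user/Carnildo/traces/123", "/diary", "/way/238241022"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Under this rule the longer Disallow patterns take priority over the shorter Allow: /user/, so user profile pages remain crawlable while trace and diary pages are blocked.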

Discussion

Comment from Firefishy on 24 June 2019 at 23:38

Thank you for the write-up.

The /diary disallow is a recent temporary measure to mitigate against some of the spam we’ve recently had and will be removed in a few days.

Comment from freebeer on 29 June 2019 at 12:02

what i find interesting is that from my observations, the presence or absence of diary spam, given lack of knowledge of any other mitigating measures, seems to depend very much on this entry.

that is based on pulling up the generic overall diaries the past days at random insomniac times during nighttime (0-6h gmt), yet not seeing anything beyond the past day’s normal posts.

another thing bothering me is there is no mention of /note in the crawler file, but apart from a single apparent trial balloon matching one of the spams-of-the-time in diaries, in notes countries i visit, i haven’t noticed any flood of notes spam so apparently the diaries have something to offer the spammers that geographically-precise notes lack.

another thing off-topic is that it seems since a commit i’ve seen out-of-context, it appears Notes in en_GB locale (i think) at least, are getting a redundant ago ago, the latter giving precise details. as i am likely within less than 24 hours of another indefinite ‘net cutoff, rather than wasting my time better spent packing, i’d rather not register at github but prefer someone else with time verify my strings observation and do my homework. for me.

with that i bid thee so long, farewell, auf wiederseh’n, adieu, to you and you and you,
i leave and heave (BARF)
good-bye gooooood-byyyyyyyeeeeeee

mind the starters
