Before
This was originally posted by me on 6 May 2019. It is an attempt to document, in a scientific manner, the then-recent spam/abuse problem within these Diaries. The idea was to give the wider community a fuller grasp of the issues and to help answer some fundamental questions. After all, nobody knows everything. Here is my attempt to state some of those questions and to answer some of them:
Some Questions:
- Is this Spam, Abuse, or a mixture of the two?
(the motivations behind those two are completely different)
- Are they bot-posts, human-posts, or a mixture of the two?
(prevention mechanisms for those two differ)
- What are the best methods to prevent this from continuing?
(best practices have evolved over the last 20 years, but OSM is not your typical forum/blog)
Some Answers:
- There is a general, continual background issue in the Diaries of what appears to be human-posted spam/abuse of up to 10 posts/day. These range from nonsense posts to actual spam.
- Apr 22: a series of bot-posts began
(that fact determined by speed and duration)
- The bot-posts did not contain URL links
(thus technically not spam)
- Apr 25 → May 23: those bot-posts rapidly escalated to an average of 11,044 posts/day (min 1, max 29,340)
- May 14, 3am → 7am BST: bots hit these Diaries at a combined maximum rate of 66 posts/minute and averaged 27 posts/minute
- 18 May: an attempt to stop the bot flood was made by a broken change to robots.txt, intended to prevent Search Engines from indexing the Diaries
- 19 May: the broken change to robots.txt was fixed
- 20 May: the bot-flood abruptly ended (confirming it to be a spam-flood rather than abuse-flood), stuttered into life again on 22 May & finished for good on 23 May
- June 5: a broken link to the non-existent sitemap within robots.txt was fixed
- June 6, 17:07:41 GMT: the robots.txt 5 June fix was allocated to a 4-year-old issue (evidenced by the Last-Modified date obtained shortly after)
- 15 June, 23:56: two more changes ([1] [2]) to robots.txt re-opened the Diaries to indexing
- 20 June: the wfgz bots were sending thousands of posts to the Diaries again
- 23 June: the same robots.txt changes as before, made to stop Diary indexing, successfully stopped the bot-flood
- 2 July, 19:16:03 GMT: as before, the robots.txt changes that stopped Diary indexing were reverted
- 12 July: so far no repeat of the bot-flood. Hooray!
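For readers unfamiliar with robots.txt, here is a minimal sketch of how a rule that asks crawlers not to index a diary path can be checked with Python's standard library. The rule text and the example.org URLs are hypothetical and are not the actual changes made to the OSM robots.txt. Note also that robots.txt is only a request to well-behaved crawlers; it does not block posting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule for illustration only, not the real OSM robots.txt.
rules = """\
User-agent: *
Disallow: /diary
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A crawler that honours robots.txt would skip the diary path...
diary_allowed = parser.can_fetch("*", "https://example.org/diary/123")
# ...but could still fetch other pages.
other_allowed = parser.can_fetch("*", "https://example.org/about")

print(diary_allowed, other_allowed)
```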
The post of 6 May now continues verbatim.
Mention was made in my last diary and also in Sam Wilson’s diary about the large amounts of spam coming in to overwhelm these Diary pages. In good scientific manner here is a quantification of the issue, obtained by examining ID numbers for all recent surviving Diary posts.
Background
Diary post IDs are assigned serially. Thus, subtracting the actual number of surviving posts from the theoretical number of posts (the difference between consecutive End-IDs) gives a measure of how many spammer posts may have been removed.
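As a rough sketch of that arithmetic (the function name and argument layout are my own, chosen for illustration), the "Theory" and "Diff" columns can be derived from the End-IDs and the count of surviving posts like this:

```python
from typing import List, Tuple

def removed_per_day(end_ids: List[int], actual: List[int]) -> List[Tuple[int, int]]:
    """Return (theory, diff) for each day after the first.

    end_ids[i] is the last Diary ID seen on day i.
    actual[i] is the number of surviving posts on day i + 1.
    """
    rows = []
    for prev_id, end_id, seen in zip(end_ids, end_ids[1:], actual):
        theory = end_id - prev_id             # IDs handed out during the day
        rows.append((theory, theory - seen))  # diff = posts presumed removed
    return rows

# End-IDs for 12 to 15 Apr and surviving-post counts for 13 to 15 Apr,
# taken from the table in this post.
rows = removed_per_day([48187, 48193, 48195, 48202], [5, 1, 2])
print(rows)  # [(6, 1), (2, 1), (7, 5)]
```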
The Numbers
Date End-ID Posts(Actual) Posts(Theory) Posts(Diff)
12 Apr 48187 - (spam)
13 Apr 48193 5 6 1
14 Apr 48195 1 2 1
Mon 15 Apr 48202 2 7 5
16 Apr 48216 5 14 9
17 Apr 48223 2 7 5
18 Apr 48234 5 11 6
19 Apr 48242 5 8 3
20 Apr 48252 8 10 2
21 Apr 48255 2 3 1
Mon 22 Apr 48287 4 32 28
23 Apr 48378 12 91 79
24 Apr 48385 1 7 6
25 Apr 56488 6 8,103 8,097
26 Apr 74643 8 18,155 18,147
27 Apr 99519 2 24,876 24,874
28 Apr 128866 7 29,347 29,340
Mon 29 Apr 140684 3 11,818 11,815
30 Apr 149349 4 8,665 8,661
1 May 152912 13 3,563 3,550
2 May 156826 8 3,914 3,906
3 May 158835 2 2,009 2,007
4 May 158837 1 2 1
5 May 172694 6 13,857 13,851
Mon 6 May 193238 6 20,544 20,538
7 May 210953 2 17,715 17,713
8 May 218281 4 7,328 7,324
9 May 240069 2 21,788 21,786
10 May 256019 7 15,950 15,943
11 May 270022 1 14,003 14,002
12 May 275013 8 4,991 4,983
Mon 13 May 276830 2 1,817 1,815
14 May 283239 2 6,409 6,407
15 May 291589 2 8,350 8,348
16 May 296320 1 4,731 4,730
17 May 318162 6 21,842 21,836
18 May 339272 2 21,110 21,108
19 May 347443 2 8,171 8,169
Mon 20 May 364479 3 17,036 17,033
21 May 364493 7 14 7
22 May 364971 4 479 475
23 May 368657 4 3,686 3,682
24 May 368669 8 12 4
25 May 368675 3 6 3
26 May 368682 4 7 3
Mon 27 May 368691 3 9 6
28 May 368702 3 11 8
29 May 368711 2 9 7
30 May 368716 2 5 3
31 May 368725 7 9 2
1 Jun 368726 1 1 0
2 Jun 368734 2 8 6
Mon 3 Jun 368750 7 16 9
4 Jun 368753 2 3 1
5 Jun 368757 2 4 2
6 Jun 0 0 0
7 Jun 368766 2 9 7
8 Jun 368773 3 7 4
Between 25 Apr & 23 May (29 days):
------------------------- -------
Total : 124 320,272
Daily : 4 11,044
------------------------- -------
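The theoretical total and daily average can be reproduced from just two End-IDs in the table. This is only a sketch of the arithmetic; the variable names are mine.

```python
start_id = 48385   # End-ID on 24 Apr, the day before the flood escalated
end_id = 368657    # End-ID on 23 May, the last day of the flood
days = 29          # 25 Apr to 23 May inclusive

total = end_id - start_id     # theoretical number of posts in the period
daily = round(total / days)   # theoretical posts per day

print(total, daily)  # 320272 11044
```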
Update 8 May:
01:52am BST: I dropped in on the 1st of tonight’s spammers:
Title: translation of ID=210955: Being vomited and vomiting frlse
Text : 苟颜德缕uwrfh 苟颜德缕uwrfh 苟颜德缕uwrfh..wfgz
09:37am BST Sunday 12 May:
The latest spammer is /user/twuptyoe378/diary/274627 (removed)
The first spammer:
Title: translation of ID=270023: Vomiting
Text : 07633abawl
Update 14 May:
The first 3 posts shortly after 1am BST were the now-classic Bengali (bn) wfgz spam. Here is the very first:
Title: (ID=276831): 暮铣德娜侗cjenp
Text : 肆考韭缕节oqgwr肆考韭缕节oqgwr肆考韭缕节oqgwr..wfgz
After 90 minutes we began to get some Chinese (zh-CN) vip spam, which continued until shortly before 20:42 BST. Once again, here is the very first:
Title: (ID=282971): 北京幸运28官方网站
Text : 北京幸运28官方网站 【导师微信:<redacted>】【网址<redacted>.vip 】【加拿大28稳赢法】…
Update 15 May to discover spam stats:
On Monday I set up a cron job to save the current Diary top page every 10 minutes from 01:00 BST until 10:00 BST. I investigated the results today using egrep and tabulated the listing below.
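I used egrep for this, but the same extraction could be sketched in Python. The example below assumes that each saved page contains diary links of the form /user/<name>/diary/<id> (the pattern seen in the URLs quoted in this post) and that the newest post has the highest ID. The function name and the sample HTML are hypothetical.

```python
import re
from typing import Optional

# Assumed link pattern: /user/<name>/diary/<id>
DIARY_LINK = re.compile(r"/user/[^/\"'\s]+/diary/(\d+)")

def newest_diary_id(html: str) -> Optional[int]:
    """Return the highest diary ID linked from a saved page, or None."""
    ids = [int(found) for found in DIARY_LINK.findall(html)]
    return max(ids) if ids else None

# Hypothetical fragment of a saved snapshot.
sample = (
    '<a href="/user/example_a/diary/282938">older</a>'
    '<a href="/user/example_b/diary/282939">newer</a>'
)
print(newest_diary_id(sample))  # 282939
```

Running this over each 10-minute snapshot in turn would give the "1st Post" column of the listing.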
These are the rates at which the wfgz spammers dropped their spam into these Diaries over the night of 14 May. They hit a maximum rate of 66 posts/minute (659 posts in 10 minutes) and averaged 27 posts/minute (273 posts per 10 minutes). The spam began shortly after 3am BST and stopped (presumably due to the intervention of OSM admins) at about 7am. As best I can tell, all of the posts made between those times were from these wfgz spammers.
Date 1st Post Posts
------------ ------- -----
May 14 07:10 279688 -
May 14 07:00 282939 66
May 14 06:50 282873 424
May 14 06:40 282449 119
May 14 06:30 282330 167
May 14 06:20 282163 234
May 14 06:10 281929 426
May 14 06:00 281503 231
May 14 05:50 281272 129
May 14 05:40 281143 182
May 14 05:30 280961 292
May 14 05:20 280669 237
May 14 05:10 280432 659
May 14 05:00 279773 130
May 14 04:50 279643 190
May 14 04:40 279453 120
May 14 04:30 279333 321
May 14 04:20 279012 186
May 14 04:10 278826 347
May 14 04:00 278479 466
May 14 03:50 278013 223
May 14 03:40 277790 244
May 14 03:30 277546 342
May 14 03:20 277204 342
May 14 03:10 276862 28
May 14 03:00 276834 0
May 14 02:50 276834 1
May 14 02:40 276833 0
------------ ------- -----
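minimum: 119
maximum: 659
average: 273
(posts per 10 minutes; the low-count rows at the start and end of the flood, 03:10 and earlier and 07:00, are not counted)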
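The summary figures, and the per-minute rates quoted earlier, can be recomputed from the snapshot IDs in the listing. This is a sketch under the assumption that every ID issued between two snapshots was a wfgz post; the variable names are mine.

```python
from statistics import mean

# "1st Post" IDs from the 03:10 to 06:50 snapshots, oldest first.
snapshot_ids = [
    276862, 277204, 277546, 277790, 278013, 278479, 278826, 279012,
    279333, 279453, 279643, 279773, 280432, 280669, 280961, 281143,
    281272, 281503, 281929, 282163, 282330, 282449, 282873,
]
INTERVAL_MINUTES = 10

# Posts in each interval = difference between consecutive snapshot IDs.
counts = [later - earlier for earlier, later in zip(snapshot_ids, snapshot_ids[1:])]

per_interval = (min(counts), max(counts), round(mean(counts)))
per_minute = (
    round(max(counts) / INTERVAL_MINUTES),
    round(mean(counts) / INTERVAL_MINUTES),
)

print(per_interval)  # (119, 659, 273)
print(per_minute)    # (66, 27)
```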
Discussion