How to Fight Guestbook and Comment Spam

Preface

Over the last few years the internet has changed and is nowadays a very important part of almost everybody's life. E-commerce is still growing rapidly, and with it the competition among traders. But since there are far fewer controls and regulations than in the 'real world', some virtual advertising 'strategies' are much more intrusive than their real-life counterparts. One of these 'strategies' is email spam: web crawlers search websites for (valid) email addresses and send lots of spam to them, hoping to make money. Sending email is free, so it carries almost no risk for the sender, especially if zombie PCs are used for that business.

But there are further possibilities, especially nowadays when well-working spam filters exist. A legitimate approach is called "Search Engine Optimization (SEO)" and is usually performed using legal methods. Dubious methods, however, should rather be called "Search Engine Manipulation", the major branch being guestbook spam, more often called comment spam. The interactive parts of other people's websites (like comment forms) are used to create links to one's own website, resulting in a higher search engine rank. It may be a good idea to briefly outline the main principle of modern search engines: webpages with many links pointing to them are considered more important than webpages with fewer links. This is a very coarse simplification, but the main motivation for guestbook/comment spam should be clear now.

Spamming comment modules or guestbooks is not done by individuals, since this would be far too expensive; instead, spam bots are used. These bots are usually simple programs that navigate through websites and fill out forms. Sometimes they even call the processing script on the webserver directly, using previously collected information about the corresponding variables. Such bots are able to send thousands of spam entries per minute! Depending on how popular your webpage is (and, of course, how well you hide your comment module or guestbook), such spam is either just a nuisance or renders your webpage useless and worthless! For this reason I took some time and collected techniques for the prevention of such spam. Some code examples are given in PHP; other server-side languages are possible as well. But first of all a note:

I do not take any responsibility for the presented code samples, nor do I give any guarantee! They are to be understood as code outlines; you have to do a proper implementation on your own!

Techniques for fighting guestbook and comment spam

First of all we have to clarify what our goal is and how it can be achieved. For our purposes, the main goal is to distinguish between proper entries from human visitors and unwanted entries from spam bots. Consequently, we have to find differences between a human and a spam bot! We have to think about those differences; every aspect counts, no matter whether it is technical, sociological or anything else. I came up with these differences:

  1. A bot is unable to understand a text, the most it can do is recognise keywords, while a human understands a text in its context
  2. A bot 'types' much faster than a human
  3. A bot does not interpret pictures
  4. The bot's sole purpose is to drop links to other webpages
  5. Humans are intelligent
  6. Humans use browsers
  7. Humans do not read and interpret HTML-tags

The above points can be interpreted more tightly or more loosely; however, I do not want people to split hairs over whether e.g. point 5 is always fulfilled or not.

We have now transformed the abstract problem "prevent spam" into the less abstract task "distinguish between humans and bots" by means of distinctive features. Unfortunately these criteria are somewhat fuzzy: humans might behave like bots in some situations and vice versa. A point score will hopefully put things right.

Now that the problem (the task) is analysed, we can move on to create useful filters. They differ widely in their detection reliability and in how much they restrict (or even exclude) visitors, therefore I will sort them into categories by the amount of users they exclude. This decision might seem like overkill right now, but since various techniques are usually combined, each of them excluding a different group of users, you might otherwise end up with a high percentage of visitors excluded for some reason.

Techniques without exclusion of visitors

The following techniques do not exclude any visitors, no matter which browser they use or which disabilities they might have (e.g. visual impairment). What is more, visitors will usually not even notice the presence of the bot detection and are hence not discouraged or put off.

Timestamp - Unmask quick bots

A very simple and extremely powerful method is to check the time between the page request and the subsequent submission of the new entry. This can be done on the server side by including a hidden text field in the form. A human visitor needs at least five seconds to compose an entry, while a bot usually sends its entry within those five seconds. Depending on your webpage you can easily adjust this timeslot: for pages with lots of content you can set the period to 30 seconds, while a page containing only the form allows e.g. five or ten seconds.

An implementation in PHP might look like this: the script compose_entry.php provides the form, and process_entry.php inserts the submitted entry into a database (or stores it somewhere else):

   //////// compose_entry.php ////////
   ...
   <?php 
     echo "<input type=\"hidden\" name=\"timestamp\" value=\"" . time() ."\">";
   ?>
   ...
   //////// process_entry.php ////////
   ...
   <?php 
     if (time() - (int)$_POST["timestamp"] > 5){
        //process entry
     } else {
        //spam alert!
     }
   ?>
   ...

You could even encrypt the timestamp in a simple way; in my experience, spam bots do not fake timestamps (... yet). One more thing I would like to add: some spam bots will bookmark 'process_entry.php' and send new spam every day. But since they do not update the timestamp, one can easily filter such entries by applying an upper bound to the time difference.
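One way to make the timestamp tamper-proof is sketched below: instead of a simple encryption, the compose page signs the timestamp with a secret known only to the server, and the processing page verifies both the signature and the lower and upper bound on the time difference. The function names and the secret are my own choices for illustration; this is a code outline, not a finished implementation.

```php
<?php
// Hypothetical helpers: sign the timestamp on the compose page and
// verify it on the processing page. $secret is a server-side value
// that never appears in the HTML.
function sign_timestamp($ts, $secret) {
    return $ts . ':' . hash_hmac('sha256', (string)$ts, $secret);
}

function check_timestamp($signed, $secret, $min = 5, $max = 3600) {
    $parts = explode(':', (string)$signed, 2);
    if (count($parts) !== 2) {
        return false; // malformed or missing field
    }
    list($ts, $mac) = $parts;
    if (!hash_equals(hash_hmac('sha256', $ts, $secret), $mac)) {
        return false; // timestamp was tampered with
    }
    $age = time() - (int)$ts;
    // too fast => bot; too old => bookmarked processing script
    return $age >= $min && $age <= $max;
}
```

compose_entry.php would then echo sign_timestamp(time(), $secret) into the hidden field, and process_entry.php would call check_timestamp($_POST["timestamp"], $secret).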

Worthless - Turn your website worthless for spam bots

A widespread hint is to forbid search engines to index linked sites. This can be done by pasting an HTML meta tag into your HTML document (cf. the HTML Author's Guide to the Robots META tag):

   <!-- prevent search engines from indexing: -->
   <meta name="robots" content="noindex">
   <!-- forbid them to follow links: -->
   <meta name="robots" content="nofollow">
   <!-- combination of the previous instructions: -->
   <meta name="robots" content="noindex, nofollow">

For spam bots a simple nofollow instruction should be enough, but a search engine would still index the entry itself. On the other hand, a spam bot does not care about indexing, because there is still a possibility that human users follow the link. For separate guestbook areas, preventing indexing might still be a solution, but for pages with content this is of no use, because the content will no longer be listed in search engines, rendering your page almost worthless.
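If you do not want to apply the nofollow instruction to the whole page, the processing script can instead devalue only the posted links themselves by inserting a rel="nofollow" attribute into every anchor tag of an entry. The sketch below is my own addition (the function name is made up); the regular expression is deliberately simple and assumes reasonably well-formed HTML.

```php
<?php
// Add rel="nofollow" to every posted link so that search engines
// do not count it, devaluing the entry for the spammer.
function nofollow_links($entry) {
    return preg_replace('/<a\s+href=/i', '<a rel="nofollow" href=', $entry);
}
```

A bot that parses the stored entries could detect this, but in practice the link simply stops paying off for the spammer.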

Bad words - This is not what this site is about...

Most of the time, spam entries differ significantly from entries made by humans. Words like 'tramadol', 'viagra' or 'phentermine' in particular are useful indicators for spam. And never forget the main motivation for spam entries: dropping links. This can only be accomplished by using the appropriate HTML tag starting with "<a href=". Such strings can be counted, and once a particular threshold is exceeded, the entry is classified as spam! But don't be too aggressive: a single occurrence of a 'bad word' is not a reliable indicator; it could, for example, be part of a joke posted by a human! Apart from that, the spam probabilities of these words differ. While 'phentermine' is a very good indicator for spam, 'cheap' is not a reliable spam indicator, although it still appears in spam entries very often. Taking this into account, one should assign different weights (i.e. different point scores) to such words. A simple spam filter might then look like this:

   <?php 
     $entry = $_POST["entry"];
     $points = 0;
     $points = $points + 1 * substr_count(strtolower($entry), 'viagra');
     $points = $points + 2 * substr_count(strtolower($entry), 'phentermine');
     $points = $points + 2 * substr_count(strtolower($entry), 'tramadol');
     $points = $points + 3 * substr_count(strtolower($entry), '<a href=');
      
     if ($points < 5){
        //process entry
     } else {
        //spam alert
     }
   ?>

This example assumes that 'viagra' is more likely to occur in human entries (for example in a joke) and therefore has a lower weight. Posting a valid HTML link, on the contrary, is punished severely, hence the higher point weight. The same procedure can be applied to the name and email fields in your form by adding up points, although different weights should be used there, since 'viagra' in a URL is an almost idiot-proof indicator for spam! The threshold value has to be adjusted by experience with respect to your weights; you should experiment with it.
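The weights can also be kept in a single table so that the same scoring routine works for every form field. A sketch of this variant (the helper name and the concrete weights are my own choices and should be tuned to your site):

```php
<?php
// Hypothetical helper: higher weight = stronger spam indicator.
function spam_score($text, $weights) {
    $score = 0;
    foreach ($weights as $needle => $weight) {
        $score += $weight * substr_count(strtolower($text), $needle);
    }
    return $score;
}

// Different weight tables for different fields:
$entry_weights = array('viagra' => 1, 'phentermine' => 2, 'tramadol' => 2, '<a href=' => 3);
$url_weights   = array('viagra' => 5, 'phentermine' => 5); // much stricter for the homepage field
```

The processing script would then compute something like $points = spam_score($_POST["entry"], $entry_weights) + spam_score($_POST["homepage"], $url_weights); and compare the total against the threshold.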

Obfuscation - Fool the bot

Bots usually determine the meaning of text fields by their name (the HTML attribute), since it usually reflects the meaning of the resulting variable. These variables are used in the processing scripts. A human visitor, on the other hand, determines the meaning of an input field from the context it appears in. This tiny difference can be used to identify bots! Consider the following HTML form:

   <!-- bad variant of a guestbook form -->
   ...
   <form action="script.php" method="post">
   Name: <input name="Name" type="text"><br>
   Email: <input name="Email" type="text"><br>
   Homepage: <input name="Homepage" type="text"><br>
   Entry: <textarea name="Entry" cols="50" rows="10"></textarea>
   <input type="submit" value="Submit"> 
   </form>
   ...

From a bot's point of view this form is nice: the names of the input fields state their exact meanings! A human visitor, however, interprets the labels "Name:", "Email:" and "Entry:".

Now let us improve the form above. First of all we can replace the "name" values with meaningless strings. A bot will then try to guess, and it might be successful because the order of such input fields is usually the same. A clever bot would even evaluate the strings around the input fields if the "name" attribute is of no help.

Even better is the following method: swap the field names! The field name "homepage" in particular acts like a honeypot for bots; they can place their desired links there. But since the purpose of that field is different, you can easily detect the bot! In my view it is even more effective to swap the names of the email field and the homepage field only! The HTML code for such a form would then look like this:

   <!-- improved variant of a guestbook form -->
   ...
   <form action="script.php" method="post">
   Name: <input name="Name" type="text"><br>
   Email: <input name="Homepage" type="text"><br>
   Homepage: <input name="Email" type="text"><br>
   Entry: <textarea name="Entry" cols="50" rows="10"></textarea>
   <input type="submit" value="Submit"> 
   </form>
   ...

The processing script has to check two simple indicators: if the field for the homepage contains an email address (a simple check for the character '@' should be sufficient), or vice versa, then the entry has to be rejected. If you are not using a homepage field then you can still play this game by swapping the name and email fields, but then you have to check the email address more carefully, since some usernames tend to contain lots of special characters.
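A sketch of such a check in the processing script, assuming the swapped form above (so the field *named* "Homepage" is the one labelled "Email", and vice versa; the function name is made up):

```php
<?php
// With the swapped form, a human fills in fields by their labels,
// a bot by their names. Crude checks are enough to tell them apart:
function swapped_fields_ok($post) {
    if (strpos(strtolower($post['Homepage']), 'http') !== false) {
        return false; // bot pasted a URL where humans type their email
    }
    if (strpos($post['Email'], '@') !== false) {
        return false; // bot typed an email where humans put their homepage
    }
    return true;
}
```

The processing script simply calls swapped_fields_ok($_POST) and treats a false result as a spam alert.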

To make it even harder for the bot, one could use textareas only and swap the name of the textarea dedicated to the entry as well. Especially in complex layouts, a bot will not be able to infer the meaning of a field from its surroundings and has to rely on the textarea's name. You could even provide a homepage field (even if you don't use it) and set its type to hidden. Some bots will still paste their links into it, but no human will, since the field is not visible to them!

There is a small drawback to this method: the automatic form completion provided by the browser will be confused, since form data is stored together with the input field's name. Anyway, I do not think that this would prevent human visitors from writing their entries...

Moderated guestbook or comment module

Probably the best method of keeping spam off your site is to moderate your forms. For every new entry the webmaster receives an email asking him to approve the entry. However, the webmaster is forced to read every single entry and then decide whether it is spam or not. On the other hand, it is very annoying for the poster if his or her entry does not appear on the website immediately. Still, this method can be useful for entries that passed some filters but are still uncertain to be spam or not.

Bayes filter

Bayes filters do pretty well on emails, so why not use them against guestbook or comment spam? Well, the drawback of a Bayes filter is that it has to be trained with 'ham' and 'spam', i.e. 'good' and 'bad' entries. Especially on small sites there are hardly any human entries, but lots of spam entries. In this case you simply have no chance to train your filter! But don't worry: the score system explained above is in some sense a static Bayes filter with you as the trainer! ;-)

Techniques with exclusion of hardly any visitors

In this section I will explain techniques that are no longer fully transparent to all visitors. They will either be noticed by human visitors (which might discourage them) or exclude visitors with very exotic (i.e. unusual) browser configurations. However, the portion of discouraged or excluded visitors is still very low.

Traces - Where does the poster come from?

It is remarkable that human visitors hardly ever type the URL of a guestbook form directly into the browser's address bar. A human visitor will add a page with content to his or her favorites, but never a page containing only the form for a guestbook entry! Hence a user starts at a content-rich page (this is a necessary assumption that might exclude very few visitors), then navigates to the page with the form (provided that the form is located on a separate page) and then submits the new entry. In contrast, some spam bots go directly to the form and drop their entry. This browsing history can be tracked on the server side; PHP provides a powerful and easy-to-use session facility.

Unfortunately this method is only useful for websites with a separate page for composing entries; it is of no help for comment forms at the end of an article. Let us assume three pages: guestbook_display.php, guestbook_compose.php and guestbook_process.php. The first one displays all guestbook entries, the second one shows the form for a new entry and the third one stores the new entry in a database. The code then looks as follows:

   //////// guestbook_display.php ////////
   <?php 
     session_start();
     $_SESSION['visited'] = true;
     /* Please have considerations for browsers that do not accept cookies,
     therefore the SessionID should be appended to links */
   ?>
   ...
   //////// guestbook_compose.php ////////
   <?php 
     session_start();
     /* You can store a timestamp here as well */
   ?>
   ...
   //////// guestbook_process.php ////////
   <?php 
     session_start();
     if (isset($_SESSION['visited']) && $_SESSION['visited'] === true){
         //process entry     
     } else {
         //spam alert!     
     }
   ?>
   ...

As noted in the comment above, the session ID should be appended to all links (or at least to links to guestbook_compose.php), otherwise visitors whose browsers do not accept cookies would be excluded. Further details about passing the session ID on can be found in the PHP manual. A simpler variant is to pass a time-dependent value (e.g. the timestamp) from guestbook_display.php to guestbook_compose.php and then submit it together with all the other variables to guestbook_process.php. Although it is tempting to use the HTTP referrer, I DO NOT recommend it, because there is no guarantee that the referrer is set.
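The simpler timestamp variant could be sketched like this: guestbook_display.php appends time() to the link pointing to guestbook_compose.php, the compose page passes the value on in a hidden field, and guestbook_process.php checks that the value exists and is plausible. The helper name and the bounds below are my own choices:

```php
<?php
// guestbook_display.php: echo '<a href="guestbook_compose.php?t=' . time() . '">...</a>';
// guestbook_compose.php: echo '<input type="hidden" name="t" value="' . (int)$_GET['t'] . '">';
// guestbook_process.php: the value must be present, not too fresh, not too stale.
function came_via_display_page($t, $now, $min = 5, $max = 3600) {
    $age = $now - (int)$t;
    return $age >= $min && $age <= $max; // a missing or stale value fails
}
```

guestbook_process.php would then call came_via_display_page($_POST['t'], time()) and treat a false result as a spam alert.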

We are not evil! - Simply ask the visitor

Another simple but effective method is to include a checkbox in the form. The checkbox is initially unchecked, and a text next to it states "I am not a spam-bot" or something similar. A human user will check the box; a spam bot will not be aware of it and simply 'forget' to check it. This method can be extended to two or more such questions, but one has to take into account that a spam bot will either check all of them or none, so it is a good idea to expect some boxes to be checked and others not. In order not to discourage visitors with a huge number of checkboxes, one could present them as a challenge once the visitor submits his or her entry. However, this must then be highlighted well, otherwise the visitor will simply overlook the check. Whether you subject your visitors to this kind of quiz game or not is your choice...
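A minimal sketch of the server-side check, assuming one visible box that must be ticked and one decoy box that must stay empty (the field names are made up; remember that unchecked boxes are simply absent from the POST data):

```php
<?php
// One box must be ticked, a second (decoy) box must stay empty.
// A bot either ticks everything or nothing, so both cases fail.
function checkbox_challenge_ok($post) {
    $human_ticked = isset($post['i_am_human']);   // "I am not a spam-bot"
    $decoy_ticked = isset($post['tick_me_too']);  // decoy the human leaves alone
    return $human_ticked && !$decoy_ticked;
}
```

The processing script would call checkbox_challenge_ok($_POST) before storing the entry.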

Smart? - Test visitor's intelligence

Similar to the previous method is the idea of a simple arithmetic problem that needs to be solved by the visitor. The precomputed solution could be stored in a hidden field, but then a bot would only have to copy the solution from the hidden field; a smarter way is to multiply the solution by a fixed factor before storing it there. The code for this 'game' could be:

   <?php 

     $random1 = rand(1,10);
     $random2 = rand(1,10);

     echo "Solution of " . $random1 . " plus " .$random2 . ": ";
     echo "<input type=\"text\" name=\"challenge\">";
     echo "<input type=\"hidden\" name=\"solution\" value=\"" . 4 * ($random1 + $random2) . "\">";
   ?>
   ...

A check in the processing script is simple:

   <?php 
    if ( 4 * (int)$_POST["challenge"] == (int)$_POST["solution"] ){
      //process entry
   } else {
      //spam alert
   }
   ?>
   ...

Again you have to weigh up whether or not you want such challenges in your forms. As before, you could still present it as an intermediate step after the visitor has submitted his or her entry, but do not forget to highlight this challenge appropriately.

Further indicators for spam bots

The previously explained point score can be extended by further indicators, which, however, might hit users with exotic (i.e. unusual) browser configurations as well.

First, one can check the visitor's browser. The HTTP request usually contains a browser string, for example "Mozilla/5.0 (compatible; Konqueror/3.2; Linux 2.6.2) (KHTML, like Gecko)". The absence of such a browser string can indicate a spam bot; however, empirical checks on my own webserver have shown that the browser string is usually set to a meaningful value.
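If you still want to use the browser string, it is safer to treat its absence as one indicator among many rather than rejecting the request outright. A sketch (the helper name and the weight of 2 points are arbitrary choices of mine):

```php
<?php
// Missing or empty User-Agent header: contribute to the point score
// instead of rejecting, since a few legitimate clients omit it.
function ua_score($user_agent) {
    return (trim((string)$user_agent) === '') ? 2 : 0;
}
```

The processing script would add ua_score($_SERVER['HTTP_USER_AGENT']) to the total point score.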

Much more interesting is the HTTP version. Many spam bots send HTTP/1.0 requests, while modern browsers (in fact, almost all non-text browsers) use HTTP/1.1. Therefore it is reasonable to use this as an indicator for the point score. The code would look like this:

   <?php 
    if ( strcasecmp($_SERVER['SERVER_PROTOCOL'], "HTTP/1.0") == 0 ){
      //increase point score
   }
   ?>
   ...

Techniques with exclusion of a non-neglectable amount of visitors

This section covers techniques that eliminate many spam entries at the price of excluding some visitor groups, whose portion can easily exceed one percent. Depending on the target audience of your website, you may or may not be able to afford such an exclusion.

Captchas - Hidden passphrases in pictures

Captchas are small pictures that usually contain distorted letters which are (rather) easily readable for humans, but very hard to decode for computers. Further descriptions can be found at Captcha at Wikipedia.

Captchas restrict users for various reasons: individuals with visual impairments cannot decode the image, and text browsers are completely excluded. Even visitors with browsers configured to hide images would struggle.

The reason why Captchas are often presented as the holy grail against guestbook/comment spam is simple: you can earn lots of money with them! You can find lots of commercial Captcha offers on the internet, but there are many free scripts that generate Captchas as well.

Invisible - Do we see the same?

Since spam bots interpret the structure of an HTML document, but not its visual appearance, we can create indicators that exploit this particular difference. We can inject additional fields into a form but prevent them from being displayed. A human visitor will not notice them, while a spam bot sees them as part of the form and fills them out, especially if a field's name is something like 'homepage'. An example looks as follows:

   <!-- An input field with zero size -->
   <input type="text" name="test" style="height: 0px; width: 0px;">
   <!-- An input field outside the visible area -->
   <input type="text" name="test" style="position: absolute; left: -1000px;">

The drawback of this method is that visitors with text browsers are more or less excluded, because they would (could) still see the input fields and might fill them out. Even older graphical browsers might cause trouble because they do not interpret CSS. However, if you have excluded visitors with text browsers or poorly equipped graphical browsers from the very beginning (for example because you are hosting a multimedia site or photo albums), this method is very suitable for you! An improvement of the previous example is to put all CSS commands into a separate file, because then the spam bot has to request and parse the whole CSS file to reveal our trick.
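The server-side half of this trick is a simple check in the processing script: the invisible field (named 'test' in the example above) must come back empty, otherwise a bot filled it out. A sketch (the helper name is made up):

```php
<?php
// Honeypot check: the invisible field must come back empty.
// A missing field also counts as empty (the form may vary).
function honeypot_empty($post, $field = 'test') {
    return !isset($post[$field]) || trim($post[$field]) === '';
}
```

The processing script would call honeypot_empty($_POST) and treat a false result as a spam alert.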

Client-Side-Scripts - No entry without the script!

A very interesting method is to generate the HTML code for the form at the client. Since most spam bots do not have their own script engine, they will simply not see the form. Even bots that interpret embedded script commands can be trapped if the script commands are located in an extra file. Here is an example using JavaScript in a file printForm.js, included from an HTML document guestbook.html:

   /* File printForm.js */
   function testwrite() {
      document.write("<form action=\"process.php\" method=\"post\">");
      document.write("Name: <input name=\"Name\" type=\"text\"><br>");
      document.write("Entry: <textarea name=\"Entry\" cols=\"50\" rows=\"10\"></textarea>");
      document.write("<input type=\"submit\" value=\"Submit\">");
      document.write("</form>");
   }   
   <!-- File guestbook.html -->
   ...
   <!-- Include JavaScript file in HTML head: -->
   <script src="printForm.js" type="text/javascript"></script>
   ...
   <!-- Call the JavaScript function at an appropriate point within the body of your HTML page -->
   <script type="text/javascript">
      testwrite();
   </script>

However, a disadvantage is that client-side scripting is quite often disabled by the user, even in modern browsers. Many visitors would be unable to drop an entry, since they cannot see the form that should be generated by the client-side script. But if you require your visitors to have client-side scripting enabled anyway, then this technique is a good choice!

Flash - We will achieve our target using multimedia!

Another very effective way of preventing guestbook/comment spam is to use Flash for the form. Hardly any spam bot can deal with Flash - problem solved! But keep cool, there is a catch: the user needs to have a Flash plugin installed and enabled. Tiny, lightweight browsers and text browsers are completely excluded from a Flash guestbook.

For those who are not willing to use Flash, there is the possibility of using Java instead, but the problem remains the same: a working Java interpreter is often not available...

Methods that help only at first sight

Our aim is to prevent guestbook spam. If we become too ambitious, we fight human visitors as well; that is why we should never lose sight of our website as a whole. For those who have kept up to this point, I am going to list techniques that are usually of no use - at least in my view.

Blocking IP addresses

For some reason, a very popular suggestion in various forums is to block the spammer's IP. This is usually useless, because spam bots use zombie computers that change from time to time. Most of the spam attempts here on my webpage are limited to one per IP. However, you can still have a look at your usage statistics and eliminate IPs with recurrent spam attempts. The only appropriate use of IP blocking is thus for single, recurrent spam IPs. It is very dangerous to adjust block lists dynamically: if you manage to add a web proxy to your block list, a whole subnet is excluded from your guestbook or comment module. What is more, you have to update your list of blocked IPs from time to time and remove IPs that have not spammed (or attempted to spam ;-) ) for a certain amount of time.
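If you do maintain such a manually curated list, the expiry can be sketched like this (plain array storage and the 30-day lifetime are my own choices for illustration; a real site would keep the list in a file or database):

```php
<?php
// Hypothetical expiring block list: each IP is stored with the time of
// its last spam attempt and is ignored again after $ttl seconds.
function is_blocked($ip, $blocklist, $now, $ttl = 2592000) { // 30 days
    return isset($blocklist[$ip]) && ($now - $blocklist[$ip]) < $ttl;
}
```

The processing script would call is_blocked($_SERVER['REMOTE_ADDR'], $blocklist, time()) before accepting an entry, and refresh the stored time whenever a new spam attempt is detected.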

Force visitors to register

Requiring your users to register just for a single guestbook entry or a single comment is usually a bad idea, because it demands too much effort from the visitor. Just think about yourself: would you drop a comment on a newly visited site if you had to register first? If the webmaster is known to you, then maybe; otherwise I wouldn't. But maybe you're different...

Confirmations by email

Confirmation via email is not as unhandy as the two methods above. This method almost made it into one of the other categories, but I finally decided to leave it here. Why? Such email confirmations require the user to supply a valid email address; after that he has to check his email and (usually) open a specific web address to complete his entry. In my view this is too much effort for a single entry. What is even worse: sloppy programming might even turn your webserver into a spam sling! Don't forget that the visitor no longer has a choice about whether he is willing to hand over his email address, especially if there is no option like 'hide my email address from other users'. Still, as with the other two methods above: in particular situations this technique might be meaningful and a good thing to do.

Conclusion

There is no simple and idiot-proof way of preventing guestbook and comment spam, but there are lots of possibilities for getting rid of most of it. One method on its own is not powerful enough to eliminate all spam, but combining all or most of them leads to a very powerful spam barrier. Unfortunately, spam bots are improving as well, and some methods might become useless in the near or far future - but that is part of the cat-and-mouse game...

Now, at the very end of this article, I would like to ask for wishes, suggestions, complaints and so on. Hints at grammar errors or misspellings are especially welcome! You can either use the comment module below or write me an email; I am looking forward to every kind of serious feedback! For those who are still looking for more information about spam, I can recommend my selection of funny guestbook spam. :-)



Comments:



Nignuntonee, on 23/ 11/ 2012 at 10:07
When I initially commented I clicked the "Notify me when new comments are added" checkbox and now each time a comment is added I get three e-mails with the same comment. Is there any way you can remove people from that service? Thanks!
jake, on 16/ 8/ 2012 at 12:29
hi. I came here after a <a href="google.com">google</a> search on a similar topic
Karl Rupp, on 28/ 1/ 2012 at 21:14
Hi Paul,
you are certainly missing semicolons after 'return true' and 'return false'.
Paul Proft, on 28/ 1/ 2012 at 20:00
I used your timestamp method and inserted the code in my guestbook form (onsubmit and echo):

<form method="post" onsubmit="<?php if(time() - $_POST["timestamp"] > 10) {return true} else {return false} ?>" action="<?=$self?>">
<?php echo "<input type=\"hidden\" name=\"timestamp\" value=\"" . time() ."\">"; ?>
...
(input boxes)
...
<input type="submit" value="Submit">
</form>

It doesn't work (parsing error). Do you see any obvious mistakes in the code?

Thanks.
diet tea, on 26/ 8/ 2010 at 02:07
Wow, this was a really quality post. In theory I' d like to write like this too - taking time and actual effort to make a great article... but what can I say... I procrastinate alot and in no way appear to get something done.
chris, on 1/ 6/ 2010 at 11:17
this article was a great help improving my guestbook.

thanks a lot.
himanshu, on 26/ 1/ 2009 at 06:56
best article about the topic. it list all methods that can be used for stop spam
alan (http://www.motasoft.co.uk), on 21/ 12/ 2008 at 21:07
Best article I have read, seems to cover all the methods I have seen in one hit. Well done. Regards Alan
Benjamin (http://projects.citrosaft.com/floodassassin), on 1/ 12/ 2008 at 19:06
I know, this is also something like spam, but I developed a little class which allows you to do a filtering without all this stuff :) The class is under GPL, so you can use it for free. It also would be nice if I get some feedback.
Look on my Homepage to get more information.
Paul Szilard (http://www.remektek.com.au/wb/index.php), on 2/ 1/ 2008 at 12:27
Brilliant! Thanks for a stack of great ideas! Well done!
Bill (http://www.ijumboloan.com), on 1/ 11/ 2007 at 19:35
how do I put these in my simple contact form ?

the time stamp, and <?php
$entry = $_POST["entry"]
$points = 0;
$points = $points + 1 * substr_count(strtolower($entry), 'viagra');
$points = $points + 2 * substr_count(strtolower($entry), 'phentermine');
$points = $points + 2 * substr_count(strtolower($entry), 'tramadol');
$points = $points + 3 * substr_count(strtolower($entry), '<a href=');

if ($points < 5){
//process entry
} else {
//spam alert
}
?>
_______________

I can email you php script upon response.

thanks
anton (http://www.kernelpanic.nu), on 26/ 6/ 2007 at 23:19
EDIT: Moved back to English page, sorry for the bug :-(

I was reading you english version of this page but wasn't able to post a message there... so I'll try here instead.
I have been trying to fight spam in my guestbook for a while now and found your article pretty interesting. The timestamp-idea is something that I haven't tried yet but I think I will.

One thing that I have seen works very well to prevent spam is to randomly create the names for e.g the input tags. By doing this and storing them in a session variable, the name of the input tags will be different on every visit to the page.

Good luck on you spam fighting. One day we will win ;)
Last update: 5/3/2007