Securing Your Web Site From Email Harvesters

Spam blocking costs all of us time and money. SCOCA has spent money on software and hardware to combat the problem, and guarding against spam takes time that would be better spent on other tasks, yet the problem keeps growing as spammers become more sophisticated in their techniques. It’s a never-ending arms race. On the school district side, you can do much to help by doing your best to block the email harvesters that cruise your web site and gather your email addresses.

 

What are email harvesters?

 

You are probably already familiar with search engine “bots” (or spiders). These are software programs used by sites such as Google that cruise the web to gather information about your site from publicly accessible web pages. These bots not only read the information that your viewer can read (text and links), but also check the “<meta>” tags in the “<head>” section of the page for site descriptions and copyrights, as well as the “<title>” tag. These bots are benign and serve a useful purpose: they help people searching the web find a particular site.

 

Email harvesters work in the same way, but they look only for mailto: links and grab any email addresses they find. On SCOCA’s end, blocking harvesters is practically impossible because most of them masquerade as regular web browsers or even as search engine bots.
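To see why a plain mailto: link is such an easy target, here is a minimal sketch of the kind of pattern matching a harvester might do. The function name findMailtoAddresses and the two sample strings are assumptions made up for illustration; real harvesters are certainly more elaborate.

```javascript
// Illustrative sketch only. The function name and sample strings are
// hypothetical; this is not the harvester used in the test described here.
function findMailtoAddresses(html) {
  // Match "mailto:" and capture everything up to a quote, "?", ">" or whitespace.
  var pattern = /mailto:([^"'?>\s]+)/gi;
  var found = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// A plain link, as it would appear in ordinary page source.
var plainLink = '<a href="mailto:somebody@somewhere.com">Jon Doe</a>';

// The same information split into fragments (the approach covered later
// in this article): the text "mailto:" never appears in one piece.
var scriptedLink = "var first = 'ma'; var second = 'il'; var third = 'to:';";

console.log(findMailtoAddresses(plainLink));    // one address found
console.log(findMailtoAddresses(scriptedLink)); // nothing found
```

A simple scan like this finds the address in the plain link immediately, but it has nothing to match when the link only exists as separate fragments in the page source.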

 

As a test, Brian Rittenour ran a harvester against some of our district web sites, and here is what he was able to find:

 

District 1
Levels: 5

Emails found: 276

Pages/files searched: 1411

Time: 5:56

 

District 2
Levels: 5

Emails found: 236

Pages/files searched: 1929

Time: 4:22

 

District 3
Levels: 5

Emails found: 128

Pages/files searched: 104

Time: 0:35

 

District 4
Levels: 5

Emails found: 157

Pages/files searched: 151

Time: 0:32

 

Brian set the harvester to search 5 levels deep on each site: it collects every email address it finds and follows links to other pages up to 5 levels down. You can see from the stats above that large district sites give a harvester lots of pages to check, yet the run time is still short. In the example above, it would take a spammer only around 11 minutes to grab 797 email addresses.
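The totals quoted above can be checked with a few lines of JavaScript. This sketch assumes the "Time" values are in minutes:seconds format; the variable and function names are just for illustration.

```javascript
// The four test runs listed above. Times are assumed to be "m:ss".
var runs = [
  { district: 1, emails: 276, pages: 1411, time: '5:56' },
  { district: 2, emails: 236, pages: 1929, time: '4:22' },
  { district: 3, emails: 128, pages: 104,  time: '0:35' },
  { district: 4, emails: 157, pages: 151,  time: '0:32' }
];

// Convert an "m:ss" string to a number of seconds.
function toSeconds(time) {
  var parts = time.split(':');
  return parseInt(parts[0], 10) * 60 + parseInt(parts[1], 10);
}

var totalEmails = 0;
var totalSeconds = 0;
for (var i = 0; i < runs.length; i++) {
  totalEmails += runs[i].emails;
  totalSeconds += toSeconds(runs[i].time);
}

var minutes = Math.floor(totalSeconds / 60);
var seconds = totalSeconds % 60;
console.log(totalEmails + ' addresses in ' + minutes + ':' +
            (seconds < 10 ? '0' : '') + seconds);
// prints "797 addresses in 11:25"
```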

 

How do I stop harvesters from hitting my site?

 

As mentioned above, you really can’t stop harvesters from visiting your site, but you can make it very difficult for them to grab email addresses. There are a couple of techniques that, at least for now (harvesters are always evolving in the “arms race”), will keep current harvesters from reading your addresses.

 

One technique is to put all of your email addresses into a Java (not JavaScript) or Flash application. As an example, the SCOCA Staff Directory available from the SCOCA Home page is in Flash. Currently, harvesters cannot read Flash .swf files. In the case of our staff directory, it is a separate Flash app that is called from another Flash .swf file, so even finding it is a bit difficult. In addition, the directory data itself is pulled from an XML file.

 

Of course, programming in Flash takes time, and as a tech coordinator you probably want an easier solution. Currently the best way of preventing harvesting is to break up the mailto: link with JavaScript so that the harvester can’t find it.

 

Using JavaScript to hide your email addresses from Harvesters

 

The typical email link looks like this:

 

<a href="mailto:somebody@somewhere.com">Jon Doe</a>

 

Using some JavaScript code, we can break this address up and obfuscate it from harvesters, while making it look and work normally to regular users viewing your page:

 

<SCRIPT LANGUAGE="javascript">
<!-- // Javascript Email Address Encoder
//  by www.stevedawson.com

      var first = 'ma';
      var second = 'il';
      var third = 'to:';
      var address = 'somebody';
      var domain = 'somewhere';
      var ext = 'com';
      document.write('<a href="');
      document.write(first+second+third);
      document.write(address);
      document.write('&#64;');
      document.write(domain);
      document.write('.');
      document.write(ext);
      document.write('">');
      document.write('Jon Doe</a>');
// -->
</script>

 

This JavaScript code is what is used on all SCOCA Department pages. Staff members are listed in the right column of department pages with a link to their email addresses. As an example of how we’ve done this, go to the System Support and Development department page and then right-click to view the page source. Here’s what the first two entries for the email addresses look like:

 

<SCRIPT LANGUAGE="javascript">
<!-- // Javascript Email Address Encoder
//  by www.stevedawson.com

     var first = 'ma';
     var second = 'il';
     var third = 'to:';
     var address = 'bbirkhimer';
     var domain = 'scoca-k12';
     var ext = 'org';
     document.write('<a href="');
     document.write(first+second+third);
     document.write(address);
     document.write('&#64;');
     document.write(domain);
     document.write('.');
     document.write(ext);
     document.write('">');
     document.write('Brian Birkhimer</a>');
// -->
</script>
  <br>
  Software Development Coordinator<br>
    <br>
  <SCRIPT LANGUAGE="javascript">
<!-- // Javascript Email Address Encoder
//  by www.stevedawson.com

     var first = 'ma';
     var second = 'il';
     var third = 'to:';
     var address = 'ryanm';
     var domain = 'scoca-k12';
     var ext = 'org';
     document.write('<a href="');
     document.write(first+second+third);
     document.write(address);
     document.write('&#64;');
     document.write(domain);
     document.write('.');
     document.write(ext);
     document.write('">');
     document.write('Ryan McClay</a>');
// -->
</script>
  <br>
  Systems Manager<br>
    <br>

 

A quick breakdown of this code shows that mailto: is broken into three parts, followed by the username (address), domain (domain), and domain extension (ext). These pieces are then put together with document.write statements. The @ character is written as the HTML character reference &#64; (64 is its ASCII code), so the symbol itself never appears in the page source.
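The breakdown above can also be wrapped in a small helper so the pieces are assembled in one place. This is only a sketch: buildMailtoLink is a hypothetical name and is not part of the encoder script credited above.

```javascript
// Hypothetical helper (not part of the original encoder script) that
// assembles the same link markup from its separate pieces.
function buildMailtoLink(address, domain, ext, label) {
  var scheme = 'ma' + 'il' + 'to:'; // kept in three fragments, as in the script above
  var at = '&#64;';                 // HTML character reference for "@"
  return '<a href="' + scheme + address + at + domain + '.' + ext + '">' +
         label + '</a>';
}

var link = buildMailtoLink('somebody', 'somewhere', 'com', 'Jon Doe');
console.log(link);
// prints: <a href="mailto:somebody&#64;somewhere.com">Jon Doe</a>

// 64 is the character code that &#64; refers to.
console.log(String.fromCharCode(64)); // prints "@"
```

In a page, the returned string would be handed to document.write(link) from an inline script, which gives the same result as the separate document.write statements.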

 

You may be wondering how to do this if the email address has a longer domain, e.g. jdoe@cu.k12.oh.us. One way of doing this is to put the extra parts of the domain into the domain variable like this:

 

     var address = 'jdoe';
     var domain = 'cu.k12.oh';
     var ext = 'us';

 

Or, you can just add more domain variables (with corresponding document.write statements) like this:

 

     var address = 'jdoe';
     var domain = 'cu';
     var domain2 = 'k12';
     var domain3 = 'oh';
     var ext = 'us';

…….

     document.write(domain);
     document.write('.');
     document.write(domain2);
     document.write('.');
     document.write(domain3);
     document.write('.');
     document.write(ext);

 

Even though the second method requires more code, I would probably use it to break up the address as much as possible.
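If you would rather not add a new variable for every part of a long domain, the parts can be kept in an array and joined with dots. This is a sketch of that idea rather than the code we use; buildAddress is a hypothetical name.

```javascript
// Hypothetical helper: joins any number of domain parts with dots, so the
// full address is never stored as a single string in the page source.
function buildAddress(user, domainParts) {
  var at = '&#64;'; // HTML character reference for "@"
  return user + at + domainParts.join('.');
}

var parts = ['cu', 'k12', 'oh', 'us'];
console.log(buildAddress('jdoe', parts));
// prints "jdoe&#64;cu.k12.oh.us"
```

The result can be written into the href attribute after the mailto: fragments, in the same way as the address, domain, and ext variables above.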

 

That’s all there is to making your site more secure against email harvesters. It makes more work for you or your web staff, but the results are worth it, at least until harvesters find a way around this method, which at some point is probably going to happen.

 

There are plenty of other examples of using JavaScript for harvester blocking, and I advise you to search the web and check them out. This code works for us, and was easy to implement.

 

Bonus Section: Why having a database-backed site makes this easier

 

While the JavaScript code in this article works well in blocking harvesters, it is of course a real pain to go through your site and re-write all of your email links. In the case of SCOCA department pages, everything is pulled from a database. I actually only coded one department page, and all of the information is generated automatically depending on the ID of the department. As an example, here are the URLs for our System Support and Development page and the INFOhio Department page:

 

http://www.scoca-k12.org/departments/department.php?did=8

 

http://www.scoca-k12.org/departments/department.php?did=6

 

Both links point to the same page (department.php); the only difference is the department ID number (did). When you access one of our department pages, PHP checks the ID number and then pulls all of the information for that department from the MySQL database. This includes all of the links in the left column, the articles and bulletins in the middle column, and the email links in the right column.
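On our site this lookup is done in PHP, but the idea of reading the department ID out of the URL can be sketched in JavaScript as well. The function name getDepartmentId is hypothetical, and the number check is an extra precaution added for this example, not a description of what department.php does.

```javascript
// Sketch: read the "did" query parameter from a department page URL.
function getDepartmentId(pageUrl) {
  var id = new URL(pageUrl).searchParams.get('did');
  // Accept whole numbers only, so nothing unexpected reaches a database lookup.
  return /^\d+$/.test(id || '') ? parseInt(id, 10) : null;
}

console.log(getDepartmentId('http://www.scoca-k12.org/departments/department.php?did=8')); // 8
console.log(getDepartmentId('http://www.scoca-k12.org/departments/department.php?did=6')); // 6
console.log(getDepartmentId('http://www.scoca-k12.org/departments/department.php'));       // null
```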

 

For the email links, all I had to do was enter the JavaScript code once within a loop and pull the email information from the database.

 

 

First, here is the simple SQL statement that gets the employee information from the “staff” table:

 

$query_Recordset2 = sprintf("SELECT * FROM staff WHERE did = '%s' ORDER BY lastname ASC",

 

WHERE did = '%s' // %s is replaced with the Department ID number taken from the page URL

 

To make a list of the staff members and their email addresses, I construct a loop:

 

while (!$Recordset2->EOF) {

 

And this allows me to enter all of the code just one time:

 

 

 

<div id="headlines">
  <h3 class="style1"><?php echo "$dept"; ?> Staff</h3>
  <?php
  while (!$Recordset2->EOF) {
?>
  <br>
  <SCRIPT LANGUAGE="javascript">
<!-- // Javascript Email Address Encoder
//  by www.stevedawson.com

     var first = 'ma';
     var second = 'il';
     var third = 'to:';
     var address = '<?php echo $Recordset2->Fields('username'); ?>';
     var domain = 'scoca-k12';
     var ext = 'org';
     document.write('<a href="');
     document.write(first+second+third);
     document.write(address);
     document.write('&#64;');
     document.write(domain);
     document.write('.');
     document.write(ext);
     document.write('">');
     document.write('<?php echo $Recordset2->Fields('firstname'); ?> <?php echo $Recordset2->Fields('lastname'); ?></a>');
// -->
</script>
  <br>
  <?php echo $Recordset2->Fields('position'); ?><br>
  <?php
    $Recordset2->MoveNext();
  }
?>
</div>
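The same "write the code once, loop over the records" idea can be sketched entirely in JavaScript. This is not how our pages work (the loop above runs in PHP on the server); it is only an illustration, and renderStaffList is a hypothetical name. The two records use the field names from the staff table and the two entries shown earlier. Note that the function only helps against harvesters if it runs in the visitor's browser, with the result passed to document.write.

```javascript
// Sample records using the staff table's field names (username, firstname,
// lastname, position). In a real page these would come from the database.
var staff = [
  { username: 'bbirkhimer', firstname: 'Brian', lastname: 'Birkhimer',
    position: 'Software Development Coordinator' },
  { username: 'ryanm', firstname: 'Ryan', lastname: 'McClay',
    position: 'Systems Manager' }
];

// Hypothetical helper: builds one obfuscated link per record.
// Assumes the record values are trusted and need no HTML escaping.
function renderStaffList(rows, domain, ext) {
  var scheme = 'ma' + 'il' + 'to:'; // kept in fragments, as in the encoder script
  var html = '';
  for (var i = 0; i < rows.length; i++) {
    var row = rows[i];
    html += '<a href="' + scheme + row.username + '&#64;' + domain + '.' + ext + '">' +
            row.firstname + ' ' + row.lastname + '</a><br>' +
            row.position + '<br>';
  }
  return html;
}

console.log(renderStaffList(staff, 'scoca-k12', 'org'));
```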

 

 

 

 

 
