Filtering Spam

Combined techniques give best results

Over the last few years, the amount of spam and phishing e-mail on the Internet has increased dramatically. Experts say that more than half of all e-mails today are unsolicited bulk e-mails. While sender authentication may reduce spam in the future, such solutions are neither common nor universally accepted yet. In Shamrock's NetMail, a heuristic combination of filter techniques has proven effective. Interestingly, these techniques work against viruses, worms and Trojans as well.

Spam floods the Internet
What can be done?
Combining techniques
Checking HELO strings
Blacklists - how good are they?
Greylisting: Use at your own risk
SPF: Sender Policy Framework
Stopping backscatter
Sad but true: Spam works!

Outbreak
Anatomy of four spam attacks. Fortunately this e-mail server has blocked most of these unwanted e-mails (colored red) based on bogus HELO strings.

Spam floods the Internet

When we talk about spam here, our definition is quite simple: Unrequested and unwanted advertising e-mail. At least half of all Internet e-mail is spam. Where does it come from?

Spam e-mails are not only annoying for those who receive them but are also a threat to mail servers. During a heavy spam run, a bot network can eat up nearly all of the available Internet bandwidth, and server resources (CPU time, TCP control blocks etc.) may reach their limits so that legitimate e-mails are blocked.

If you retrieve your mails from a POP3 mailbox at your Internet provider, you should not rely on a spam filter there. Often spam is only tagged with [SPAM:] in the subject line or with "x-spam:" in the mail header, but the criteria for this are mostly undisclosed, unclear and ineffective. Providers are obliged to deliver e-mails, and a single false positive could have legal consequences if a customer complains. So they cannot block misconfigured and misbehaving senders as strictly as a corporate server can. If you want to get rid of unwanted e-mails, you will have to filter them out in your own e-mail system, using your own preferences.

What can be done?

Unfortunately, filtering spam is not an easy job, and not an exact science. A filter actually deleting most spam e-mails will probably also delete some ham (non-spam) mails, so-called false positives. And a filter which guarantees not to delete any ham will be very ineffective in deleting obvious spam. So what can be done to be as reliable and effective as possible?

  1. HELO: The receiving SMTP server should block senders with syntactically invalid domains behind HELO/EHLO, like "HELO none" or "HELO win_nt.example.com". Only alphanumeric characters, dots and hyphens are allowed, underscores are not. If an IP is given which is not identical with the sender's IP, it can be blocked, too. (But note that the HELO domain is not necessarily the DNS equivalent of the system's IP address!) Many viruses and spammers use the recipient's IP or domain for HELO instead of their own. It is extremely suspicious if connections from the same IP address use different HELOs.
  2. Own domain: If you have a domain of your own and your internal e-mails do not go through the Internet, you can block all incoming mails having your domain in the From address since the sender is obviously faked.
  3. Dynamic IP addresses belonging to dial-up ranges like DSL or cable modems are typical sources of spam, worms and viruses from trojan horses. They can be detected using an RBL query (real-time blackhole list, like pbl.spamhaus.org). In many cases a direct SMTP delivery can even be diagnosed after POP3 retrieval by counting the Received: lines in the mail header, ignoring ones with local IP addresses. To avoid being blacklisted, dial-up customers should always send mails through their provider's SMTP smarthost, even if they are using their own SMTP server for mail reception.
  4. User unknown: When a specific sender causes lots of "user unknown" errors due to random words before @ in the destination address, his mails can be blocked completely for a while even if some are addressed to existing users. (This works perfectly together with greylisting because the sender will already be blocked when he retries the e-mails later!)
  5. Pipelining: If the sender fires multiple commands and perhaps even the mail text without waiting for a response even though pipelining is not advertised by the server, the mail should be rejected.
  6. Web links in the text may be one hint that an e-mail might be spam, especially if it is one with a query string behind it, allowing the sender to identify who clicked it and thus prove the success of his spam.
  7. HTML-only mails are very common for spam. However, unfortunately, some large webmail providers like Hotmail or Web.de use HTML instead of plain text as the default setting, so blocking all HTML-only mails is dangerous.
  8. Bayes filters are fairly good for single users who sort out their spam and thus let the filter learn what's ham and what's spam, but they can be confused by adding well-selected ham words ("poison"). Also, if you receive most of your e-mails in Spanish or German and most spam is in English, most English e-mails will be deleted as false positives. Bayes filters are not ideal for server-based multi-user solutions either; a good manual selection of typical spam words may give better results in these environments and also decreases the server CPU load by testing only a few dozen words instead of tens of thousands.
  9. Camouflaged words like V1agra should be added to the banned word index, too. However, replacing characters by punctuation like Cia!is confuses typical Bayes filters, since these are treated as delimiting characters. Only a few mail programs (like Shamrock NetMail) are still able to score such a variation.
  10. Return mails to spammers are nearly useless since they probably won't read them, but they may help to inform senders whose e-mails were filtered erroneously (false positives), which can never be ruled out completely. In general, errors should be returned during the SMTP dialogue instead of sending a return mail later. This will also reduce future spam since some spammers use address verification software.

In fact there are even more ways to make life more difficult for a spammer, like using a tar pit (Teergrube) inserting a delay of a few seconds after each destination address given in the SMTP dialogue in order to slow down mass mailings. However, in the age of Trojans used for spam distribution, this is not always efficient since the same spam e-mail comes from hundreds and thousands of compromised PCs.

Honeypots are also used by some ISPs: They put faked mail addresses with their own domain on a website and wait until harvesters actually send bulk e-mails to these, then they blacklist the originating addresses. But here we are dealing with receiving less spam instead of more, so we will not discuss honeypots in more detail.

Combining techniques

Just as an example, the NetMail software from Shamrock (which is free for up to three users) detects spam e-mails using a combination of the criteria described above, each adding an adjustable percentage to the spam score.

In addition, there is a text filter in NetMail which allows you to define words or phrases with a spam probability in percent, e.g. 80 % for "Viagra". These phrases are searched for in the mail subject and in the text body. Punctuation characters, line breaks and spaces are ignored when comparing text so that even v.i:a-g.r/a is found. Similar to Bayes filters, negative values are allowed for words which are typical for non-spam. But the word list typically has fewer than 100 well-selected entries with scores from 25 to 75 percent, so as few as four "bad" words will cause the mail to be rejected. If some new sort of spam comes out it is fairly easy to add new words or change the score of list entries. However, since different users get very different e-mails, a filter optimized for one company (selling car parts, for instance) is not necessarily well-trimmed for another (e.g. offering medication), so the administrator will most probably have to adjust the scores of some words.

The mail is blocked with a return mail (and optionally stored into a "Spam" folder) if the score sum is 100 to 199. No return mail will be sent back, however, if the sum reaches 200 or more since it is obvious that this cannot be a false positive. If one sender creates more than 10 consecutive "user unknown" errors, he will be blocked completely. At Shamrock and for many NetMail users this scheme has proven to be very effective.
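The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration, not NetMail's actual code: the word list, its scores and all function names are made up for the example. Only the normalization rule (ignore punctuation, spaces and line breaks) and the 100/200 thresholds are taken from the text.

```python
import string

# Hypothetical word list: phrase -> score in percent (negative = typical ham).
WORD_SCORES = {"viagra": 80, "casino": 50, "invoice": -20}

# Characters ignored when comparing text: punctuation and all whitespace.
_IGNORED = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize(text):
    """Lower-case the text and strip punctuation, spaces and line breaks."""
    return text.lower().translate(_IGNORED)


def text_score(subject, body):
    """Sum the scores of all listed phrases found in subject or body."""
    haystack = normalize(subject) + "\n" + normalize(body)
    return sum(score for phrase, score in WORD_SCORES.items()
               if normalize(phrase) in haystack)


def verdict(total):
    """Map a score sum to the action described in the text."""
    if total >= 200:
        return "reject silently"
    if total >= 100:
        return "reject with return mail"
    return "accept"
```

With this list, text_score("v.i:a-g.r/a", "Play c a s i n o now") adds 80 and 50, so the mail would be rejected with a return mail. Note that ignoring spaces also means a phrase can match across word boundaries, which is one more reason to keep the list short and well selected.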

Checking HELO strings

If you do not retrieve your mails from a POP3 mailbox but run your own SMTP server for e-mail reception, there are even more efficient ways to filter out unwanted mails. Please see our SMTP article for suggestions on how to run a server with a DSL or cable Internet access. For instance, the HELO identification of the sending server and its IP address can only be seen with direct reception, and invalid strings can be filtered:

The HELO string...  is invalid because...                          and should be something like...
computer1           A fully qualified Internet domain is required  computer1.example.com
mx_1.example.com    Underscores are invalid in a domain name       mx-1.example.com
82.119.148.246      IP addresses must be in brackets               [82.119.148.246]
[127.0.0.1]         127.0.0.1 does not match the external IP       Any of the above
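
The four cases in the table can be checked with a small function. The following is a minimal sketch, not the code of any particular mail server; the function name and the reason strings are invented for the example, and only IPv4 address literals are handled.

```python
import ipaddress
import re

# One DNS label: letters, digits and hyphens only, no leading or trailing hyphen.
_LABEL = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")


def check_helo(helo, peer_ip):
    """Return "ok" or a short reason why the HELO argument looks bogus."""
    if helo.startswith("[") and helo.endswith("]"):
        # Address literal: must parse and must equal the connecting IP.
        try:
            literal = ipaddress.ip_address(helo[1:-1])
        except ValueError:
            return "malformed address literal"
        if literal != ipaddress.ip_address(peer_ip):
            return "address literal does not match the sender's IP"
        return "ok"
    try:
        ipaddress.ip_address(helo)
        return "IP addresses must be in brackets"
    except ValueError:
        pass  # not a bare IP address, so treat it as a domain name
    labels = helo.split(".")
    if len(labels) < 2:
        return "not a fully qualified domain"
    if not all(_LABEL.match(label) for label in labels):
        return "invalid character in domain name"
    return "ok"
```

Called with the connecting IP 82.119.148.246, the four invalid strings from the table each yield a reason, while mx-1.example.com and [82.119.148.246] pass.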

Furthermore, the HELO domain should match the From domain for typical large providers. For instance, if the "from" domain contains "yahoo." then the HELO string should also contain this word. The same is true for AOL, Hotmail, GMX, Web.de and others, but take care: If the e-mail address ends with @msn.com, the mail server and thus the HELO string is not MSN but Hotmail. Of course this check works for a few large providers only, but since much spam comes from faked Yahoo or Hotmail senders, it is quite efficient (and, by the way, quite similar to SPF = sender policy framework, but avoiding the DNS overhead). Also note that a typical HELO includes a subdomain, i.e. mail.example.com is used; a bare example.com without a host name in front of it is somewhat suspicious.
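
As a rough sketch, the provider check could look like this in Python. The mapping table simply restates the providers named above (including the msn.com special case) and may not reflect the host names these providers use today; the function name is an assumption for the example.

```python
# From-domain marker -> label expected in the HELO string (taken from the text above).
PROVIDER_HELO = {
    "yahoo.": "yahoo.",
    "aol.": "aol.",
    "hotmail.": "hotmail.",
    "msn.com.": "hotmail.",   # msn.com mail is sent by Hotmail servers
    "gmx.": "gmx.",
    "web.de.": "web.de.",
}


def helo_matches_from(helo, mail_from):
    """False if a big-provider From domain is paired with a foreign HELO."""
    # Surround both names with dots so that markers only match whole labels.
    domain = "." + mail_from.rpartition("@")[2].lower() + "."
    helo = "." + helo.lower() + "."
    for marker, expected in PROVIDER_HELO.items():
        if "." + marker in domain:
            return "." + expected in helo
    return True  # not one of the listed providers: no opinion
```

A mail from user@yahoo.com announced with HELO pc1.example.net fails this check, while any sender domain not in the table is passed on to the other filters.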

Don't hang up!
During a heavy spam attack with 500 e-mails per minute or more, it is better to let the spam bots disconnect, and they typically do this after an error to avoid the additional time for the QUIT command. If the server disconnects itself, the operating system holds a TCP control block in the TIME_WAIT state for several minutes. This can quickly cause an exhaustion of server resources, limiting the number of connections per minute to a few hundreds only.

Is it legal to block an obviously wrong HELO string? RFC 2821, the "bible" for SMTP implementation and written in 2001, states: "The argument field contains the fully-qualified domain name of the SMTP client if one is available. In situations in which the SMTP client system does not have a meaningful domain name (e.g., when its address is dynamically allocated and no reverse mapping record is available), the client should send an address literal." So anything else than the Internet domain or the public IP address in brackets is clearly non-RFC compliant.

Regarding underscores and other illegal characters, RFC 2821 says: "Characters outside the set of alphas, digits, and hyphen must not appear in domain name labels for SMTP clients or servers. In particular, the underscore character is not permitted. SMTP servers that receive a command in which invalid character codes have been employed, and for which there are no other reasons for rejection, MUST reject that command with a 501 response."

However, some warm-hearted people cite RFC 1123, dated 1989, when viruses and spam were totally unknown: "The sender-SMTP must ensure that the <domain> parameter in a HELO command is a valid principal host domain name for the client host. The HELO receiver may verify that the HELO parameter really corresponds to the IP address of the sender. However, the receiver must not refuse to accept a message, even if the sender's HELO command fails verification." One might think that this forces a server to accept any bogus HELO string, but this is completely wrong: This text only says that a message must be accepted if the HELO parameter is syntactically correct but a reverse DNS lookup of it (RDNS, RR) does not return the sender's IP. Though some fussy mail servers do an RDNS lookup, this is not a good idea at all since many legitimate senders have HELO strings like "pc1.local" or similar.

And we might go one step further and block even literal IP addresses in brackets behind HELO or EHLO. RFC documents as the bible for technical implementation say literal IPs are legal, but spam is war, and other rules may apply: If it is obvious that practically all such senders distribute spam or viruses, a policy blocking them may be acceptable.

Another very efficient way to reject dial-up hosts is to block generic HELO domains like 1.2.3.4.dialup.example.com. Though such domain names are completely RFC-compliant, senders using them are typically Trojans in infected PCs. However, there is a chance of some false positives since most broadband providers refuse to allow true domain names even for static IPs. This is why not only dynamic but also static subscribers should prefer an external SMTP smarthost to send their e-mails.
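
One possible heuristic for spotting such generic names is to test whether all four bytes of the sender's IP address appear as numbers in the HELO domain, in any order. This is an assumption made for the sketch below, not a rule from any standard, and the function name is invented; it will miss generic names that encode the address in hex or omit it.

```python
import re
from collections import Counter


def helo_embeds_ip(helo, peer_ip):
    """True if every byte of the IPv4 address occurs as a number in the HELO."""
    numbers = Counter(str(int(n)) for n in re.findall(r"\d+", helo))
    octets = Counter(str(int(o)) for o in peer_ip.split("."))
    # Counter subtraction keeps only bytes that are missing from the HELO.
    return not (octets - numbers)
```

Both 1.2.3.4.dialup.example.com (for 1.2.3.4) and a reversed form such as host-246-148-119-82.dsl.example.net (for 82.119.148.246) are caught, while mx1.example.com is not.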

What's this?
Harvester: A program which uses a search engine to find web pages and then stores all e-mail addresses from them for sending unsolicited bulk mails later.
Honeypot: A server which looks like an open relay, inviting spammers to send their mails over it and then tracks their IP addresses.
Open relay: An SMTP server over which anyone can send e-mails to any destination without authentication, obscuring his identity.
Spam trap: A server whose e-mail address is published on a web page, waiting for mails from harvesters to store their addresses e.g. in a blacklist.

Blacklists - how good are they?

When more and more spam, Trojans, viruses and phishing mails came up, non-profit organisations like Spamhaus and others created blacklists which mail servers can query to find out whether a specific IP is a known sender of such malware or not. Since the DNS query mechanism is used, such a list is typically called DNSBL (DNS blacklist) or RBL (Real-time Blackhole List). To query if the IP address 1.2.3.4 is a source of unwanted e-mails, reverse its byte order and add the hostname of a blacklist, for example:
4.3.2.1.sbl.spamhaus.org
If the host returns an IP address (typically in the 127.0.0.x range), then mails from the questioned IP address should not be accepted or at least tagged.
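
In Python, such a lookup might be sketched as follows, using only the standard library. The function names are made up for the example, only IPv4 is handled, and a lookup failure is treated the same as "not listed", which a production server may want to distinguish. Some lists also reserve certain return codes for errors rather than listings, so the list's documentation should be consulted.

```python
import ipaddress
import socket


def dnsbl_name(ip, zone):
    """Build the DNS name to query, e.g. 1.2.3.4 -> 4.3.2.1.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="sbl.spamhaus.org"):
    """True if the blacklist answers with an address in the 127.x.x.x range."""
    try:
        answer = socket.gethostbyname(dnsbl_name(ip, zone))
    except OSError:
        return False  # no such name (not listed) or the lookup failed
    return answer.startswith("127.")
```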

There are dozens of blacklists today, many being specialised. For instance, pbl.spamhaus.org finds out if an IP belongs to a dynamic address space typical for dial-up lines, sbl.spamhaus.org collects addresses of known spammers, and xbl.spamhaus.org checks for Trojans and open proxies. In many cases, a combined blacklist like zen.spamhaus.org is also available to avoid multiple queries for one address. After the HELO check we mentioned above has filtered out many unwanted e-mails, a good blacklist typically will still delete a few percent of the remaining mails.

Take care, however! While the process of filling blacklists with evil IP addresses is mostly automated using spam trap mailboxes or honeypots, unfortunately there is a chance of false positives. Imagine an Internet provider with thousands of customers. If only one of them is sending out spam or viruses, the provider's IP or even a whole address range (a netblock) may get blacklisted, affecting other customers as well. When the provider wants to get off the blacklist, this may be a long and frustrating process since most lists are controlled by volunteers. While Spamhaus seems to do a pretty good job with very few false positives, others sometimes block legitimate senders. This is why large Internet providers often keep their own non-public blacklists.

Greylisting: Use at your own risk

Direct SMTP reception also allows greylisting. The technique is simple. For each received e-mail, a triplet is built from the sender's IP, the "from" address and the destination address, and stored in a database. All mails with triplets not having occurred earlier will be blocked temporarily ("451 Try again later" in the SMTP dialogue after Rcpt to). When the sender retries the delivery later on, the triplet already exists and the mail can pass through. Most viruses and many spammers never retry, so greylisting often reduces unwanted mail by 30...80 %, depending on which other filters are used in addition. Unfortunately, some buggy mail systems have problems when sending messages to a greylisting server, so they may have to be whitelisted:

Typical server retry times
Retry after  Servers
0...5  min   33 %
5...20 min   33 %
20..60 min   27 %
1...3  h      5 %
3...12 h      2 %
(Snapshot only, your
results may differ!)

A slightly improved version is "light greylisting" which ignores the lowest (last) byte of the IP address when building the triplet. This is useful since large providers like Web.de run server farms for mail delivery with different IP addresses; typically the first three IP bytes will always be the same for all servers in one farm. Unfortunately this is not true for AOL, Amazon or Google Mail; a retry of the same e-mail often comes from a completely different netblock so they should be excluded from greylisting.
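
The triplet logic, including the "light" variant, fits into a small class. This is a minimal in-memory sketch under a few assumptions: the class and parameter names are invented, only IPv4 addresses are handled, and the optional minimum delay before a retry is accepted is an addition not described above (set it to 0 to accept any retry). A real server would store the triplets in a database and expire old ones.

```python
import time


class Greylist:
    """In-memory greylisting sketch keyed on (IP, sender, recipient)."""

    def __init__(self, min_delay=0.0, light=True):
        self.min_delay = min_delay  # assumed: seconds a sender must wait before retrying
        self.light = light          # ignore the last IP byte ("light greylisting")
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def _key(self, ip, mail_from, rcpt_to):
        if self.light:
            ip = ip.rsplit(".", 1)[0]  # 192.0.2.17 -> 192.0.2
        return (ip, mail_from.lower(), rcpt_to.lower())

    def check(self, ip, mail_from, rcpt_to, now=None):
        """Return the SMTP reply to give after RCPT TO."""
        now = time.time() if now is None else now
        key = self._key(ip, mail_from, rcpt_to)
        first = self.first_seen.get(key)
        if first is None:
            self.first_seen[key] = now   # unknown triplet: remember and defer
            return "451 Try again later"
        if now - first < self.min_delay:
            return "451 Try again later"  # retried too quickly
        return "250 OK"
```

With light=True, a retry from 192.0.2.99 is accepted for a triplet first seen from 192.0.2.17, which is the server farm case mentioned above.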

It is quite clear, however, that greylisting will cause a delay for a new sender. Retry times of mail servers vary widely, from a few seconds to several hours (see table). About two thirds retry within twenty minutes. If you are expecting legitimate e-mails from varying sources and the risk of delays is not acceptable, it might be better not to activate greylisting or to whitelist critical senders. If the senders are mostly the same all the time it is useful to have a "learning" inactive greylisting before activating it (like the one in the NetMail SMTP server): Even while greylisting is turned off, new triplets are saved, so mails from known senders will not be delayed when greylisting is turned on later.

SPF: Sender Policy Framework

Some large providers like hotmail.com and aol.com whose domain names are frequently faked and abused by spammers have introduced a sender domain verification called SPF. For instance, if your server receives an e-mail from user(at)hotmail.com, it can look for a DNS text record for hotmail.com and then check if the sender IP matches the IP range(s) given in this record. For instance, the record looks like this for the domain kundenserver.de:
v=spf1 ip4:212.227.126.128/25 ?all
The character before "all" proposes what could be done if the given IP range is not matched by the sender: - = fail (reject), ~ = softfail (kind of suspicious, add something to the spam score), ? = neutral (other IPs might be legal).
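
To illustrate how such a record is evaluated, here is a deliberately reduced sketch that understands only the ip4 and all mechanisms. Real SPF records also use mechanisms such as a, mx and include, which require further DNS lookups and are skipped here; the function name is an assumption. A mechanism without a leading character counts as + (pass).

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def evaluate_spf(record, sender_ip):
    """Evaluate the ip4 and all mechanisms of an spf1 record for one sender IP."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # default when no character precedes the mechanism
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term.lower().startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return QUALIFIERS[qualifier]
        elif term.lower() == "all":
            return QUALIFIERS[qualifier]
        # Other mechanisms (a, mx, include, ...) need DNS lookups; skipped here.
    return "neutral"
```

For the kundenserver.de record shown above, a sender inside 212.227.126.128/25 gives "pass" and any other address gives "neutral" because of the ?all at the end.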

SPF lookup in Windows
To find out if a domain has an SPF record or not, simply open a console and type e.g.:
nslookup -q=TXT gmx.net
The displayed result should be something like "v=spf...".

SPF1, the original Sender Policy Framework proposal, typically looks at the envelope sender only (return-path, MAIL FROM in SMTP) which is empty for bounce mails, while SPF2.0 (Microsoft's Sender ID proposal) also looks at the From address in the mail header. While AOL and most other SPF supporters have an spf1 record only, Hotmail offers multiple records with both spf1 and spf2.0.

Unfortunately, kundenserver.de is just a provider hosting thousands of customer domains. So if a customer sends his e-mails through the kundenserver.de smarthost, the kundenserver.de SPF record will not help at all because the sender has a different domain. This is why a Sender Rewriting Scheme (SRS) is used for SPF-enabled SMTP smarthosts (though it does have some drawbacks because it effectively disables SPF checks for the original sender's domain which may be faked).

For instance, if your domain is example.com and you are sending mails via the smarthost at provider.net, your original sender address is e.g. user(at)example.com, but the smarthost rewrites it to something like:
SRS0=HHH=TT=example.com=user(at)provider.net
with SRS0 = keyword for first rewrite (most typical). In fewer cases two mail hops are involved, requiring a different sender rewriting scheme SRS1. Sample:
SRS1=HHH=forwarder.net==HHH=TT=example.com=user(at)provider.net
In both samples, HHH is a base64-coded checksum (hash) with an algorithm only the provider knows, so he can check the validity of bounces. TT is a base64-encoded timestamp so that this rewritten address is valid for one month only. Equal signs are used as delimiters; sometimes a plus or minus sign is used instead as the first delimiter behind SRS0 or SRS1.

When such a mail bounces, the return mail goes to the server at provider.net which then should forward it to your own SMTP server or your POP3 mailbox. By checking the validity of timestamps and hashes, the provider can determine if this is a legal bounce and deletes it if not. Unfortunately, this does not help much against collateral spam (so-called backscatter) caused by e-mails with faked sender addresses since these are typically sent directly instead of through a provider smarthost. Consequently, some SMTP servers (such as the one of Shamrock NetMail) silently decode the SRS addresses to their original form before processing them further. For instance,
MAIL FROM: SRS0=94/=WX=example.com=user(at)provider.net
in the SMTP dialogue is decoded to
Return-Path: user(at)example.com
in the header of the received e-mail to avoid ambiguities in further processing.
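
Decoding the SRS0 and SRS1 forms shown above can be sketched with two regular expressions. This is only an illustration of the address layout described in this article, not the code of NetMail or any other product; the function name is invented, and real addresses contain "@" where the samples above print "(at)". The hash and timestamp are not validated, since only the rewriting provider can do that.

```python
import re

# SRS0<sep>hash=timestamp=original-domain=original-local
_SRS0 = re.compile(r"^SRS0[=+-]([^=]+)=([^=]+)=([^=]+)=(.+)$", re.IGNORECASE)
# SRS1<sep>hash=forwarder==<rest of the original SRS0 local part>
_SRS1 = re.compile(r"^SRS1[=+-]([^=]+)=([^=]+)==(.+)$", re.IGNORECASE)


def decode_srs(address):
    """Return the original sender address, or the input if it is not SRS-encoded."""
    local = address.rpartition("@")[0]
    m1 = _SRS1.match(local)
    if m1:
        local = "SRS0=" + m1.group(3)  # unwrap the second hop first
    m0 = _SRS0.match(local)
    if m0:
        _hash, _timestamp, orig_domain, orig_local = m0.groups()
        return orig_local + "@" + orig_domain
    return address
```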

Yet only a few providers check incoming e-mails for SPF validity. One reason is simply that the vast majority of legitimate mail servers do not have SPF records at all. Even worse, spammers were the very first ones to add SPF records for their bulk domains, so an SPF pass does not mean no spam. A second important reason is that SPF adds a considerable load to servers -- much more than using a blacklist: DNSBLs have predictable and quite fast response times, while checking SPF entries at a variety of name servers may frequently cause timeouts, blocking server threads for tens of seconds, especially with faked domain names and dynamic IPs.

Some e-mail providers also add a "Received-SPF:" header line when storing incoming mails into their POP3 mailboxes. Possible words behind Received-SPF are pass, fail, softfail, neutral, none, unknown, or error. The recipient's mail program could then use rules to filter e-mails based on this line, but it is hard to decide whether you should give "Received-SPF: pass" a negative or positive spam score: Spammers often have SPF DNS records, while most legitimate senders do not. "Fail" and "softfail" can be used for adding points to the spam score; all other results should be ignored.
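
A mail client rule along these lines might be sketched as follows, using Python's standard email module. The header is looked up under its registered name, Received-SPF. The point values are arbitrary assumptions for the example; only the rule that fail and softfail add points and every other result is ignored comes from the text.

```python
from email import message_from_string

# Hypothetical spam score points per SPF result; all other results add nothing.
SPF_POINTS = {"fail": 50, "softfail": 25}


def spf_points(raw_message):
    """Return the spam score points derived from the Received-SPF header."""
    msg = message_from_string(raw_message)
    words = (msg.get("Received-SPF") or "").split()
    result = words[0].lower() if words else ""
    return SPF_POINTS.get(result, 0)
```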

Comparing From and HELO strings for big sites may be an alternative, so that e.g. the HELO must end with yahoo.com if the From address does the same. A good question is whether one should accept e-mails e.g. from Yahoo when the HELO contains yahoo.com but the From address is SRS-encoded and only ends with yahoo.com so that it looks as if it was forwarded via Yahoo. We think that this should not be treated as legitimate e-mail since the checksum in the SRS string cannot be validated by the receiver (only the sender knows his private algorithm) and thus is potentially faked.

Stopping backscatter

For some accounts, backscatter is an even bigger problem than spam. Imagine a spammer or a virus abuses your e-mail address for its faked "From". The result will be that many mail servers answer with a return mail (typically from Mailer-Daemon) to the faked sender - your address. This is called backscatter. SPF/SRS tries to reduce this but it does not help much in real life.

Something similar to SRS but less common is BATV (Bounce Address Tag Validation). It adds a 9- to 11-digit key to the localpart of the sender address and "prvs=" (private signature) in front of it. The first digit is a key number, the next three are a day count since 1970 modulo 1000, followed by 6 hex digits calculated from the local address part using a private algorithm. Unfortunately, standardization of BATV was not very successful: Some implementations put the key first and the local part behind it, some do it vice versa. Two samples:
prvs=info=12312A46F3(at)example.com
prvs=12312A46F3=info(at)example.com

To make things totally weird, some use a slash instead of the second equal sign. Others use btv1 instead of prvs and two equal signs instead of one as a delimiter.
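
Because of these variations, a receiver that merely wants to recover the plain address has to accept several layouts. The sketch below handles the forms mentioned above (tag first or last, slash or one or two equal signs as delimiters, prvs or btv1 as keyword). The function name is invented, and the tag pattern is one interpretation of the 9 to 11 character key: four decimal digits followed by five to seven hex digits. The signature itself is not verified, since its algorithm is private to the sender.

```python
import re

_TAG = r"[0-9]{4}[0-9A-Fa-f]{5,7}"  # key digit + 3-digit day count + hex signature
_SEP = r"(?:==?|/)"                 # one or two equal signs, or a slash
_TAG_FIRST = re.compile(rf"^(?:prvs|btv1)==?({_TAG}){_SEP}(.+)$", re.IGNORECASE)
_TAG_LAST = re.compile(rf"^(?:prvs|btv1)==?(.+?){_SEP}({_TAG})$", re.IGNORECASE)


def strip_batv(address):
    """Return (plain_address, tag), or (address, None) if no BATV tag is found."""
    local, at, domain = address.rpartition("@")
    m = _TAG_FIRST.match(local)
    if m:
        return m.group(2) + at + domain, m.group(1)
    m = _TAG_LAST.match(local)
    if m:
        return m.group(1) + at + domain, m.group(2)
    return address, None
```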

However, it is a very bad idea to use BATV for checking the validity of mails with an empty MAIL FROM in the SMTP dialogue (which results in an empty return path in the mail header): If these mails are only accepted if the recipient is a valid BATV string as used when sending mails from a BATV-enabled server, not only backscatter will be rejected but also entirely legal other mails with an empty return path, e.g. for the deletion of unread messages after 30 days. Typically the BATV time stamp is no longer valid then, so these notifications will be blocked unintentionally. So if using an anti-spam appliance offering this feature it should be switched off.

In practice, there are basically three working approaches to detect and delete backscatter. All of them require scanning the text of return mails:

Shamrock NetMail implements all three methods. The last one also makes it possible to reduce the spam score of known senders, since spam is less likely to come from senders to whom local users have sent e-mails than from unknown sources. It is obvious, however, that mail servers should avoid sending backscatter whenever possible. For instance, denying an e-mail to a non-existing address on SMTP level is much better than sending a return mail to a potentially faked address later.

Sad but true: Spam works!

It is a fact that spam works, even if only a tiny fraction of it makes its way through filters. The reason is that there are always some curious people out there who click on the links in spam mails or even buy advertised products - not many, but still enough to make it a lucrative business.

Unfortunately there is no FUSSP (Final Ultimate Solution for the Spam Problem), and there never will be. We see an ongoing war between large-scale commercial spammers and the development of advanced anti-spam filtering techniques. Legal measures have no effect in foreign countries. So we will stay busy fighting spam for many years. The main reason is that there are still enough dumb users out there who actually buy products from spammers!

But anyway, following just one simple rule avoids large amounts of unwanted e-mails:
Never ever post an e-mail address as plain text or link on a web page.
This includes guestbooks, forums, private and corporate web pages. Harvested mail addresses from web pages account for most of today's spam. If you need to publish an e-mail address for some reason (instead of using an e-mail form, for instance), use a graphic image to display it (or at least a part of it, e.g. the @ symbol). Never write an e-mail address as text or even as a mailto link into a web page: It would be scanned by harvesters within days and used by spammers for years.

When you still receive spam, it is highly recommended to use a Whois service on the Internet to find out the responsible person (abuse contact) for the IP address in question. This IP address is typically found in the first Received header line from the top. Note that this is the only reliable address in mails; all other things, including the domain name of the sender, can be completely faked.


9/2011 Herwig Feichtinger, Shamrock Software GmbH