Over the last few years, the amount of spam and phishing e-mail on the Internet has increased dramatically. Experts say that more than half of all e-mails today are unsolicited bulk e-mails. While future solutions for sender authentication may reduce this, they are neither common nor universally accepted yet. In Shamrock's NetMail, a heuristic combination of filter techniques has proven effective. Interestingly, these techniques work against viruses, worms and Trojans as well.
Figure: Anatomy of four spam attacks. Fortunately, this e-mail server has blocked most of these unwanted e-mails (colored red) based on bogus HELO strings.
When we talk about spam here, our definition is quite simple: Unrequested and unwanted advertising e-mail. At least half of all Internet e-mail is spam. Where does it come from?
Spam e-mails are not only annoying for those who receive them but are also a threat to mail servers. During a heavy spam run, a bot network can eat up nearly all of the available Internet bandwidth, and server resources (CPU time, TCP control blocks etc.) may reach their limits so that legitimate e-mails are blocked.
If you retrieve your mails from a POP3 mailbox at your Internet provider, you should not rely on a spam filter there. Often spam is only tagged with [SPAM:] in the subject line or with "x-spam:" in the mail header, but the criteria for this are mostly undisclosed, unclear and ineffective. Providers are obliged to deliver e-mails, and a single false positive could have legal consequences if a customer complains. So they cannot block misconfigured and misbehaving senders as rigorously as a corporate server can. If you want to get rid of unwanted e-mails, you will have to filter them out in your own e-mail system, using your own preferences.
Unfortunately, filtering spam is not an easy job, and not an exact science. A filter actually deleting most spam e-mails will probably also delete some ham (non-spam) mails, so-called false positives. And a filter which guarantees not to delete any ham will be very ineffective in deleting obvious spam. So what can be done to be as reliable and effective as possible?
In fact there are even more ways to make life more difficult for a spammer, like using a tar pit (Teergrube) inserting a delay of a few seconds after each destination address given in the SMTP dialogue in order to slow down mass mailings. However, in the age of Trojans used for spam distribution, this is not always efficient since the same spam e-mail comes from hundreds and thousands of compromised PCs.
Honeypots are also used by some ISPs: They put faked mail addresses with their own domain on a website and wait until harvesters actually send bulk e-mails to these, then they blacklist the originating addresses. But here we are dealing with receiving less spam instead of more, so we will not discuss honeypots in more detail.
Just as an example, the NetMail software from Shamrock (which is free for up to three users) detects spam e-mails using a combination of the criteria described above, each adding an adjustable percentage to the spam score:
In addition, NetMail has a text filter which allows you to define words or phrases with a spam probability in percent, e.g. 80 % for "Viagra". These phrases are searched for in the mail subject and in the text body. Punctuation characters, line breaks and spaces are ignored when comparing text, so that even v.i:a-g.r/a is found. As with Bayes filters, negative values are allowed for words which are typical for non-spam. The word list typically has fewer than 100 well-selected entries with scores from 25 to 75 percent, so four "bad" words are always enough to cause the mail to be rejected. If some new sort of spam comes out, it is fairly easy to add new words or change the score of list entries. However, since different users get very different e-mails, a filter optimized for one company (selling car parts, for instance) is not necessarily well tuned for another (e.g. offering medication), so the administrator will most probably have to adjust the scores of some words.
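The text comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NetMail's actual implementation: the word list `WORD_SCORES`, the function names and the rule that each phrase counts once per mail are hypothetical.

```python
import re

# Hypothetical word list: phrase -> spam probability in percent.
# Negative values mark words that are typical for non-spam.
WORD_SCORES = {"viagra": 80, "spare parts": -25}

def normalize(text: str) -> str:
    """Lower-case the text and drop punctuation, line breaks and spaces."""
    return re.sub(r"[\W_]+", "", text.lower())

def text_score(subject: str, body: str) -> int:
    """Sum the scores of all listed phrases found in subject or body.

    Assumption: each phrase counts once, no matter how often it occurs.
    """
    # The separating space keeps a phrase from matching across the
    # subject/body boundary (normalized phrases never contain a space).
    haystack = normalize(subject) + " " + normalize(body)
    return sum(score for phrase, score in WORD_SCORES.items()
               if normalize(phrase) in haystack)
```

Because the phrases are normalized in the same way as the mail text, an obfuscated spelling such as v.i:a-g.r/a produces the same match as the plain word.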
The mail is blocked with a return mail (and optionally stored in a "Spam" folder) if the score sum is 100 to 199. No return mail is sent, however, if the sum reaches 200 or more, since such a mail is very unlikely to be a false positive. If one sender causes more than 10 consecutive "user unknown" errors, it is blocked completely. At Shamrock and at many NetMail installations this scheme has proven very effective.
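The thresholds in this paragraph can be expressed as a small decision function plus a counter for "user unknown" errors. The following Python sketch uses hypothetical names (`decide`, `UnknownUserTracker`) and keeps its state in memory; it only illustrates the rules stated above.

```python
from collections import defaultdict

def decide(score_sum: int) -> str:
    """Map a spam score sum to an action, using the thresholds from the text."""
    if score_sum >= 200:
        return "reject_silently"          # no return mail
    if score_sum >= 100:
        return "reject_with_return_mail"  # optionally also store in "Spam"
    return "accept"

class UnknownUserTracker:
    """Counts consecutive "user unknown" errors per sender (hypothetical helper)."""

    def __init__(self, limit: int = 10):
        self.limit = limit
        self.streak = defaultdict(int)

    def record(self, sender: str, user_known: bool) -> bool:
        """Register one delivery attempt; return True if the sender is now blocked."""
        if user_known:
            self.streak[sender] = 0   # a valid recipient ends the streak
        else:
            self.streak[sender] += 1
        return self.streak[sender] > self.limit
```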
If you do not retrieve your mails from a POP3 mailbox but run your own SMTP server for e-mail reception, there are even more efficient ways to filter out unwanted mails. Please see our SMTP article for suggestions on how to run a server with DSL or cable Internet access. For instance, the HELO identification of the sending server and its IP address can only be seen with direct reception, and invalid strings can be filtered:
| The HELO string... | is invalid because... | and should be something like... |
| computer1 | A fully qualified Internet domain is required | computer1.example.com |
| mx_1.example.com | Underscores are invalid in a domain name | mx-1.example.com |
| 82.119.148.246 | IP addresses must be in brackets | [82.119.148.246] |
| [127.0.0.1] | 127.0.0.1 does not match the external IP | Any of the above |
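The four checks in the table could be implemented roughly as follows. This Python sketch is an assumption-laden illustration: the function name `check_helo` and the wording of the reasons are made up for this example, and only IPv4 address literals are handled.

```python
import ipaddress
import re
from typing import Optional

# One domain label: letters, digits and hyphens, no hyphen at either end.
LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")

def _is_ip(text: str) -> bool:
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

def check_helo(helo: str, peer_ip: str) -> Optional[str]:
    """Return a reason if the HELO argument is invalid, otherwise None."""
    if helo.startswith("[") and helo.endswith("]"):
        literal = helo[1:-1]
        if not _is_ip(literal):
            return "malformed address literal"
        if ipaddress.ip_address(literal) != ipaddress.ip_address(peer_ip):
            return "address literal does not match the external IP"
        return None
    if _is_ip(helo):
        return "IP addresses must be in brackets"
    labels = helo.split(".")
    if len(labels) < 2:
        return "a fully qualified Internet domain is required"
    if not all(LABEL.match(label) for label in labels):
        return "invalid character in domain name"
    return None
```

Here `peer_ip` is the address the TCP connection actually comes from, which a server only knows when it receives mail directly.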
Furthermore, the HELO domain should match the From domain for typical large providers. For instance, if the "from" domain contains "yahoo." then the Helo string should also contain this word. The same is true for AOL, Hotmail, GMX, Web.de and others, but take care: If the e-mail address ends with @msn.com, the mail server and thus the Helo string is not MSN but Hotmail. Of course this check works for a few large providers only, but since much spam comes from faked Yahoo or Hotmail senders, it is quite efficient (and, by the way, quite similar to SPF = sender policy framework, but avoiding the DNS overhead). Also note that a typical HELO includes a subdomain, i.e. mail.example.com is used, and example.com without a dot in front of it is kind of suspicious.
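As a rough sketch of this comparison, the following Python function looks for a provider keyword in the From domain and expects a matching keyword in the HELO string. The table `PROVIDER_HELO` is a hypothetical, incomplete example and would have to be maintained by the administrator; the host names in the usage lines are invented.

```python
# Hypothetical table: keyword in the From domain -> keyword expected in HELO.
# msn.com is mapped to Hotmail, as described in the text.
PROVIDER_HELO = {
    "yahoo.": "yahoo.",
    "hotmail.": "hotmail.",
    "msn.com": "hotmail.",
    "aol.": "aol.",
}

def helo_matches_from(helo: str, from_address: str) -> bool:
    """Return False if the From domain names a listed provider but HELO does not."""
    domain = from_address.rsplit("@", 1)[-1].lower()
    for keyword, expected in PROVIDER_HELO.items():
        if keyword in domain:
            return expected in helo.lower()
    return True  # not a listed provider: this check does not apply

print(helo_matches_from("mail.yahoo.com", "someone@yahoo.com"))   # True
print(helo_matches_from("pc1.example.net", "someone@yahoo.com"))  # False
```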
Don't hang up! During a heavy spam attack with 500 e-mails per minute or more, it is better to let the spam bots disconnect; they typically do this after an error to save the additional time for the QUIT command. If the server itself closes the connection, the operating system holds a TCP control block in the TIME_WAIT state for several minutes. This can quickly exhaust server resources, limiting the number of connections per minute to a few hundred.
Is it legal to block an obviously wrong HELO string? RFC 2821, the "bible" for SMTP implementation, written in 2001, states: "The argument field contains the fully-qualified domain name of the SMTP client if one is available. In situations in which the SMTP client system does not have a meaningful domain name (e.g., when its address is dynamically allocated and no reverse mapping record is available), the client should send an address literal." So anything other than the Internet domain or the public IP address in brackets is clearly not RFC-compliant.
Regarding underscores and other illegal characters, RFC 2821 says: "Characters outside the set of alphas, digits, and hyphen must not appear in domain name labels for SMTP clients or servers. In particular, the underscore character is not permitted. SMTP servers that receive a command in which invalid character codes have been employed, and for which there are no other reasons for rejection, MUST reject that command with a 501 response."
However, some warm-hearted people cite RFC 1123, dated 1989, when viruses and spam were totally unknown: "The sender-SMTP must ensure that the <domain> parameter in a HELO command is a valid principal host domain name for the client host. The HELO receiver may verify that the HELO parameter really corresponds to the IP address of the sender. However, the receiver must not refuse to accept a message, even if the sender's HELO command fails verification." One might think that this forces a server to accept any bogus HELO string, but this is completely wrong: the text only says that a message must be accepted if the HELO parameter is syntactically correct but a reverse DNS lookup of it (RDNS, RR) does not return the sender's IP. Though some fussy mail servers do an RDNS lookup, this is not a good idea at all, since many legitimate senders have HELO strings like "pc1.local" or similar.
And we might go one step further and block even literal IP addresses in brackets behind HELO or EHLO. RFC documents as the bible for technical implementation say literal IPs are legal, but spam is war, and other rules may apply: If it is obvious that practically all such senders distribute spam or viruses, a policy blocking them may be acceptable.
Another very efficient way to reject dial-up hosts is to block generic HELO domains like 1.2.3.4.dialup.example.com. Though such domain names are completely RFC-compliant, senders using them are typically Trojans in infected PCs. However, there is a chance of some false positives since most broadband providers refuse to allow true domain names even for static IPs. This is why not only dynamic but also static subscribers should prefer an external SMTP smarthost to send their e-mails.
What's this?
Harvester: A program which uses a search engine to find web pages and then stores all e-mail addresses from them for sending unsolicited bulk mails later.
Honeypot: A server which looks like an open relay, inviting spammers to send their mails over it and then tracks their IP addresses.
Open relay: An SMTP server over which anyone can send e-mails to any destination without authentication, obscuring his identity.
Spam trap: A server whose e-mail address is published on a web page, waiting for mails from harvesters to store their addresses e.g. in a blacklist.
When more and more spam, Trojans, viruses and phishing mails came up,
non-profit organisations like Spamhaus and others created blacklists which
mail servers can query to find out whether a specific IP is a known sender of
such malware or not. Since the DNS query mechanism is used, such a list is
typically called DNSBL (DNS blacklist) or RBL (Real-time
Blackhole List). To query if the IP address 1.2.3.4 is a source of unwanted
e-mails, reverse its byte order and append the hostname of a blacklist, for
example:
4.3.2.1.sbl.spamhaus.org
If the query returns an IP address (typically in the 127.0.0.x range), then
mails from the queried IP address should be rejected or at least
tagged.
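Building the query name and performing the lookup takes only a few lines. The sketch below uses Python's standard `socket` and `ipaddress` modules; the function names are hypothetical, only IPv4 is covered, and any lookup failure (including a timeout) is simply treated as "not listed", which a production server would probably want to handle more carefully.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str = "sbl.spamhaus.org") -> str:
    """Reverse the bytes of an IPv4 address and append the blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str, zone: str = "sbl.spamhaus.org") -> bool:
    """Return True if the blacklist answers with an address for this IP.

    This performs a real DNS query when called.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:  # no such record, or the lookup failed
        return False

print(dnsbl_query_name("1.2.3.4"))  # 4.3.2.1.sbl.spamhaus.org
```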
There are dozens of blacklists today, many being specialised. For instance, pbl.spamhaus.org finds out if an IP belongs to a dynamic address space typical for dial-up lines, sbl.spamhaus.org collects addresses of known spammers, and xbl.spamhaus.org checks for Trojans and open proxies. In many cases, a combined blacklist like zen.spamhaus.org is also available to avoid multiple queries for one address. After the HELO check we mentioned above has filtered out many unwanted e-mails, a good blacklist typically will still delete a few percent of the remaining mails.
Take care, however! While the process of filling blacklists with evil IP addresses is mostly automated using spam trap mailboxes or honeypots, unfortunately there is a chance of false positives. Imagine an Internet provider with thousands of customers. If only one of them is sending out spam or viruses, the provider's IP or even a whole address range (a netblock) may get blacklisted, affecting other customers as well. When the provider wants to get off the blacklist, this may be a long and frustrating process since most lists are controlled by volunteers. While Spamhaus seems to do a pretty good job with very few false positives, others sometimes block legitimate senders. This is why large Internet providers often keep their own non-public blacklists.
Direct SMTP reception also allows greylisting. The technique is simple. For each received e-mail, a triplet is built from the sender's IP, the "from" address and the destination address, and stored in a database. All mails whose triplet has not occurred earlier are blocked temporarily ("451 Try again later" in the SMTP dialogue after RCPT TO). When the sender retries the delivery later on, the triplet already exists and the mail can pass through. Most viruses and many spammers never retry, so greylisting often reduces unwanted mail by 30 to 80 %, depending on which other filters are used in addition. Unfortunately, some buggy mail systems have problems when sending messages to a greylisting server, so they may have to be whitelisted.
Typical server retry times
| Retry after | Servers |
| 0...5 min | 33 % |
| 5...20 min | 33 % |
| 20...60 min | 27 % |
| 1...3 h | 5 % |
| 3...12 h | 2 % |
(Snapshot only, your results may differ!)
A slightly improved version is "light greylisting" which ignores the lowest (last) byte of the IP address when building the triplet. This is useful since large providers like Web.de run server farms for mail delivery with different IP addresses; typically the first three IP bytes will always be the same for all servers in one farm. Unfortunately this is not true for AOL, Amazon or Google Mail; a retry of the same e-mail often comes from a completely different netblock so they should be excluded from greylisting.
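Both variants can be sketched with a small Python class. This is a minimal illustration, not the NetMail implementation: the class name `Greylist` is hypothetical, an in-memory dictionary stands in for the database, and only IPv4 addresses are assumed.

```python
import time

class Greylist:
    """Minimal greylisting sketch; light=True ignores the last byte of the IP."""

    def __init__(self, light: bool = False):
        self.light = light
        self.seen = {}  # triplet -> time of first sighting

    def _triplet(self, ip: str, mail_from: str, rcpt_to: str):
        if self.light:
            ip = ip.rsplit(".", 1)[0]  # keep only the first three bytes
        return (ip, mail_from, rcpt_to)

    def check(self, ip: str, mail_from: str, rcpt_to: str) -> bool:
        """Return True to accept; False means answer "451 Try again later"."""
        key = self._triplet(ip, mail_from, rcpt_to)
        if key in self.seen:
            return True
        self.seen[key] = time.time()
        return False
```

With `light=True`, a retry from a neighbouring server of the same farm (same first three bytes) is recognised as the same sender, while the strict variant would treat it as new and delay it again.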
It is quite clear, however, that greylisting will cause a delay for a new sender. Retry times of mail servers vary widely, from a few seconds to several hours (see table). About two thirds retry within twenty minutes. If you are expecting legitimate e-mails from varying sources and the risk of delays is not acceptable, it might be better not to activate greylisting or to whitelist critical senders. If the senders are mostly the same all the time, it is useful to run an inactive, "learning" greylisting phase before activating it (like the one in the NetMail SMTP server): even while greylisting is turned off, new triplets are saved, so mails from known senders will not be delayed when greylisting is turned on later.
Some large providers like hotmail.com and aol.com whose domain names are
frequently faked and abused by spammers have introduced a sender domain
verification called SPF. For instance, if your server receives an e-mail
from user@hotmail.com,
it can look for a DNS text record for hotmail.com and then check if the sender
IP matches the IP range(s) given in this record. For instance, the record
looks like this for the domain kundenserver.de:
v=spf1 ip4:212.227.126.128/25 ?all
The character before "all" indicates what should be done if the sender does
not match the given IP range: - = fail (reject), ~ = softfail (kind of
suspicious, add something to the spam score), ? = neutral (other IPs might be
legitimate).
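To illustrate how such a record is evaluated, the following Python sketch checks a sender IP against an SPF record string. It is deliberately incomplete: only the `ip4:` and `all` mechanisms are understood, other mechanisms (such as `a`, `mx` or `include`) are skipped, and fetching the TXT record via DNS is left out. The function name `evaluate_spf` is hypothetical.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def evaluate_spf(record: str, sender_ip: str) -> str:
    """Evaluate an SPF record for one sender IP (ip4 and all only)."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an spf1 record
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a prefix means "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        if term.startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return QUALIFIERS[qualifier]
        elif term == "all":
            return QUALIFIERS[qualifier]
        # all other mechanisms are ignored in this sketch
    return "neutral"  # nothing matched

record = "v=spf1 ip4:212.227.126.128/25 ?all"
print(evaluate_spf(record, "212.227.126.130"))  # pass
print(evaluate_spf(record, "192.0.2.1"))        # neutral
```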
SPF lookup in Windows
To find out if a domain has an SPF record or not, simply open a console and type e.g.:
nslookup -q=TXT gmx.net
The displayed result should be something like "v=spf...".
SPF1, the original Sender Policy Framework proposal, typically looks at the envelope sender only (return-path, MAIL FROM in SMTP) which is empty for bounce mails, while SPF2.0 (Microsoft's Sender ID proposal) also looks at the From address in the mail header. While AOL and most other SPF supporters have an spf1 record only, Hotmail offers multiple records with both spf1 and spf2.0.
Unfortunately, kundenserver.de is just a provider hosting thousands of customer domains. So if a customer sends his e-mails through the kundenserver.de smarthost, the kundenserver.de SPF record will not help at all because the sender has a different domain. This is why a Sender Rewriting Scheme (SRS) is used for SPF-enabled SMTP smarthosts (though it does have some drawbacks because it effectively disables SPF checks for the original sender's domain which may be faked).
For instance, if your domain is example.com and you are sending mails via
the smarthost at provider.net, your original sender address is e.g. user@example.com,
but the smarthost rewrites it to something like:
SRS0=HHH=TT=example.com=user@provider.net
with SRS0 = keyword for first rewrite (most typical). In fewer cases two mail
hops are involved, requiring a different sender rewriting scheme SRS1. Sample:
SRS1=HHH=forwarder.net==HHH=TT=example.com=user@provider.net
In both samples, HHH is a base64-coded checksum (hash) with an algorithm only
the provider knows, so he can check the validity of bounces. TT is a
base64-encoded timestamp so that this rewritten address is valid for one month
only. Equal signs are used as delimiters; sometimes a plus or minus sign is
used instead as the first delimiter behind SRS0 or SRS1.
When such a mail bounces, the return mail goes to the server at provider.net
which then should forward it to your own SMTP server or your POP3 mailbox. By
checking the validity of timestamps and hashes, the provider can determine if
this is a legitimate bounce and delete it if not. Unfortunately, this does not
help much against collateral spam (so-called backscatter) caused by e-mails
with faked sender addresses since these are typically sent directly instead of
through a provider smarthost. Consequently, some SMTP servers (such as the one
of Shamrock NetMail) silently decode the SRS addresses to their original form
before processing them further. For instance,
MAIL FROM: SRS0=94/=WX=example.com=user@provider.net
in the SMTP dialogue is decoded to
Return-Path: user@example.com
in the header of the received e-mail to avoid ambiguities in further
processing.
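A decoder for the common SRS0 form might look like the Python sketch below. The function name `decode_srs0` and the regular expression are assumptions made for illustration; the hash and timestamp are discarded without any check, since only the rewriting provider could validate them, and SRS1 addresses are left untouched.

```python
import re

# SRS0, a delimiter (=, + or -), then hash=timestamp=domain=localpart@forwarder
SRS0 = re.compile(r"^SRS0[=+-]([^=]+)=([^=]+)=([^=]+)=(.+)@[^@]+$", re.IGNORECASE)

def decode_srs0(address: str) -> str:
    """Return the original sender of an SRS0 address, or the address unchanged."""
    match = SRS0.match(address)
    if not match:
        return address
    _hash, _timestamp, domain, local = match.groups()
    return f"{local}@{domain}"

print(decode_srs0("SRS0=94/=WX=example.com=user@provider.net"))  # user@example.com
```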
Yet only a few providers check incoming e-mails for SPF validity. One reason is simply that the vast majority of legitimate mail servers do not have SPF records at all. Even worse, spammers were the very first to add SPF records for their bulk domains, so an SPF pass does not mean that a mail is not spam. A second important reason is that SPF adds a considerable load to servers, much more than using a blacklist: DNSBLs have predictable and quite fast response times, while checking SPF entries at a variety of name servers may frequently cause timeouts, blocking server threads for tens of seconds, especially with faked domain names and dynamic IPs.
Some e-mail providers also add a "Received-SPF:" header line when storing incoming mails in their POP3 mailboxes. Possible words after Received-SPF are pass, fail, softfail, neutral, none, unknown, or error. The recipient's mail program could then use rules to filter e-mails based on this line, but it is hard to decide whether "Received-SPF: pass" deserves a negative or a positive spam score: spammers often have SPF DNS records, while most legitimate mail senders do not. "Fail" and "softfail" can be used for adding points to the spam score; all other results should be ignored.
Comparing From and Helo strings for big sites may be an alternative, so that e.g. the Helo must end with yahoo.com if the From address does the same. A good question is if one should accept e-mails e.g. from Yahoo when Helo contains Yahoo.com but the From address is SRS-encoded and only ends with Yahoo.com so that it looks like being forwarded via Yahoo. We think that this should not be treated as legal e-mail since the checksum in the SRS string cannot be validated by the receiver (only the sender knows his private algorithm) and thus is potentially faked.
For some accounts, backscatter is an even bigger problem than spam. Imagine a spammer or a virus abuses your e-mail address for its faked "From". The result will be that many mail servers answer with a return mail (typically from Mailer-Daemon) to the faked sender - your address. This is called backscatter. SPF/SRS tries to reduce this but it does not help much in real life.
Something similar to SRS but less common is BATV (Bounce Address Tag
Validation). It adds a 9- to 11-digit key to the localpart of the sender
address and "prvs=" (private signature) in front of it. The first digit is a
key number, the next three are a day count since 1970 modulo 1000, followed by
6 hex digits calculated from the local address part using a private algorithm.
Unfortunately, standardization of BATV was not very successful: Some
implementations put the key first and the local part behind it, some do it
vice versa. Two samples:
prvs=info=12312A46F3@example.com
prvs=12312A46F3=info@example.com
To make things totally weird, some use a slash instead of the second equal
sign. Others use btv1 instead of prvs and two equal signs instead of one as a
delimiter.
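Because of these variations, a receiver that wants to recover the original address has to try both orders. The Python sketch below handles the two "prvs=" samples and the slash variant; it assumes the 10-character tag seen in the samples (one key digit, three day-count digits, six hex digits) and does not cover the btv1 form. The function name `parse_batv` and the returned dictionary layout are hypothetical.

```python
import re

# key number (1 digit), day count (3 digits), hash (6 hex digits)
TAG = re.compile(r"^(\d)(\d{3})([0-9A-Fa-f]{6})$")

def parse_batv(address: str):
    """Split a prvs= address into its parts, or return None if it is not BATV."""
    local, _, domain = address.rpartition("@")
    if not local.lower().startswith("prvs="):
        return None
    parts = re.split(r"[=/]", local[5:], maxsplit=1)
    if len(parts) != 2:
        return None
    # Try "tag first" and then "mailbox first".
    for tag, mailbox in (parts, parts[::-1]):
        match = TAG.match(tag)
        if match:
            return {
                "address": f"{mailbox}@{domain}",
                "key_number": int(match.group(1)),
                "day": int(match.group(2)),
                "hash": match.group(3),
            }
    return None
```

The hash itself cannot be verified here, since it is calculated with an algorithm private to the sending server.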
However, it is a very bad idea to use BATV for checking the validity of mails with an empty MAIL FROM in the SMTP dialogue (which results in an empty return path in the mail header). If such mails are only accepted when the recipient is a valid BATV string, as used when sending mails from a BATV-enabled server, not only backscatter will be rejected but also entirely legitimate mails with an empty return path, e.g. notifications about the deletion of unread messages after 30 days. Typically the BATV time stamp is no longer valid by then, so these notifications will be blocked unintentionally. So if you use an anti-spam appliance offering this feature, it should be switched off.
In practice, there are basically three working approaches to detect and delete backscatter. All of them require scanning the text of return mails:
Shamrock NetMail implements all three methods. The last one also makes it possible to reduce the spam score of known senders, since spam is less likely to come from addresses to which local users have sent e-mails than from unknown sources. It is obvious, however, that mail servers should avoid sending backscatter whenever possible. For instance, rejecting an e-mail to a non-existing address at the SMTP level is much better than sending a return mail to a potentially faked address later.
It is a fact that spam works, even if only a tiny fraction of it makes its way through filters. The reason is that there are always some curious people out there who click on the links in spam mails or even buy advertised products - not many, but still enough to make it a lucrative business.
Unfortunately there is no FUSSP (Final Ultimate Solution for the Spam Problem), and there never will be. We see an ongoing war between large-scale commercial spammers and the development of advanced anti-spam filtering techniques. Legal measures have no effect in foreign countries. So we will stay busy fighting spam for many years. The main reason is that there are still enough dumb users out there who actually buy products from spammers!
But anyway, following just one simple rule avoids large amounts of unwanted
e-mails:
Never ever post an e-mail address as plain text or link on a web page.
This includes guestbooks, forums, private and corporate web pages. Harvested
mail addresses from web pages account for most of today's spam. If you need to
publish an e-mail address for some reason (instead of using an e-mail form,
for instance), use a graphic image to display it (or at least a part of it,
e.g. the @ symbol). Never write an e-mail address as text or even as a mailto
link into a web page: It would be scanned by harvesters within days and used by
spammers for years.
If you still receive spam, it is highly recommended to use a Whois service on the Internet to find the responsible person (abuse contact) for the IP address in question. This IP address is typically found in the topmost Received header line. Note that this is the only reliable address in a mail; everything else, including the domain name of the sender, can be completely faked.
© 9/2011 Herwig Feichtinger, Shamrock Software GmbH