Copyright Notice: This article is Copyright 2005 by Gioffre
Consulting. It may not be reproduced, copied, distributed, or posted on
other web sites without the express written permission of Gioffre Consulting.
Are you interested in Advanced GroupWise Administration
training?
Are you interested in implementing Anti-SPAM solutions?
Do you like a fun and interesting training environment?
Then join us for our 5th Annual Novell Training Cruise!
You get top-notch technical training and a trip to the
Caribbean!
Full Details are available at
http://www.gioffre.com/upcomingtrainingcruises.html
SPAM: The Problem and How to Combat It
by Frank Gioffre
17-Sep-2005
About the Author:
Frank Gioffre is an independent consultant who runs Gioffre Consulting. He
plans, configures, designs, and troubleshoots systems for clients throughout the
U.S. Frank focuses on Novell products (NetWare, eDirectory, ZENworks, and
especially GroupWise). His certifications include Master CNE, Master CNI, and
CDE.
The Problem
SPAM has quickly become the number one problem facing
network administrators today. The amount of SPAM received continues to grow
exponentially while federal and state governments try to come up with a
legislative solution to the issue. The CAN-SPAM Act of 2003 has been
ineffective, as the nature of SPAM and the global Internet infrastructure make
enforcement of these laws very difficult. The other problem with the CAN-SPAM
Act is that it does not prohibit SPAM itself; it prohibits only the use of
misleading tactics, inaccurate subject lines, unauthorized relaying, and fake
source addresses. This means that even if we achieved 100% compliance with this
law, you still could and would receive SPAM. The full text of the law is
published as the Controlling the Assault of Non-Solicited Pornography and
Marketing Act of 2003 (Public Law 108-187, December 16, 2003).
In November 2004, the nation’s first felony SPAM conviction
was handed down in Virginia. This is a promising result, but appeals are
underway and we do not yet know how this will affect further litigation. Once
again, even successful litigation will not substantially curb the flow of SPAM
because:
- For every conviction, there are thousands of spammers who are not caught.
- Spammers can set up shop outside of the United States.
- So-called "legitimate" marketing E-Mail is not affected by this law.
The estimates of SPAM volume in the United States vary
greatly, but no one disputes that it is a major problem. Recent studies show
that as much as 90% of all E-Mail messages traversing the Internet are
classified as SPAM. This means that not only is SPAM a nuisance, but that
it also impedes the functionality of your E-Mail systems and costs money to
handle. Space is used on E-Mail servers, processor time is consumed
handling the great volume of SPAM, administrator time is taken up dealing with
SPAM problems, and employee performance is eroded by sorting and deleting
unwanted messages.
This article is generic in nature, meaning that it covers
the SPAM problem without regard for any particular E-Mail system. However,
I deal mostly with GroupWise and I will mention some specifics as to how these
issues pertain directly to GroupWise.
The Solution
Legislation is not an effective deterrent, and manual sorting of messages is
impractical. The solution must therefore be found in automated sorting,
classification, and deletion of SPAM through the use of E-Mail scanning
software.
This sort of software has been in existence for several years now, with
different packages using various methods to classify messages as either SPAM or
legitimate E-Mail. The concept behind this software is simple:
electronically read the message and check for key words and characteristics that
uniquely identify a message as SPAM. The implementation of this task,
however, is anything but simple. Spammers continually modify the content
and format of their SPAM messages in order to avoid detection by these software
packages. The end result is a never-ending cat-and-mouse game between the
software vendors and the spammers.
The current generation of craftily designed SPAM messages uses obfuscation
techniques such as:
- Random text added to messages to evade keyword scanners.
- Spaces and other characters mixed into words to evade keyword scanners.
- False header and source information to evade server blacklists.
- Messages containing nothing but a URL link to the spammer's web site.
- Messages that are 100% graphics, with no keywords to scan.
Hopefully we hold the upper hand since no matter how crafty the spammer is,
all SPAM has one thing in common. That common thread is that all SPAM is
trying to either sell you something or get you to visit a web site for some
other reason. Therefore, we are starting to see more scanning techniques
focused not on where the message is coming from, nor what the content is, but
rather where the message is trying to lead you.
The most effective anti-SPAM software packages must use a combination of
techniques to classify SPAM. Furthermore, the scoring algorithms must be
stringent enough to catch a high percentage of SPAM without a high percentage of
false positives. False positives are legitimate E-Mail messages (otherwise
known as HAM) that are incorrectly classified as SPAM. This is a major
problem, as discarding legitimate E-Mail messages can cause great damage to any
company. The software implementation must walk a fine line in order to
properly classify all messages, and generally it should lean towards allowing
more SPAM rather than discarding HAM.
The last consideration is that of tailoring the classification engine to your
needs. One man's SPAM may be another man's HAM (never thought I'd see that
sentence in print), so the software must have the ability to be fine-tuned based
on the company or even the individual involved. As an example, one big
SPAM category is that of "low mortgage rates". You may not be in the
market for a new home so you want all messages with keywords like "mortgage" and
"low rates" discarded. However, if you are a loan officer at a bank then
that rule would obviously be bad for business.
The Implementation
Anti-SPAM software has made great advances over the past few years.
Current classification techniques allow for very high identification rates with
correspondingly low false positive rates. Let's look at how most Anti-SPAM
software classifies SPAM and then the various techniques used in the
classification process.
Rating the Messages
As good as any software is at classifying SPAM, there are no absolute ratings
of SPAM versus HAM. Rather, we use a scale to give each message a SPAM
score or SPAM rating. The process involves 3 distinct steps:
- Step 1: Apply the Rating Rules and Determine the SPAM Score
The score assigned to each message is determined by cycling through all the
various rating methods and rules applied by the software. Each "hit"
against a particular rule adds an incremental amount to the total score.
The cumulative total of all the rule hits gives us the total SPAM score for
that message.
- Step 2: Compare SPAM Score to Cut Score
Once the score is determined, it is compared against a cut score. That
cut score is usually set by the system administrator and determines which
messages are classified as SPAM. Any message with a SPAM score above
the cut score is determined to be SPAM.
- Step 3: Process the SPAM
Once a message has been determined to be SPAM, it must be processed
accordingly. There are many different techniques that can be followed.
One technique is to insert a unique keyword (like S-P-A-M) in the subject
line so that the client E-Mail software can process the message with a rule
(delete or move). Other possibilities include moving the message to a
quarantine area, forwarding it to a special SPAM account, or deleting it
altogether. More advanced software can perform different actions based
on multiple cut scores.
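The three steps above can be sketched in a few lines of Python. This is only an illustration of the scoring idea, not the logic of any particular product: the rule names, the point values, and the second (delete) cut score are all made up for the example. The 3.4 default is the cut score quoted for my own system in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str                       # hypothetical rule name
    points: float                   # amount added to the score on a hit
    matches: Callable[[str], bool]  # test applied to the message text


def spam_score(message: str, rules: List[Rule]) -> Tuple[float, List[str]]:
    """Step 1: apply every rule and add up the points of the rules that hit."""
    hits = [rule for rule in rules if rule.matches(message)]
    return round(sum(rule.points for rule in hits), 2), [rule.name for rule in hits]


def choose_action(score: float, cut_score: float = 3.4,
                  delete_score: float = 10.0) -> str:
    """Steps 2 and 3: compare against the cut scores and pick an action."""
    if score > delete_score:
        return "delete"
    if score > cut_score:
        return "tag-subject"   # e.g. insert S-P-A-M in the subject line
    return "deliver"


# Illustrative rules only; real packages ship hundreds of tuned rules.
RULES = [
    Rule("MENTIONS_AIRFARE", 0.5, lambda m: "airfare" in m.lower()),
    Rule("HAS_UNSUBSCRIBE_LINK", 1.5, lambda m: "unsubscribe" in m.lower()),
    Rule("PRICE_IN_TEXT", 1.0, lambda m: "$" in m),
    Rule("SHOUTING", 1.0, lambda m: m.isupper()),
]

client_note = "What would the airfare be for you to visit us?"
bulk_offer = "LOW AIRFARE FROM $199! UNSUBSCRIBE HERE"

print(spam_score(client_note, RULES), choose_action(spam_score(client_note, RULES)[0]))
print(spam_score(bulk_offer, RULES), choose_action(spam_score(bulk_offer, RULES)[0]))
```

The client's question trips only one low-value rule and is delivered, while the bulk offer trips all four and crosses the cut score. No single rule is large enough to do that on its own.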
The benefit of this method is that no one rule can force a message to be
rated as SPAM. It takes many SPAM-like traits in combination to classify a
message as SPAM. This is how you avoid false positives. As an
example, let's say I received a message from a client asking me what the airfare
would be for me to visit them. I don't want the word "airfare" by itself
to trigger the SPAM alert. However, I do want the unsolicited "low
airfare" messages to go straight to the bit-bucket. Here's a real-life
sample of just such a message that was properly classified as SPAM:
FROM: Travel Flea Market <TravelFleaMarket@brightstuff.com>
SUBJECT: Vacation w/Air 199, Cruise 5 Nts 399 & More Inside
MESSAGE:
-------------------------------------------------------------------------
brightstuff.com
You are subscribed with the email address omitted@gioffre.com. If you wish
to be excluded from future offers, please use the link below:
http://web.web.brightstuff.com/r/alqtkukeauuavhmiv2eprvvb or send an email
to:
unsubscribe-alqtkukeauuavhmiv2eprvvb@brightstuff.com
Email us support@brightstuff.com
-------------------------------------------------------------------------
The rest of the message was all HTML. Note, however, that there are links
in the message (and also in the HTML) that point back to where they wanted me to
go. The SPAM score for this message was 6.42, as determined by the
accumulation of these hits:
BAYES_20: (-1.95)
HTML_IMAGE_ONLY_20: (0.45)
RCVD_IN_BL_SPAMCOP_NET: (1.22)
HTML_TEXT_AFTER_BODY: (0.06)
MULTI_REMOVAL_1WORD: (0.80)
HTML_FONT_FACE_BAD: (0.04)
URIBL_OB_SURBL: (3.21)
URIBL_SBL: (1.00)
URIBL_WS_SURBL: (1.46)
RCVD_IN_SBL: (0.11)
HTML_TEXT_AFTER_HTML: (0.03)
My system cut score is set at 3.4 so this message is therefore classified as
SPAM.
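The arithmetic behind that classification is easy to check. The short script below simply adds up the hits listed above and compares the total to the 3.4 cut score. Note that the rounded per-rule values add up to 6.43, one hundredth above the reported 6.42, presumably because the individual values are shown rounded.

```python
# Rule hits exactly as listed above for the sample message.
hits = {
    "BAYES_20": -1.95,
    "HTML_IMAGE_ONLY_20": 0.45,
    "RCVD_IN_BL_SPAMCOP_NET": 1.22,
    "HTML_TEXT_AFTER_BODY": 0.06,
    "MULTI_REMOVAL_1WORD": 0.80,
    "HTML_FONT_FACE_BAD": 0.04,
    "URIBL_OB_SURBL": 3.21,
    "URIBL_SBL": 1.00,
    "URIBL_WS_SURBL": 1.46,
    "RCVD_IN_SBL": 0.11,
    "HTML_TEXT_AFTER_HTML": 0.03,
}
CUT_SCORE = 3.4

total = sum(hits.values())
# Contribution of the rules that look at the URIs inside the message.
uri_total = sum(points for name, points in hits.items() if name.startswith("URIBL_"))

print(f"total score:        {total:.2f}")
print(f"URI rules alone:    {uri_total:.2f}")
print(f"without URI rules:  {total - uri_total:.2f}")
print("classified as SPAM:", total > CUT_SCORE)
```

Two things stand out. The Bayesian rule actually voted against SPAM here (a negative score), and the three URI-based rules account for most of the total. Without them, the remaining hits would not have reached the cut score.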
Rating Methods
It all comes down to the rules and methods that are used to score the
messages. Let's take a look at some of the various methods available.
Keyword/Phrase Matching
With this rating method, the software compares all the words in a message
against a list of "bad" words in its database, with each hit causing the SPAM
score to increase. This is one of the simplest, oldest, and least
effective methods of rating SPAM. Today's sophisticated spammers know
which keywords to avoid. Furthermore, keyword matching alone can also lead
to many false positives. In GroupWise, you can accomplish basic keyword
matching by using rules to match on content and automatically delete the
messages.
Keyword matching can also include FROM address matching. This is the
basic functionality that is built into your GroupWise client Junk and Block
lists. While this may be effective in blocking "nuisance" E-Mail from real
people, it does little to block true SPAM.
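To see why keyword matching is both easy to build and easy to beat, here is a minimal sketch. The phrase list and weights are invented for illustration and do not come from any product or from GroupWise rules.

```python
import re

# Hypothetical "bad" phrases and the points each one adds to the score.
BAD_PHRASES = {"mortgage": 1.0, "low rates": 1.5, "airfare": 0.5}


def keyword_score(text: str, phrases=BAD_PHRASES) -> float:
    """Add up the weight of every listed phrase found in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())  # collapse runs of whitespace
    return sum(weight for phrase, weight in phrases.items() if phrase in normalized)


plain_spam = "Refinance now: LOW   rates on your mortgage"
obfuscated_spam = "Refinance now: l0w rates on your m o r t g a g e"
real_mail = "Your mortgage closing is scheduled for Friday"

for message in (plain_spam, obfuscated_spam, real_mail):
    print(keyword_score(message), message)
```

The plain SPAM scores 2.5, the obfuscated version of the same pitch scores 0, and the legitimate message from a lender still picks up a point. That is the whole problem with this method in three lines of output.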
Bayesian Filtering
Bayesian filtering is an advanced form of keyword matching. It uses
more sophisticated algorithms that check for good words (from HAM) as well as
bad words (from SPAM), and then calculates a probability score from 0 to 1.
A score of zero indicates that the message has a 0% probability of being SPAM
while a score of 1 indicates a 100% probability of being SPAM. In real
life, you will see that Bayesian scoring typically leads to messages that are
either scored very low (in the 0% to 20% range) or very high (in the 80% to 100%
range) with not much in between.
The key to Bayesian filtering is a good base of keywords to compare against.
This keyword base is called the corpus. Typically you will have a SPAM
corpus and a HAM corpus. Most software packages that use Bayesian
filtering will provide you with a base corpus as well as the ability to add your
own keywords to the corpus. More advanced packages will build the corpus
automatically by letting you submit messages that you personally classify as
SPAM or HAM.
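The sketch below shows the general idea of building a SPAM corpus and a HAM corpus from classified messages and turning them into a probability between 0 and 1. It is a simplified naive Bayes calculation with add-one smoothing, chosen for brevity; it is not the formula used by any specific package, and the tiny corpora are invented for the example.

```python
import math
from collections import Counter
from typing import Iterable


def train(messages: Iterable[str]) -> Counter:
    """Count, for each word, how many messages in the corpus contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(message.lower().split()))
    return counts


def spam_probability(message: str, spam_counts: Counter, ham_counts: Counter,
                     n_spam: int, n_ham: int) -> float:
    """Combine per-word evidence into a probability from 0 (HAM) to 1 (SPAM)."""
    log_odds = math.log(n_spam / n_ham)  # prior odds from the corpus sizes
    for word in set(message.lower().split()):
        # Add-one smoothing keeps unseen words from producing a zero.
        p_word_in_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_in_ham = (ham_counts[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_word_in_spam / p_word_in_ham)
    return 1 / (1 + math.exp(-log_odds))


# Toy corpora standing in for messages a user has classified by hand.
SPAM_CORPUS = ["low mortgage rates apply now", "cheap airfare click now",
               "low rates click here now"]
HAM_CORPUS = ["meeting agenda for monday", "what is the airfare to visit",
              "project status report for monday"]

spam_counts, ham_counts = train(SPAM_CORPUS), train(HAM_CORPUS)


def rate(message: str) -> float:
    return spam_probability(message, spam_counts, ham_counts,
                            len(SPAM_CORPUS), len(HAM_CORPUS))


print(round(rate("low rates click now"), 3))
print(round(rate("agenda for monday meeting"), 3))
```

Even with corpora this small, the two test messages land at opposite ends of the scale, which matches the "very low or very high" pattern described above. Submitting more classified messages to `train` is what improves the ratings over time.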
For some further reading on Bayesian filtering and SPAM topics, please visit
Paul Graham's web site at
http://www.paulgraham.com/antispam.html
Blacklists
A blacklist is simply a list of source IP addresses. Your system will
automatically refuse any E-Mail messages originating from any server on your
blacklist. This is a simple and fairly ineffective method of preventing
SPAM since the sophisticated spammers will not keep sending from the same IP
address for very long. In GroupWise, your GWIA (GroupWise Internet Agent)
fully supports setting up your own blacklist.
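Conceptually, a blacklist check is nothing more than a membership test on the connecting server's IP address. The sketch below uses Python's standard ipaddress module and reserved documentation addresses. It illustrates the idea only and says nothing about how the GWIA stores or evaluates its own list.

```python
import ipaddress

# Hypothetical blacklist: one whole network and one single host.
BLACKLIST = [ipaddress.ip_network(entry)
             for entry in ("192.0.2.0/24", "198.51.100.17/32")]


def is_blacklisted(source_ip: str, blacklist=BLACKLIST) -> bool:
    """Return True if the connecting IP falls inside any blacklisted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in blacklist)


for ip in ("192.0.2.45", "198.51.100.17", "198.51.100.18"):
    print(ip, "refused" if is_blacklisted(ip) else "accepted")
```

The weakness is visible in the last line of output: a spammer who moves one address over is accepted again until someone updates the list.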
Realtime Blackhole List (RBL)
An RBL is a list of open relay servers that are being exploited by spammers
for the purpose of sending out great amounts of SPAM. The really nice
feature of RBLs is that you (the system administrator) do not need to keep an RBL
up to date yourself. There are RBL services that your GroupWise system can
use to look up and block open relay sources.
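RBL services are normally queried through DNS: the octets of the sending server's IP address are reversed, the RBL's zone name is appended, and the resulting name resolves only if the address is listed. The sketch below shows that lookup. The zone name rbl.example.org is a placeholder rather than a real service, and the resolver is passed in as a parameter so the logic can be tried without touching the network.

```python
import ipaddress
import socket


def rbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """An address counts as listed when the query name resolves at all."""
    try:
        resolve(rbl_query_name(ip, zone))
        return True
    except OSError:
        # Name not found (or the lookup failed): treat as not listed.
        return False


# Stand-in resolver for demonstration; a real check would use the default.
def fake_resolver(name: str) -> str:
    if name == "2.0.0.127.rbl.example.org":
        return "127.0.0.2"
    raise socket.gaierror("name not found")


print(rbl_query_name("127.0.0.2", "rbl.example.org"))
print(is_listed("127.0.0.2", "rbl.example.org", resolve=fake_resolver))
print(is_listed("192.0.2.1", "rbl.example.org", resolve=fake_resolver))
```

Because the list lives on someone else's DNS servers, nothing has to be maintained locally. The flip side is that you are trusting whoever runs the zone.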
Over the years a few problems have popped up with RBLs. First, there is
the issue of rejecting legitimate E-Mail messages from a non-spammer because
their server has been placed on an RBL. Generally this means that the
server is being used for open relay, but the real problem is the process of
getting a server off the RBL.
There is an old adage that says "Absolute power corrupts absolutely".
This has been proven to be true with some public RBLs available on the Internet.
The power that came with the ability to actually cripple any company by putting
them on the RBL led to the demise of some very well known RBLs. The main
issue is that some RBLs were so self-righteous that they made the problem worse
by refusing to remove servers from the RBL even after they were fixed.
Other RBL services added servers to their RBL simply because they did not like
the way that company did business, making it more a form of censorship.
RBLs today are less popular and much less effective than they were 4 or 5
years ago. It is estimated that only 2% of SPAM can be blocked by using
RBLs alone. Your GWIA can support RBLs by simply specifying the RBL
servers.
Reverse Address Lookup
This method looks at the E-Mail header of the message arriving at your
server. The E-Mail header provides a source IP address. The FROM
address is parsed to extract the source domain. A DNS (Domain Name
System) reverse lookup is performed on that domain name to yield the registered
IP address for that domain's mail exchange (MX record) host. The reverse
lookup is then compared to the header IP address and if the two do not match,
the message is considered suspect.
Some E-Mail systems (including GroupWise) have the capability to reject
inbound messages based solely on this one criterion. At first it sounds
like a great method since spammers practically never use legal or valid means to
send their messages. However, if you turn on this feature, you will end up
tossing large numbers of legitimate messages. With today's mobile
workforce, there are many legitimate reasons why reverse lookup will not work.
As an example let's say that the fictitious person Patrick Star has a work
E-Mail address of pstar@kkrab.com. He
is working from home and using his cable modem service provider's E-Mail server.
He sets up his E-Mail system at home to show a FROM address of
pstar@kkrab.com. When you receive his
E-Mail message and perform a reverse lookup on the E-Mail address, it will not
match the source address since the message was not sent from the office.
This method is so troublesome that I would recommend NEVER turning it on.
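The Patrick Star example can be reproduced with a small sketch of the check as described in this section: take the domain from the FROM address, find the IP addresses of that domain's mail hosts, and flag the message if the connecting IP is not among them. Python's standard library has no MX lookup, so the lookup is passed in as a function. The lookup table and the IP addresses below are invented for the example.

```python
from email.utils import parseaddr
from typing import Callable, Set


def from_domain(from_header: str) -> str:
    """Extract the domain part of the FROM address."""
    _, address = parseaddr(from_header)
    return address.rpartition("@")[2].lower()


def is_suspect(from_header: str, source_ip: str,
               mx_ips_for: Callable[[str], Set[str]]) -> bool:
    """Suspect when the sending IP is not one of the domain's mail hosts."""
    return source_ip not in mx_ips_for(from_domain(from_header))


# Hypothetical lookup table standing in for real DNS answers.
KNOWN_MAIL_HOSTS = {"kkrab.com": {"192.0.2.10"}}


def fake_mx_lookup(domain: str) -> Set[str]:
    return KNOWN_MAIL_HOSTS.get(domain, set())


sender = "Patrick Star <pstar@kkrab.com>"
print("sent from the office:", is_suspect(sender, "192.0.2.10", fake_mx_lookup))
print("sent from home ISP:  ", is_suspect(sender, "203.0.113.77", fake_mx_lookup))
```

The message Patrick sends from home is flagged even though it is perfectly legitimate, which is exactly why rejecting on this test alone throws away real mail.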
Rogue Methods
Certain Internet Service Providers have come up with other methods of
blocking SPAM for their members. While I truly understand the magnitude of
the problem, some of these providers are going way beyond what I would call
sensible. Some of these overly stringent policies are disrupting important
business and personal correspondence. The major culprit in my opinion is
AOL (America Online).
AOL uses many other procedures to filter (so-called) SPAM. The problem
is that the filters are overly strict, almost random in design, and not
disclosed. I can only venture to guess what the filters are by trial and
error.
One method I have seen used is to block any E-Mail messages that have more
than a set number (xx) of recipients. The value of xx depends on the ISP.
While it is true that spammers do send messages to large distribution lists,
there are legitimate uses for large distribution lists. What if you need to send out
an invitation to a conference and it needs to go to 350 people? Is that
SPAM?
SPAM URI Realtime Blacklists (SURBL)
This is the most powerful new technique, and potentially the one that the
spammers will have the most trouble circumventing. It works so well
because SURBL does not care about the source address of the message.
Instead SURBL searches the message (plain text, HTML code, and graphic links)
for embedded destination URIs (web page addresses). As I mentioned before,
there is one common trait that no spammer can avoid in their messages.
They need to direct you to a web site!
The question comes up as to why this would be any more difficult to
circumvent than the source IP address. Sure, web site addresses can be
changed on a regular basis but that would completely destabilize the entire
sales structure they are trying to promote. The purpose of SPAM is to sell
you something. If the web site address keeps changing, the potential
customers won't be able to find it and thus the SPAM becomes useless as a sales
tool. Imagine seeing a television commercial for a product that can be
ordered only by phone. If that phone number were to change after only 1 or
2 days, then the impact and benefit of the commercial is greatly diminished.
For more information on SURBL implementation, please visit
http://www.surbl.org
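The idea can be sketched in a few lines: pull every web address out of the message body, reduce each one to its domain, and see whether any of those domains is on the list. Real URI blacklists are typically queried over DNS; this sketch uses a local set instead so that it runs on its own. The listed domain is made up, and keeping just the last two labels of the host name is a simplification that would mishandle domains such as example.co.uk.

```python
import re
from urllib.parse import urlsplit

# Matches http/https links in plain text or inside HTML attributes.
URI_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_domains(body: str) -> set:
    """Collect the domains of all web addresses found in the message body."""
    domains = set()
    for uri in URI_PATTERN.findall(body):
        host = urlsplit(uri).hostname
        if host:
            # Simplification: treat the last two labels as the domain.
            domains.add(".".join(host.lower().split(".")[-2:]))
    return domains


def uri_hits(body: str, listed: set) -> set:
    """Return the linked domains that appear on the URI blacklist."""
    return linked_domains(body) & listed


LISTED = {"pills.example"}  # hypothetical blacklisted destination

body = ('Great deals! <a href="http://www.pills.example/buy?id=1">click</a> '
        "or visit http://shop.pills.example/now and http://www.gioffre.com")

print(sorted(linked_domains(body)))
print(sorted(uri_hits(body, LISTED)))
```

Notice that the sender's address, the subject line, and the wording of the pitch never enter into it. The spammer can change all of those freely, but the link still has to lead to the same place.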
The Products
There are many products today that fight SPAM. The product you choose
depends on the type of implementation you desire, the E-Mail system you use, the
cost, and of course the capabilities of the product.
Stand-Alone Solutions
When I discuss stand-alone solutions, I am referring to any solution that is
not a software package and does not integrate directly with your E-Mail system.
There are 2 solutions that fall into this category.
The first is a stand-alone appliance or hardware device. There are many
vendors that sell these devices. The benefits are that they examine the
raw SMTP (Simple Mail Transfer Protocol) and therefore work with any E-Mail
system (GroupWise, SendMail, Lotus Notes, etc.). The biggest
disadvantage is the initial cost of the device itself. I have not reviewed
any of these products.
The second stand-alone solution is a third-party company that does all the
scanning for you. There are quite a few reputable companies in this space.
Just like appliances, they examine the raw SMTP message and work with any E-Mail
system. One of the biggest advantages of a service like this is that it
almost completely eliminates any setup, configuration, and maintenance work for
you. The disadvantage is that the operation is not completely under your
control. Pricing for these services is based on the number of users and
most are quite reasonable.
Software Solutions
Generic Solutions
Software solutions can either stand on their own or integrate tightly with
your particular E-Mail system. The solutions that integrate with your
E-Mail system can typically exploit some of the features of that system in order
to become a more complete and feature-rich solution.
No discussion of Anti-SPAM software would be complete without discussing
SpamAssassin, as it is at the core of many other software solutions.
SpamAssassin is an open source project that is making great strides in the
battle against SPAM. As it is open source, many other software packages
(both open source and commercial) use its code, concepts, and innovations in
order to provide you with a solution that meets all your needs.
For more information on SpamAssassin, including downloads and FAQ listings,
please visit
http://spamassassin.apache.org
GroupWise Solutions
There are some excellent Anti-SPAM solutions designed specifically for
GroupWise. Most notable are GWAVA (http://www.gwava.com/products/gwava_overview.html)
and GeeWhiz (http://www.omni-ts.com/products--gee-whiz-spam-filter-and-anti-virus.html).
Of the available products, I personally use, recommend, sell, and install
GeeWhiz. As of this writing the current available version is 1.4.10, with
v2.0 entering public beta. I have been using the v2.0 beta product for 4
months now and I am quite impressed by its capabilities and features.
GeeWhiz performs Anti-SPAM scanning as well as Anti-Virus scanning (with any
Anti-Virus software package) on both GroupWise and NetMail. Version 2.0
adds platform support so that you can run it on NetWare, Linux, and Windows
servers.
In my tests, GeeWhiz v2.0 has correctly rated and classified over 98% of my
SPAM with a false positive rate of less than 0.0025% (that's less than 1 in
40,000). You may think that even 1 false positive is too many, but
consider that a human being will make errors at a higher percentage rate than
this software.
I have just finished a full review of this product, which is available in the
spring edition (2005) of GroupWise Advisor magazine (http://gwadvisor.com).
Summary
SPAM has truly become the pariah of the E-Mail world. The spammers and
the system administrators have been playing this complex chess game of one
trying to outsmart the other. Legislation, litigation, and other
strong-arm methods have not worked. It is up to the latest breed of
Anti-SPAM software to quash this torrent of useless, time-wasting, annoying, and
often vulgar messages. With the latest methods (including SURBL), it is
our hope that we can finally win this battle, and it looks like we finally can!