INTRODUCTION |
|
xvii | |
PART I AN INTRODUCTION TO SPAM FILTERING |
|
|
|
3 | (22) |
|
|
4 | (1) |
|
|
4 | (3) |
|
|
7 | (10) |
|
|
7 | (2) |
|
|
9 | (1) |
|
|
10 | (3) |
|
|
13 | (1) |
|
Jeff Slaton, the "Spam King" |
|
|
14 | (1) |
|
|
15 | (1) |
|
Sanford Wallace, Cyber Promotions |
|
|
15 | (1) |
|
Floodgate: The First Spamware |
|
|
16 | (1) |
|
Other Significant Events in 1995 |
|
|
16 | (1) |
|
|
17 | (2) |
|
|
17 | (2) |
|
Unsolicited Commercial Email |
|
|
19 | (1) |
|
|
19 | (4) |
|
1998, 1999, and 2000: Three Years of War on Spam |
|
|
20 | (2) |
|
|
22 | (1) |
|
2001 to the Present: Exponential Spam Growth |
|
|
22 | (1) |
|
|
23 | (2) |
|
2 HISTORICAL APPROACHES TO FIGHTING SPAM |
|
|
25 | (20) |
|
Primitive Language Analysis |
|
|
26 | (1) |
|
|
27 | (2) |
|
Propagation and Maintenance Problems |
|
|
28 | (1) |
|
|
29 | (3) |
|
|
29 | (1) |
|
|
30 | (1) |
|
Drawbacks to Heuristic Filtering |
|
|
31 | (1) |
|
|
31 | (1) |
|
|
32 | (1) |
|
|
32 | (2) |
|
|
32 | (1) |
|
|
33 | (1) |
|
|
34 | (1) |
|
Problems with Challenge/Response |
|
|
34 | (1) |
|
|
35 | (2) |
|
|
35 | (1) |
|
|
36 | (1) |
|
|
37 | (1) |
|
|
38 | (1) |
|
|
39 | (2) |
|
|
39 | (1) |
|
|
40 | (1) |
|
|
41 | (3) |
|
|
43 | (1) |
|
|
44 | (1) |
|
|
44 | (1) |
|
3 LANGUAGE CLASSIFICATION CONCEPTS |
|
|
45 | (18) |
|
|
46 | (1) |
|
|
46 | (1) |
|
|
47 | (1) |
|
Using Language Classification to Fight Spam |
|
|
47 | (2) |
|
|
48 | (1) |
|
Statistical Filtering and Bayesian Analysis |
|
|
49 | (1) |
|
Components of a Language Classifier |
|
|
49 | (5) |
|
|
50 | (1) |
|
|
51 | (2) |
|
|
53 | (1) |
|
|
54 | (1) |
|
|
55 | (3) |
|
|
55 | (1) |
|
|
56 | (1) |
|
|
56 | (1) |
|
Train-Until-No-Errors (TUNE) |
|
|
57 | (1) |
|
|
57 | (1) |
|
An Example of a Filter Instance |
|
|
58 | (2) |
|
Step 1: Tokenize the Message |
|
|
58 | (1) |
|
Step 2: Build a Decision Matrix |
|
|
59 | (1) |
|
Step 3: Evaluate the Decision Matrix |
|
|
59 | (1) |
|
Step 4: Train the Message |
|
|
60 | (1) |
|
|
60 | (1) |
|
Efficacy of Statistical Filtering |
|
|
60 | (1) |
|
The Future of Language Classification |
|
|
61 | (1) |
|
The Sovereignty of Statistical Filtering |
|
|
61 | (1) |
|
|
62 | (1) |
|
4 STATISTICAL FILTERING FUNDAMENTALS |
|
|
63 | (24) |
|
|
64 | (1) |
|
Building a Historical Dataset |
|
|
65 | (7) |
|
|
65 | (1) |
|
|
66 | (1) |
|
|
67 | (1) |
|
The Tokenizer and Calculating Token Values |
|
|
68 | (2) |
|
|
70 | (1) |
|
|
71 | (1) |
|
|
71 | (1) |
|
|
72 | (1) |
|
|
72 | (2) |
|
|
73 | (1) |
|
|
74 | (6) |
|
Bayesian Combination (Paul Graham) |
|
|
75 | (1) |
|
Bayesian Combination (Brian Burton) |
|
|
76 | (2) |
|
Robinson's Geometric Mean Test |
|
|
78 | (1) |
|
Fisher-Robinson's Inverse Chi-Square |
|
|
79 | (1) |
|
Improvements to Statistical Analysis |
|
|
80 | (3) |
|
Improving the Decision Matrix |
|
|
80 | (1) |
|
Improvements to Tokenization |
|
|
81 | (1) |
|
|
81 | (1) |
|
|
82 | (1) |
|
|
83 | (1) |
|
|
83 | (4) |
PART II FUNDAMENTALS OF STATISTICAL FILTERING |
|
|
5 DECODING: UNCOMBOBULATING MESSAGES |
|
|
87 | (10) |
|
|
88 | (1) |
|
|
88 | (1) |
|
|
89 | (3) |
|
Quoted-Printable Encoding |
|
|
91 | (1) |
|
|
91 | (1) |
|
|
92 | (1) |
|
|
92 | (1) |
|
|
93 | (1) |
|
|
94 | (1) |
|
|
95 | (1) |
|
|
95 | (2) |
|
6 TOKENIZATION: THE BUILDING BLOCKS OF SPAM |
|
|
97 | (14) |
|
Tokenizing a Heuristic Function |
|
|
98 | (1) |
|
|
98 | (1) |
|
|
99 | (1) |
|
|
100 | (1) |
|
|
101 | (1) |
|
|
101 | (1) |
|
|
102 | (1) |
|
|
103 | (1) |
|
|
104 | (1) |
|
|
105 | (2) |
|
|
107 | (1) |
|
Sparse Binary Polynomial Hashing |
|
|
108 | (1) |
|
|
108 | (1) |
|
|
109 | (2) |
|
7 THE LOW-DOWN DIRTY TRICKS OF SPAMMERS |
|
|
111 | (30) |
|
|
112 | (1) |
|
|
112 | (1) |
|
A Weak Link in Statistical Filters? |
|
|
113 | (1) |
|
|
113 | (12) |
|
|
114 | (1) |
|
|
114 | (1) |
|
|
115 | (2) |
|
|
117 | (2) |
|
|
119 | (2) |
|
|
121 | (2) |
|
|
123 | (1) |
|
|
124 | (1) |
|
|
124 | (1) |
|
|
125 | (7) |
|
|
126 | (1) |
|
|
127 | (3) |
|
Empty but Not Empty Probes |
|
|
130 | (2) |
|
Attacks on the Decision Matrix |
|
|
132 | (7) |
|
|
132 | (2) |
|
|
134 | (1) |
|
|
135 | (2) |
|
|
137 | (2) |
|
|
139 | (2) |
|
8 DATA STORAGE FOR A ZILLION RECORDS |
|
|
141 | (16) |
|
|
142 | (3) |
|
|
142 | (1) |
|
|
142 | (1) |
|
|
143 | (1) |
|
|
143 | (1) |
|
|
143 | (1) |
|
|
143 | (1) |
|
|
144 | (1) |
|
|
144 | (1) |
|
|
144 | (1) |
|
|
145 | (2) |
|
Third-Party Storage Solutions |
|
|
147 | (8) |
|
Stateless Database Implementations |
|
|
147 | (2) |
|
Stateful SQL-Based Solutions |
|
|
149 | (2) |
|
Peter Graf's PBL ISAM Library |
|
|
151 | (2) |
|
|
153 | (2) |
|
Proprietary Implementations |
|
|
155 | (1) |
|
|
155 | (2) |
|
9 SCALING IN LARGE ENVIRONMENTS |
|
|
157 | |
|
|
158 | (9) |
|
Total Disk Space Requirements |
|
|
159 | (2) |
|
|
161 | (3) |
|
Parallelization versus Serialization |
|
|
164 | (1) |
|
Operating System Requirements |
|
|
164 | (1) |
|
|
165 | (1) |
|
I/O Bandwidth Requirements |
|
|
166 | (1) |
|
|
166 | (1) |
|
|
167 | (1) |
|
|
167 | (3) |
|
General Resource Planning |
|
|
168 | (1) |
|
Assessing Resource Utilization |
|
|
169 | (1) |
|
Building a Distributed Model |
|
|
170 | (4) |
|
Round-Robin Distributed Networking |
|
|
170 | (2) |
|
Distributed BGP Networking |
|
|
172 | (2) |
|
|
174 | (3) |
PART III ADVANCED CONCEPTS OF STATISTICAL FILTERING |
|
|
|
177 | (20) |
|
|
178 | (3) |
|
|
178 | (1) |
|
|
179 | (1) |
|
|
180 | (1) |
|
|
181 | (1) |
|
Corrective Training Delay |
|
|
181 | (1) |
|
|
181 | (1) |
|
Measuring the Accuracy of a Specific Filter |
|
|
182 | (3) |
|
|
182 | (1) |
|
|
183 | (2) |
|
Measuring Adaptation in Chaotic Environments |
|
|
185 | (2) |
|
|
185 | (1) |
|
|
186 | (1) |
|
Testing the Effectiveness of Multiple Filters |
|
|
187 | (4) |
|
|
188 | (1) |
|
|
189 | (2) |
|
Comparing Features in a Single Filter |
|
|
191 | (2) |
|
|
191 | (1) |
|
|
192 | (1) |
|
|
193 | (2) |
|
|
193 | (1) |
|
|
194 | (1) |
|
|
194 | (1) |
|
|
195 | (1) |
|
|
195 | (2) |
|
11 CONCEPT IDENTIFICATION: ADVANCED TOKENIZATION |
|
|
197 | (18) |
|
|
198 | (9) |
|
|
199 | (1) |
|
|
200 | (1) |
|
|
201 | (1) |
|
|
202 | (1) |
|
|
203 | (1) |
|
|
204 | (1) |
|
|
205 | (1) |
|
|
206 | (1) |
|
|
207 | (1) |
|
Sparse Binary Polynomial Hashing |
|
|
207 | (3) |
|
|
209 | (1) |
|
|
210 | (1) |
|
|
210 | (3) |
|
|
213 | (2) |
|
12 FIFTH-ORDER MARKOVIAN DISCRIMINATION |
|
|
215 | (12) |
|
|
216 | (2) |
|
Hidden Markov Models (HMMs) |
|
|
218 | (1) |
|
Using Markov Models to Model Text |
|
|
219 | (3) |
|
Classic Bayesian Spam Filter |
|
|
219 | (3) |
|
Bayesian versus Markovian Classification |
|
|
222 | (3) |
|
|
225 | (1) |
|
|
226 | (1) |
|
Floating-Point Renormalization and Underflow |
|
|
226 | (1) |
|
|
226 | (1) |
|
13 INTELLIGENT FEATURE SET REDUCTION |
|
|
227 | (14) |
|
|
228 | (3) |
|
Bayesian Noise Reduction (BNR) |
|
|
231 | (9) |
|
|
232 | (1) |
|
|
233 | (1) |
|
|
234 | (2) |
|
|
236 | (3) |
|
|
239 | (1) |
|
|
239 | (1) |
|
|
240 | (1) |
|
14 COLLABORATIVE ALGORITHMS |
|
|
241 | (16) |
|
|
242 | (5) |
|
|
246 | (1) |
|
|
246 | (1) |
|
|
247 | (1) |
|
Collaborative Neural Meshes |
|
|
248 | (2) |
|
|
249 | (1) |
|
Machine-Automated Blacklists |
|
|
250 | (2) |
|
Streamlined Blackhole List |
|
|
251 | (1) |
|
Weighted Private Block List |
|
|
252 | (1) |
|
|
252 | (1) |
|
|
252 | (1) |
|
|
253 | (1) |
|
|
253 | (1) |
|
|
253 | (2) |
|
|
255 | (1) |
|
|
256 | (1) |
|
|
256 | (1) |
APPENDIX SHINING EXAMPLES OF FILTERING |
|
257 | (18) |
|
|
258 | (3) |
|
|
258 | (1) |
|
|
259 | (1) |
|
Interview with the Author |
|
|
260 | (1) |
|
SpamProbe: A Modified Approach |
|
|
261 | (3) |
|
|
261 | (1) |
|
|
262 | (1) |
|
Interview with the Author |
|
|
262 | (2) |
|
TarProxy: IANA Spam Filter |
|
|
264 | (2) |
|
|
264 | (1) |
|
|
264 | (1) |
|
Interview with the Author |
|
|
265 | (1) |
|
DSPAM: A Large-Scale Filter |
|
|
266 | (4) |
|
|
266 | (1) |
|
|
267 | (1) |
|
Interview with the Author |
|
|
268 | (2) |
|
|
270 | (5) |
|
|
270 | (1) |
|
|
271 | (1) |
|
|
272 | (1) |
|
Interview with the Author |
|
|
272 | (3) |
INDEX |
|
275 | |