
Ending Spam [Paperback]

3.70/5 (60 ratings by Goodreads)
  • Format: Paperback / softback, 312 pages, height x width: 234x178 mm
  • Publication date: 07-May-2005
  • Publisher: No Starch Press,US
  • ISBN-10: 1593270526
  • ISBN-13: 9781593270520
Explains how spam works, how network administrators can implement spam filters, and how programmers can develop new, remarkably accurate filters using language classification and machine learning. Original. (Advanced)

Zdziarski, who maintains a spam filter that can achieve a level of accuracy up to 99.985 percent, leads the charge against what has become a very significant challenge to both productivity and sanity by explaining how it was spawned, its anatomy and physiognomy, how it crawled into our pristine PCs, and how, as responsible and gleeful citizens, we can torch it. He covers how early spam wars were fought and the lessons learned, language classification concepts and statistical filtering fundamentals, and advanced concepts such as testing theory, concept identification, Markovian discrimination, intelligent feature set reductions, and collaborative algorithms. The final chapter includes what Zdziarski terms shining examples of filtering. Annotation ©2005 Book News, Inc., Portland, OR (booknews.com)

Fascinating reading for any geek, this landmark title describes, in depth, how statistical filtering is being used by next-generation spam filters to identify and filter spam. Join author Jonathan Zdziarski for a look inside the brilliant minds that have conceived clever new ways to fight spam in all its nefarious forms.

The book also explains how spam filtering works and how language classification and machine learning combine to produce remarkably accurate spam filters. After reading Ending Spam, you'll have a complete understanding of the mathematical approaches used by today's spam filters, as well as decoding, tokenization, various algorithms (including Bayesian analysis and Markovian discrimination), and the benefits of using open-source solutions to end spam.

Zdziarski interviewed creators of many of the best spam filters and has included their insights in this revealing examination of the anti-spam crusade. If you're a programmer designing a new spam filter, a network admin implementing a spam-filtering solution, or just someone who's curious about how spam filters work and the tactics spammers use to evade them, Ending Spam will serve as an informative analysis of the war against spammers.

Table of contents:
Introduction
PART I: An Introduction to Spam Filtering
Chapter 1: The History of Spam
Chapter 2: Historical Approaches to Fighting Spam
Chapter 3: Language Classification Concepts
Chapter 4: Statistical Filtering Fundamentals
PART II: Fundamentals of Statistical Filtering
Chapter 5: Decoding: Uncombobulating Messages
Chapter 6: Tokenization: The Building Blocks of Spam
Chapter 7: The Low-Down Dirty Tricks of Spammers
Chapter 8: Data Storage for a Zillion Records
Chapter 9: Scaling in Large Environments
PART III: Advanced Concepts of Statistical Filtering
Chapter 10: Testing Theory
Chapter 11: Concept Identification: Advanced Tokenization
Chapter 12: Fifth-Order Markovian Discrimination
Chapter 13: Intelligent Feature Set Reduction
Chapter 14: Collaborative Algorithms
Appendix: Shining Examples of Filtering
Index

Detailed contents:
INTRODUCTION xvii
PART I AN INTRODUCTION TO SPAM FILTERING
1 THE HISTORY OF SPAM 3(22)
The Definition of Spam 4(1)
The Very First Spam 4(3)
Spam: The Early Years 7(10)
Jay-Jay's College Fund 7(2)
The Jesus Spam 9(1)
Canter & Siegel 10(3)
Cancelmoose 13(1)
Jeff Slaton, the "Spam King" 14(1)
"Krazy" Kevin Lipsitz 15(1)
Sanford Wallace, Cyber Promotions 15(1)
Floodgate: The First Spamware 16(1)
Other Significant Events in 1995 16(1)
War Waged on Spam 17(2)
Spamhaus 17(2)
Unsolicited Commercial Email 19(1)
Spam Out of Control 19(4)
1998, 1999, and 2000: Three Years of War on Spam 20(2)
Network Solutions 22(1)
2001 to the Present: Exponential Spam Growth 22(1)
Final Thoughts 23(2)
2 HISTORICAL APPROACHES TO FIGHTING SPAM 25(20)
Primitive Language Analysis 26(1)
Blacklisting 27(2)
Propagation and Maintenance Problems 28(1)
Heuristic Filtering 29(3)
Brightmail 29(1)
SpamAssassin 30(1)
Drawbacks to Heuristic Filtering 31(1)
Maintenance Headaches 31(1)
Scoring 32(1)
Whitelisting 32(2)
A Little Too Effective 32(1)
Forgeries 33(1)
Challenge/Response 34(1)
Problems with Challenge/Response 34(1)
Throttling 35(2)
TarProxy 35(1)
Other Throttling Tools 36(1)
Collaborative Filtering 37(1)
Address Obfuscation 38(1)
New Standards 39(2)
Authenticated SMTP 39(1)
Sender Policy Framework 40(1)
Litigation 41(3)
Spammer Fingerprinting 43(1)
Intellectual Property 44(1)
Final Thoughts 44(1)
3 LANGUAGE CLASSIFICATION CONCEPTS 45(18)
Understanding Accuracy 46(1)
Machine Learning 46(1)
Concept Learning 47(1)
Using Language Classification to Fight Spam 47(2)
Training 48(1)
Statistical Filtering and Bayesian Analysis 49(1)
Components of a Language Classifier 49(5)
The Historical Dataset 50(1)
The Tokenizer 51(2)
The Analysis Engine 53(1)
Providing Feedback 54(1)
Training 55(3)
Train-Everything (TEFT) 55(1)
Train-on-Error (TOE) 56(1)
Train-Until-Mature (TUM) 56(1)
Train-Until-No-Errors (TUNE) 57(1)
When to Train 57(1)
An Example of a Filter Instance 58(2)
Step 1: Tokenize the Message 58(1)
Step 2: Build a Decision Matrix 59(1)
Step 3: Evaluate the Decision Matrix 59(1)
Step 4: Train the Message 60(1)
Step 5: Correct Errors 60(1)
Efficacy of Statistical Filtering 60(1)
The Future of Language Classification 61(1)
The Sovereignty of Statistical Filtering 61(1)
Final Thoughts 62(1)
4 STATISTICAL FILTERING FUNDAMENTALS 63(24)
An Imperfect Solution 64(1)
Building a Historical Dataset 65(7)
Corpus Feeding 65(1)
Starting from Scratch 66(1)
Correcting Errors 67(1)
The Tokenizer and Calculating Token Values 68(2)
Single-Corpus Tokens 70(1)
A Biased Filter 71(1)
Hapaxes 71(1)
Final Product 72(1)
The Analysis Engine 72(2)
Sorting 73(1)
Statistical Combination 74(6)
Bayesian Combination (Paul Graham) 75(1)
Bayesian Combination (Brian Burton) 76(2)
Robinson's Geometric Mean Test 78(1)
Fisher-Robinson's Inverse Chi-Square 79(1)
Improvements to Statistical Analysis 80(3)
Improving the Decision Matrix 80(1)
Improvements to Tokenization 81(1)
Statistical Sedation 81(1)
Iterative Training 82(1)
Learning New Tricks 83(1)
Final Thoughts 83(4)
PART II FUNDAMENTALS OF STATISTICAL FILTERING
5 DECODING: UNCOMBOBULATING MESSAGES 87(10)
Introduction to Encoding 88(1)
Decoding 88(1)
Message Body Encodings 89(3)
Quoted-Printable Encoding 91(1)
Base64 Encoding 91(1)
Custom Encodings 92(1)
Message Header Encodings 92(1)
HTML Encodings 93(1)
Message Actualization 94(1)
Supporting Software 95(1)
Final Thoughts 95(2)
6 TOKENIZATION: THE BUILDING BLOCKS OF SPAM 97(14)
Tokenizing a Heuristic Function 98(1)
Basic Delimiters 98(1)
Redundancy 99(1)
Other Delimiters 100(1)
Exceptions 101(1)
Token Reassembly 101(1)
Degeneration 102(1)
Header Optimizations 103(1)
URL Optimizations 104(1)
HTML Tokenization 105(2)
Word Pairs 107(1)
Sparse Binary Polynomial Hashing 108(1)
Internationalization 108(1)
Final Thoughts 109(2)
7 THE LOW-DOWN DIRTY TRICKS OF SPAMMERS 111(30)
Successful Filtering 112(1)
No More Headaches 112(1)
A Weak Link in Statistical Filters? 113(1)
Attacks on Tokenizers 113(12)
Encoding Abuses 114(1)
Header Encodings 114(1)
Hypertextus Interruptus 115(2)
ASCII Spam 117(2)
Text-Splitting 119(2)
Table-Based Obfuscation 121(2)
URL Encodings 123(1)
Symbolic Text 124(1)
Just Plain Dumb 124(1)
Attacks on the Dataset 125(7)
Mailing List Attacks 126(1)
Bayesian Poisoning 127(3)
Empty but Not Empty Probes 130(2)
Attacks on the Decision Matrix 132(7)
Image Spams 132(2)
Random Strings of Text 134(1)
Word Salad 135(2)
Directed Attacks 137(2)
Final Thoughts 139(2)
8 DATA STORAGE FOR A ZILLION RECORDS 141(16)
Storage Considerations 142(3)
Disk Space 142(1)
Speed 142(1)
Locking 143(1)
Portability 143(1)
Statefulness 143(1)
Recovery 143(1)
I/O Contention 144(1)
Random-Access Features 144(1)
Ease of Use 144(1)
Storage Framework 145(2)
Third-Party Storage Solutions 147(8)
Stateless Database Implementations 147(2)
Stateful SQL-Based Solutions 149(2)
Peter Graf's PBL ISAM Library 151(2)
SQLite 153(2)
Proprietary Implementations 155(1)
Final Thoughts 155(2)
9 SCALING IN LARGE ENVIRONMENTS 157
Requirements Assessment 158(9)
Total Disk Space Requirements 159(2)
Total Processing Power 161(3)
Parallelization versus Serialization 164(1)
Operating System Requirements 164(1)
High Availability 165(1)
I/O Bandwidth Requirements 166(1)
Features 166(1)
End-User Support 167(1)
Sizing Machine Capacity 167(3)
General Resource Planning 168(1)
Assessing Resource Utilization 169(1)
Building a Distributed Model 170(4)
Round-Robin Distributed Networking 170(2)
Distributed BGP Networking 172(2)
Final Thoughts 174(3)
PART III ADVANCED CONCEPTS OF STATISTICAL FILTERING
10 TESTING THEORY 177(20)
The Challenge of Testing 178(3)
Message Continuity 178(1)
Archive Window 179(1)
Purge Simulation 180(1)
Interleave 181(1)
Corrective Training Delay 181(1)
Types of Simulations 181(1)
Measuring the Accuracy of a Specific Filter 182(3)
Test Criteria 182(1)
Performing the Test 183(2)
Measuring Adaptation in Chaotic Environments 185(2)
Test Criteria 185(1)
Performing the Test 186(1)
Testing the Effectiveness of Multiple Filters 187(4)
Test Criteria 188(1)
Performing the Test 189(2)
Comparing Features in a Single Filter 191(2)
Test Criteria 191(1)
Performing the Test 192(1)
Testing Caveats 193(2)
Corrective Training 193(1)
Purge Simulations 194(1)
Test Messages 194(1)
Presuppositions 195(1)
Final Thoughts 195(2)
11 CONCEPT IDENTIFICATION: ADVANCED TOKENIZATION 197(18)
Chained Tokens 198(9)
Case Study Analysis 199(1)
Pattern Identification 200(1)
Differentiation 201(1)
HTML Classification 202(1)
Contextual Analysis 203(1)
Other Uses 204(1)
Administrative Concerns 205(1)
Supporting Data 206(1)
Summary 207(1)
Sparse Binary Polynomial Hashing 207(3)
Supporting Data 209(1)
Summary 210(1)
Karnaugh Mapping 210(3)
Final Thoughts 213(2)
12 FIFTH-ORDER MARKOVIAN DISCRIMINATION 215(12)
Markov's Great Advance 216(2)
Hidden Markov Models (HMMs) 218(1)
Using Markov Models to Model Text 219(3)
Classic Bayesian Spam Filter 219(3)
Bayesian versus Markovian Classification 222(3)
Storage Concerns 225(1)
Purging Old Data 226(1)
Floating-Point Renormalization and Underflow 226(1)
Final Thoughts 226(1)
13 INTELLIGENT FEATURE SET REDUCTION 227(14)
Calibration Algorithms 228(3)
Bayesian Noise Reduction (BNR) 231(9)
Instantiation Phase 232(1)
Training Phase 233(1)
Dubbing Phase 234(2)
Examples 236(3)
End Result 239(1)
Efficacy 239(1)
Final Thoughts 240(1)
14 COLLABORATIVE ALGORITHMS 241(16)
Message Inoculation 242(5)
Supporting Data 246(1)
External Inoculation 246(1)
Classification Groups 247(1)
Collaborative Neural Meshes 248(2)
Neural Declustering 249(1)
Machine-Automated Blacklists 250(2)
Streamlined Blackhole List 251(1)
Weighted Private Block List 252(1)
Distributed Attacks 252(1)
Filters That Fight Back 252(1)
Fingerprinting 253(1)
Probing 253(1)
Automatic Whitelisting 253(2)
URL Blacklisting 255(1)
Minefields 256(1)
Final Thoughts 256(1)
APPENDIX SHINING EXAMPLES OF FILTERING 257(18)
POPFile: The POP3 Proxy 258(3)
About POPFile 258(1)
Accuracy 259(1)
Interview with the Author 260(1)
SpamProbe: A Modified Approach 261(3)
About SpamProbe 261(1)
Accuracy 262(1)
Interview with the Author 262(2)
TarProxy: IANA Spam Filter 264(2)
About TarProxy 264(1)
Accuracy 264(1)
Interview with the Author 265(1)
DSPAM: A Large-Scale Filter 266(4)
About DSPAM 266(1)
Accuracy 267(1)
Interview with the Author 268(2)
The CRM114 Discriminator 270(5)
About CRM114 270(1)
Under the Hood 271(1)
Accuracy 272(1)
Interview with the Author 272(3)
INDEX 275
Jonathan A. Zdziarski has been fighting spam for eight years and has spent a significant portion of the past two years working on DSPAM, a next-generation spam filter with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.