Filtering image spam with near-duplicate detection

Zhe Wang, William Josephson, Qin Lv, Moses Charikar, Kai Li

Research output: Contribution to conference › Paper › peer-review

72 Scopus citations

Abstract

A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful in filtering text-based spam emails, image spam is substantially more difficult to detect, as it employs a variety of image creation and randomization algorithms. Spam image creation algorithms are designed to defeat well-known vision algorithms such as optical character recognition (OCR), whereas randomization techniques ensure the uniqueness of each image. We observe that image spam is often sent in batches of visually similar images that differ only due to the application of randomization algorithms. Based on this observation, we propose an image spam detection system that uses near-duplicate detection to detect spam images. We rely on traditional anti-spam methods to detect a subset of spam images and then use multiple image spam filters to detect all the spam images that "look" like the spam caught by traditional methods. We have implemented a prototype system that achieves a high detection rate with a false positive rate of less than 0.001%.
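
The two-stage idea in the abstract (seed a filter with images already caught by traditional methods, then flag anything that "looks" like them) can be illustrated with a minimal sketch. The average hash and Hamming-distance threshold below are stand-ins chosen for brevity and are not the image spam filters of the paper's prototype; the names `average_hash`, `NearDuplicateFilter`, and `max_distance` are hypothetical. Images are assumed to be grayscale pixel grids at least 8 x 8 in size.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # grayscale pixels in 0..255 (assumed input format)


def average_hash(img: Image, size: int = 8) -> int:
    """Reduce an image to size*size block means, then set one bit per block
    that is brighter than the mean of all blocks. Assumes the image is at
    least size x size so that every block contains a pixel."""
    h, w = len(img), len(img[0])
    means: List[float] = []
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(block) / len(block))
    avg = sum(means) / len(means)
    bits = 0
    for m in means:
        bits = (bits << 1) | (1 if m > avg else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


class NearDuplicateFilter:
    """Hypothetical second-stage filter seeded with known spam images."""

    def __init__(self, max_distance: int = 6) -> None:
        self.max_distance = max_distance  # illustrative threshold, not tuned
        self.known: List[int] = []

    def add_known_spam(self, img: Image) -> None:
        # Called for images already caught by a traditional anti-spam method.
        self.known.append(average_hash(img))

    def is_spam(self, img: Image) -> bool:
        h = average_hash(img)
        return any(hamming(h, k) <= self.max_distance for k in self.known)


# Synthetic 16x16 examples: a base image, a lightly perturbed copy that
# mimics randomization, and an unrelated image with a different layout.
base = [[20 if x < 8 else 220 for x in range(16)] for y in range(16)]
variant = [[base[y][x] + (x * 7 + y * 3) % 5 for x in range(16)] for y in range(16)]
other = [[220 if y < 8 else 20 for x in range(16)] for y in range(16)]

spam_filter = NearDuplicateFilter()
spam_filter.add_known_spam(base)
variant_flagged = spam_filter.is_spam(variant)  # perturbed copy of known spam
other_flagged = spam_filter.is_spam(other)      # visually different image
```

A linear scan over stored hashes is enough for a sketch; a system handling large batches would likely need an index for similarity search, and a single coarse hash would not on its own support the false positive rate reported in the abstract.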

Original language: English (US)
State: Published - 2007
Event: 4th Conference on Email and Anti-Spam, CEAS 2007 - Mountain View, CA, United States
Duration: Aug 2 2007 to Aug 3 2007

All Science Journal Classification (ASJC) codes

  • Software
