Fighting Image Spam With FuzzyOCR And SpamAssassin On Ubuntu 9.10


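This tutorial explains how to scan for image spam with FuzzyOCR on an Ubuntu 9.10 server. FuzzyOCR is a SpamAssassin plugin aimed at unsolicited bulk mail that uses images as its main content carrier. It analyzes the content and properties of images with several methods to distinguish normal mail (ham) from spam. To keep the system load low, FuzzyOCR scans only mails that SpamAssassin has not already classified as spam, which avoids unnecessary work.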



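In this article I use Ubuntu 9.10 as the base system.

I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory is different (for example, with ISPConfig 2 it is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), that is no problem. I will note where you have to change what.

Make sure your SpamAssassin version works with FuzzyOCR. For example, the FuzzyOCR version I install here (fuzzyocr-3.5.1) requires SpamAssassin 3.1.4 or newer.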
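The version requirement can be checked from the command line. The following is a minimal sketch, not part of FuzzyOCR or SpamAssassin: version_at_least is a hypothetical helper name, and it assumes GNU sort with the -V (version sort) option is available.

```shell
#!/bin/sh
# Sketch: compare the installed SpamAssassin version with a minimum.
# version_at_least is a hypothetical helper, not a SpamAssassin tool.

# Succeeds if version $1 is greater than or equal to version $2.
# Assumes GNU sort, which provides -V (version sort).
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

required="3.1.4"   # minimum named in this tutorial for fuzzyocr-3.5.1

if command -v spamassassin >/dev/null 2>&1; then
    # "spamassassin --version" prints a line starting with
    # "SpamAssassin version" followed by the version number.
    installed=$(spamassassin --version |
        sed -n 's/^SpamAssassin version \([0-9][0-9.]*\).*/\1/p' | head -n 1)
    if version_at_least "$installed" "$required"; then
        echo "SpamAssassin $installed is new enough (need $required or newer)"
    else
        echo "SpamAssassin $installed is too old (need $required or newer)"
    fi
else
    echo "spamassassin not found in PATH"
fi
```

If spamassassin is not in your PATH (as with ISPConfig 2), replace the command name with the full path to the executable.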



aptitude install fuzzyocr netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr

This places the FuzzyOCR configuration files in the /etc/mail/spamassassin/ directory.

If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, you can copy the FuzzyOCR configuration files to that directory as follows:

cp /etc/mail/spamassassin/FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/
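Before configuring FuzzyOCR, it can help to confirm that the helper programs from the packages installed above are actually on the PATH. This is only a sketch: missing_helpers is a hypothetical function name, and the list of programs is my assumption about which binaries the packages provide (the netpbm tools, ImageMagick's convert, the three OCR engines, and gifsicle).

```shell
#!/bin/sh
# Sketch: report which helper programs cannot be found in PATH.
# missing_helpers is a hypothetical name used for illustration.
missing_helpers() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Assumed binary names: netpbm tools, ImageMagick's convert,
# the OCR engines, and gifsicle.
missing=$(missing_helpers pnmnorm pnminvert ppmtopgm convert tesseract gocr ocrad gifsicle)

if [ -n "$missing" ]; then
    echo "Not found in PATH:" $missing
else
    echo "All helper programs found"
fi
```

Any name printed as missing usually points to a package that failed to install or to a binary that lives outside the directories in your PATH.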



FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. Almost everything in that file is commented out. We now open the file and make a few modifications:

vi /etc/mail/spamassassin/FuzzyOcr.cf


Set the location of FuzzyOCR's word list:

focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words

/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that ships with FuzzyOCR. You can adjust it to your needs if you like.


Find this section:

# Include additional scanner/preprocessor commands here:
focr_bin_helper pnmnorm, pnminvert,  ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
focr_bin_helper tesseract

and change it so that it reads:

# Include additional scanner/preprocessor commands here:
#focr_bin_helper pnmnorm, pnminvert,  ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
#focr_bin_helper tesseract
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract


Also make sure the following lines are enabled:

# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db

The last four lines enable image hashing.


If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that file. Make sure every path in the configuration file points to the correct directory (i.e. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin).
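Because most of FuzzyOcr.cf is commented out, it is easy to lose track of which settings are active after editing. The sketch below lists the uncommented focr_ lines and then lets SpamAssassin parse the configuration with its --lint option. active_focr_settings is a hypothetical helper name, and the path assumes the default directory used in this tutorial.

```shell
#!/bin/sh
# Sketch: show the active (uncommented) focr_ settings, then lint.
# active_focr_settings is a hypothetical helper name.
active_focr_settings() {
    grep -E '^[[:space:]]*focr_' "$1"
}

conf="/etc/mail/spamassassin/FuzzyOcr.cf"   # adjust if your directory differs

if [ -r "$conf" ]; then
    active_focr_settings "$conf" || echo "no active focr_ settings in $conf"
else
    echo "$conf is not readable"
fi

# "spamassassin --lint" parses the configuration and reports syntax errors.
if command -v spamassassin >/dev/null 2>&1; then
    if spamassassin --lint; then
        echo "configuration parsed without errors"
    else
        echo "lint reported problems, check the messages above"
    fi
fi
```

The settings printed should match the lines shown above, with the paths adjusted if your SpamAssassin directory differs.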

That's it for the FuzzyOCR configuration. Now let's see whether it works as expected.


FuzzyOCR ships with sample image spam mails (in the /usr/share/doc/fuzzyocr/examples/ directory):

ls -l /usr/share/doc/fuzzyocr/examples/


total 156
-rw-r--r-- 1 root root 13633 2008-09-25 22:47 ocr-animated.eml
-rw-r--r-- 1 root root 16108 2008-09-25 22:47 ocr-gif.eml
-rw-r--r-- 1 root root 27506 2008-09-25 22:47 ocr-jpg.eml
-rw-r--r-- 1 root root 27842 2008-09-25 22:47 ocr-multi.eml
-rw-r--r-- 1 root root 24657 2008-09-25 22:47 ocr-obfuscated.eml
-rw-r--r-- 1 root root 18236 2008-09-25 22:47 ocr-png.eml
-rw-r--r-- 1 root root 16113 2008-09-25 22:47 ocr-wrongext.eml
-rw-r--r-- 1 root root  3576 2008-09-25 22:47 README

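We can now feed these emails to SpamAssassin to see whether FuzzyOCR is hooked into SpamAssassin correctly. First find out where your spamassassin executable is. Usually it is in your PATH, which you can check by running

which spamassassin

If it is not in your PATH, you can search for it with

locate spamassassin

If you use ISPConfig 2, spamassassin is located at /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin.

Now run spamassassin like this:

/path/to/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

For example, with ISPConfig 2 the command is:

/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

If spamassassin is in your PATH, you can simply run:

spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null

The output should contain lines like these:
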
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.

As you can see, /usr/share/doc/fuzzyocr/examples/ocr-gif.eml has been classified as spam with a score of 15, so FuzzyOCR is working.
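To repeat the check for all sample messages rather than just ocr-gif.eml, the test can be wrapped in a loop. The following is a sketch under a few assumptions: extract_score is a hypothetical helper that picks the score out of the "Message is spam" debug line shown above, spamassassin is in your PATH, and the samples are in the default directory.

```shell
#!/bin/sh
# Sketch: run every sample through SpamAssassin and print the score
# FuzzyOCR reports. extract_score is a hypothetical helper name.

# Reads debug output on stdin and prints the score from lines such as
#   [10025] info: FuzzyOcr: Message is spam, score = 15.000
extract_score() {
    sed -n 's/.*FuzzyOcr: Message is spam, score = \([0-9.]*\).*/\1/p'
}

samples="/usr/share/doc/fuzzyocr/examples"

if command -v spamassassin >/dev/null 2>&1 && [ -d "$samples" ]; then
    for eml in "$samples"/*.eml; do
        [ -f "$eml" ] || continue
        # SpamAssassin writes debug output to stderr, so send stderr
        # into the pipe and discard the rewritten message on stdout.
        score=$(spamassassin --debug FuzzyOcr < "$eml" 2>&1 >/dev/null | extract_score)
        echo "$(basename "$eml"): ${score:-no FuzzyOcr spam verdict}"
    done
else
    echo "spamassassin or $samples not available"
fi
```

A sample that prints no verdict is not necessarily a failure of the setup; rerun that message by hand with --debug FuzzyOcr and read the full output to see what the scansets found.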





