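Fighting Image Spam with FuzzyOCR and SpamAssassin on Ubuntu 9.10
This tutorial shows how to scan for image spam with FuzzyOCR on an Ubuntu 9.10 server. FuzzyOCR is a plugin for SpamAssassin that targets unsolicited bulk mail whose main content carrier is an image. Using several methods, it analyzes the content and properties of images to distinguish normal mail (ham) from spam. FuzzyOCR tries to keep the system load low by scanning only mails that SpamAssassin has not already classified as spam, which avoids unnecessary work.
I do not guarantee that this will work for you!
1 Preliminary Note
In this article I use Ubuntu 9.10 as the base system.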
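I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory differs (for example, with ISPConfig 2 installed the directory is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), that is no problem. I will point out where you need to change what.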
Please make sure that your SpamAssassin version is suitable for FuzzyOCR. For example, the FuzzyOCR version that I install here (fuzzyocr-3.5.1) requires SpamAssassin 3.1.4 or newer.
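The version requirement can be checked with a small script. This is a minimal sketch, not part of the original tutorial: the helper names version_ge and installed_sa_version are made up for illustration, the comparison relies on GNU sort's -V option, and the parsing assumes that `spamassassin --version` prints a line containing "version X.Y.Z".

```shell
#!/bin/sh
# Minimum SpamAssassin version required by fuzzyocr-3.5.1 (taken from the text).
REQUIRED="3.1.4"

# version_ge A B: succeeds if version A is greater than or equal to version B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# installed_sa_version: prints the version reported by `spamassassin --version`,
# or nothing if spamassassin is not in the PATH.
# Assumption: the output contains a line such as "SpamAssassin version 3.2.5".
installed_sa_version() {
    command -v spamassassin >/dev/null 2>&1 || return 0
    spamassassin --version 2>/dev/null \
        | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

current="$(installed_sa_version)"
if [ -z "$current" ]; then
    echo "spamassassin not found in PATH"
elif version_ge "$current" "$REQUIRED"; then
    echo "SpamAssassin $current is new enough (>= $REQUIRED)"
else
    echo "SpamAssassin $current is too old (need >= $REQUIRED)"
fi
```

If spamassassin is not in your PATH (for example on an ISPConfig 2 system), replace the bare command name in installed_sa_version with the full path to the executable.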
2 Installing FuzzyOCR
FuzzyOCR can be installed as follows:
aptitude install fuzzyocr netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr
This places the FuzzyOCR configuration files in the /etc/mail/spamassassin/ directory.
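After the installation it can be useful to confirm that the helper programs FuzzyOCR calls are actually in the PATH. The following is a rough sketch under some assumptions: missing_helpers is a hypothetical function name, and the list of binaries (tesseract, ocrad, gocr, gifsicle, ImageMagick's convert, and the netpbm tools pnmnorm, pnminvert and ppmtopgm) is my reading of what the packages above provide, so adjust it to your system.

```shell
#!/bin/sh
# missing_helpers NAME...: prints each NAME that cannot be found in the PATH.
# Hypothetical helper for illustration.
missing_helpers() {
    for bin in "$@"; do
        command -v "$bin" >/dev/null 2>&1 || echo "$bin"
    done
}

# Assumed list of helper binaries; adjust it to your setup.
missing="$(missing_helpers tesseract ocrad gocr gifsicle convert \
    pnmnorm pnminvert ppmtopgm)"

if [ -n "$missing" ]; then
    echo "Missing helper programs:"
    echo "$missing"
else
    echo "All helper programs found."
fi
```

A missing name usually means that the corresponding package did not install or that the binary lives outside the directories in your PATH.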
If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, you can copy the FuzzyOCR configuration files to that directory as follows:
cp /etc/mail/spamassassin/FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/
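FuzzyOCR is now installed; next we need to configure it.
3 Configuring FuzzyOCR
The FuzzyOCR configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. Almost everything in that file is commented out. We now open the file and make a few changes: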
vi /etc/mail/spamassassin/FuzzyOcr.cf
Add the following line to define the location of FuzzyOCR's spam word list:
[...]
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
[...]
/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that comes with FuzzyOCR. You can adjust it to your needs if you like.
Next, change
[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
focr_bin_helper tesseract
[...]
to
[...]
# Include additional scanner/preprocessor commands here:
#
#focr_bin_helper pnmnorm, pnminvert, ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
#focr_bin_helper tesseract
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract
[...]
Finally, add/enable the following lines:
[...]
# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
[...]
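The last four lines enable image hashing. This is what the FuzzyOCR developers say about image hashing:
"The image hashing database feature allows the plugin to store a vector of image features in a database, so it knows an image when it arrives a second time (and therefore does not need to scan it again). The special thing about this feature is that it also recognizes the image again if it has been changed slightly (which is what spammers do)."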
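Because almost everything in FuzzyOcr.cf is commented out, it is easy to lose track of which settings are really active after editing. The sketch below lists only the uncommented focr_ lines. The function name active_focr_settings is hypothetical, and the CF variable assumes the default directory used in this tutorial.

```shell
#!/bin/sh
# active_focr_settings FILE: prints the uncommented focr_* lines in FILE.
# Hypothetical helper for illustration.
active_focr_settings() {
    grep -E '^[[:space:]]*focr_' "$1"
}

# Assumed default location; change it if your SpamAssassin directory differs.
CF="/etc/mail/spamassassin/FuzzyOcr.cf"

if [ -r "$CF" ]; then
    active_focr_settings "$CF" || echo "no active focr_ settings in $CF"
else
    echo "$CF not readable; adjust CF to your SpamAssassin directory"
fi
```

With the changes described above, the output should include the focr_global_wordlist line, the combined focr_bin_helper line, and the path and hashing lines. Running `spamassassin --lint` afterwards is a common way to catch syntax errors in the SpamAssassin configuration.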
If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, the FuzzyOCR configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that file. In it, make sure that every path points to the correct directory (i.e. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin).
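Adjusting every path by hand is error-prone, so the substitution can be previewed with sed first. This is only a sketch: rewrite_sa_paths is a hypothetical function name, and it assumes that each path in the configuration follows a space or tab (as in the focr_ lines shown above). It replaces only occurrences preceded by whitespace, so running it on already rewritten text changes nothing.

```shell
#!/bin/sh
OLD_DIR="/etc/mail/spamassassin"
NEW_DIR="/home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin"

# rewrite_sa_paths: reads configuration text on stdin and writes it to stdout,
# replacing OLD_DIR with NEW_DIR wherever OLD_DIR follows a space or tab.
# Hypothetical helper for illustration.
rewrite_sa_paths() {
    sed "s#\([[:space:]]\)${OLD_DIR}#\1${NEW_DIR}#g"
}

# Preview the effect on a single configuration line.
echo "focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words" | rewrite_sa_paths
```

To apply it to a real file, write the result to a new file, compare it with the original, and only then replace the original, keeping a backup copy.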
That's it for the FuzzyOCR configuration. Now let's see whether it works as expected.
4 Testing FuzzyOCR
FuzzyOCR comes with sample image spam mails (in the /usr/share/doc/fuzzyocr/examples/ directory):
ls -l /usr/share/doc/fuzzyocr/examples/
The output should look like this:
total 156
-rw-r--r-- 1 root root 13633 2008-09-25 22:47 ocr-animated.eml
-rw-r--r-- 1 root root 16108 2008-09-25 22:47 ocr-gif.eml
-rw-r--r-- 1 root root 27506 2008-09-25 22:47 ocr-jpg.eml
-rw-r--r-- 1 root root 27842 2008-09-25 22:47 ocr-multi.eml
-rw-r--r-- 1 root root 24657 2008-09-25 22:47 ocr-obfuscated.eml
-rw-r--r-- 1 root root 18236 2008-09-25 22:47 ocr-png.eml
-rw-r--r-- 1 root root 16113 2008-09-25 22:47 ocr-wrongext.eml
-rw-r--r-- 1 root root 3576 2008-09-25 22:47 README
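We can now feed these emails to SpamAssassin to see whether FuzzyOCR is hooked into SpamAssassin correctly. First find out where your spamassassin executable is. Usually it is in your PATH; you can check by running
which spamassassin
If this prints a result, spamassassin is in your PATH and you do not need to specify its full path to run it.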
If you do not know where spamassassin is, you can find it by running
updatedb
locate spamassassin
If you use ISPConfig 2, spamassassin is located at /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin
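The lookup described above can be folded into one script that tries the PATH first and then falls back to the ISPConfig 2 location. This is a minimal sketch; first_executable and the SA_BIN variable are names invented for this example.

```shell
#!/bin/sh
# first_executable PATH...: prints the first argument that is an executable
# file and succeeds; fails if none of the arguments qualifies.
# Hypothetical helper for illustration.
first_executable() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Try the PATH first, then the ISPConfig 2 location mentioned in the text.
SA_BIN="$(first_executable \
    "$(command -v spamassassin 2>/dev/null)" \
    /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin)" \
    || SA_BIN=""

if [ -n "$SA_BIN" ]; then
    echo "Using $SA_BIN"
else
    echo "spamassassin not found; try updatedb and locate spamassassin"
fi
```

Other installation prefixes can be added as further arguments to first_executable.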
Now that you know where spamassassin is, you can feed one of the sample spam mails to it as follows:
/path/to/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
For example:
/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
Or, if spamassassin is in your PATH:
spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
You should now see a lot of output; the end of it should look like this:
[...]
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.
As you can see, /usr/share/doc/fuzzyocr/examples/ocr-gif.eml has been classified as spam with a score of 15, so FuzzyOCR is working.
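To run the same test over all sample mails without reading the full debug output each time, the verdict line can be filtered out. The sketch below assumes that the debug output goes to stderr (which is why the commands above still print it despite the redirect to /dev/null) and that a positive verdict looks like the "Message is spam, score = 15.000" line shown above. fuzzyocr_score, SA_BIN and SAMPLES are names invented for this example.

```shell
#!/bin/sh
# fuzzyocr_score: reads SpamAssassin debug output on stdin and prints the score
# from FuzzyOCR's "Message is spam, score = N" line, if there is one.
# Hypothetical helper for illustration.
fuzzyocr_score() {
    sed -n 's/.*FuzzyOcr: Message is spam, score = \([0-9.]*\).*/\1/p' | head -n 1
}

SA_BIN="spamassassin"   # or the full path to your spamassassin executable
SAMPLES="/usr/share/doc/fuzzyocr/examples"

if command -v "$SA_BIN" >/dev/null 2>&1; then
    for eml in "$SAMPLES"/*.eml; do
        [ -f "$eml" ] || continue
        # Capture stderr (debug output) and discard the message on stdout.
        score="$("$SA_BIN" --debug FuzzyOcr < "$eml" 2>&1 >/dev/null | fuzzyocr_score)"
        echo "$(basename "$eml"): ${score:-no FuzzyOCR spam verdict}"
    done
else
    echo "$SA_BIN not found; set SA_BIN to its full path"
fi
```

A sample that reports no verdict is not necessarily a failure; looking at the full debug output for that file shows what FuzzyOCR did with it.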
Your SpamAssassin can now recognize image spam with the help of FuzzyOCR.
5 Links
- FuzzyOCR: http://www.fuzzyocr.net/
- SpamAssassin: http://spamassassin.apache.org/
- Ubuntu: http://www.ubuntu.com/