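Fighting Image Spam with FuzzyOCR and SpamAssassin on Fedora 12
This tutorial shows how to scan for image spam with FuzzyOCR on a Fedora 12 server. FuzzyOCR is a SpamAssassin plugin aimed at unsolicited bulk mail that uses images as its main content carrier. It analyzes the content and properties of images with several methods to distinguish legitimate mail (ham) from spam. To keep the system load low, FuzzyOCR scans only mail that SpamAssassin has not already classified as spam, which avoids unnecessary work.
I do not guarantee that this will work for you!
1 Preliminary Note
In this article I use Fedora 12 as the base system.
I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory differs (with ISPConfig 2, for example, it is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), that is not a problem; I will point out what to change and where.
Make sure that your SpamAssassin version is compatible with FuzzyOCR. For example, the FuzzyOCR version installed here (fuzzyocr-3.5.1-devel.tar.gz) requires SpamAssassin 3.1.4 or newer.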
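The version requirement can be checked before going any further. The following is a minimal sketch, not part of the original setup: the helper name version_ge is made up for illustration, it relies on "sort -V" from GNU coreutils, and it assumes that the first line printed by "spamassassin --version" has the form "SpamAssassin version X.Y.Z".

```shell
#!/bin/sh
# Minimum SpamAssassin version needed by fuzzyocr-3.5.1-devel (see text above).
REQUIRED="3.1.4"

# version_ge A B (hypothetical helper): succeeds when version A >= version B.
# Uses "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v spamassassin >/dev/null 2>&1; then
    # Assumption: the first output line looks like "SpamAssassin version X.Y.Z".
    installed=$(spamassassin --version |
        sed -n 's/^SpamAssassin version \([0-9.]*\).*/\1/p' | head -n 1)
    if [ -z "$installed" ]; then
        echo "could not determine the SpamAssassin version"
    elif version_ge "$installed" "$REQUIRED"; then
        echo "SpamAssassin $installed is new enough"
    else
        echo "SpamAssassin $installed is older than $REQUIRED"
    fi
else
    echo "spamassassin not found in PATH"
fi
```

If spamassassin is not in your PATH (as with ISPConfig 2), call it with its full path instead.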
2 Installing the Prerequisites for FuzzyOCR
FuzzyOCR has a few prerequisites, such as ocrad and gocr, which we can install like this:
yum install netpbm gifsicle giflib giflib-utils gocr ocrad ImageMagick tesseract perl-String-Approx perl-MLDBM perl-CPAN
We also need the Perl module MLDBM::Sync, which is not available as an RPM package. Open a Perl shell...
perl -MCPAN -e shell
... and install the module as follows:
install MLDBM::Sync
Afterwards, type
q
to leave the Perl shell.
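Before continuing, it can help to confirm that the helper programs and Perl modules are actually available. This is only a sketch: missing_tools is a made-up helper name, and the lists contain just the programs and modules named in this tutorial, so extend them as needed.

```shell
#!/bin/sh
# missing_tools NAME... (hypothetical helper): print each NAME that cannot be
# found in PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Helper programs mentioned in this tutorial; adjust the list to your setup.
missing=$(missing_tools gocr ocrad tesseract gifsicle convert)
if [ -n "$missing" ]; then
    echo "Not found in PATH:"
    echo "$missing"
fi

# Check that each Perl module can be loaded.
for module in String::Approx MLDBM MLDBM::Sync; do
    perl -M"$module" -e 1 2>/dev/null || echo "Perl module not loadable: $module"
done
```

No output means that everything in both lists was found.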
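3 Installing FuzzyOCR
Next we download and install the latest FuzzyOCR devel version from http://fuzzyocr.own-hero.net/wiki/Downloads. We use the devel version instead of the stable one because the FuzzyOCR developers say:
"The current recommendation is the development version, because the stable version lacks features and is very old."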
cd /usr/src/
wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
Then we unpack FuzzyOCR and move all FuzzyOcr* files and the FuzzyOcr directory (they are all in the FuzzyOcr-3.5.1/ directory) to /etc/mail/spamassassin/:
tar xvfz fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/
mv FuzzyOcr* /etc/mail/spamassassin/
If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, replace the last command with
mv FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/
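Do not delete the /usr/src/FuzzyOcr-3.5.1/ directory. It still contains a directory with sample image spam mails (samples/) that we need later to test whether FuzzyOCR works as expected.
FuzzyOCR is now installed; next we have to configure it.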
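A quick check that the move put everything where the configuration expects it might look like the sketch below. The function name check_fuzzyocr_files is made up, and the list of names is an assumption: it contains only the files and the directory that this tutorial refers to, not necessarily everything in the tarball.

```shell
#!/bin/sh
# Adjust if your SpamAssassin configuration directory differs.
SA_DIR="${SA_DIR:-/etc/mail/spamassassin}"

# check_fuzzyocr_files DIR (hypothetical helper): report which of the expected
# FuzzyOCR items are missing from DIR; returns non-zero if any is missing.
check_fuzzyocr_files() {
    dir=$1
    status=0
    for name in FuzzyOcr.cf FuzzyOcr.words FuzzyOcr.preps FuzzyOcr.scansets; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing file: $dir/$name"
            status=1
        fi
    done
    if [ ! -d "$dir/FuzzyOcr" ]; then
        echo "missing directory: $dir/FuzzyOcr"
        status=1
    fi
    return $status
}

if check_fuzzyocr_files "$SA_DIR"; then
    echo "FuzzyOCR files are in place in $SA_DIR"
fi
```

For an ISPConfig 2 setup, run it with SA_DIR set to the directory under /home/admispconfig/ispconfig/tools/spamassassin/.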
4 Configuring FuzzyOCR
FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. Almost everything in that file is commented out. We now open the file and make a few changes:
vi /etc/mail/spamassassin/FuzzyOcr.cf
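Put the following line into it to define the location of FuzzyOCR's spam word list:
[...]
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
[...]
/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that ships with FuzzyOCR. You can adjust it to your needs if you like.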
Next, change
[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, pamthreshold, ppmtopgm, pamtopnm
focr_bin_helper tesseract
[...]
to
[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract
[...]
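Finally, add or enable the following lines:
[...]
# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
[...]
The last four lines enable image hashing. This is what the FuzzyOCR developers say about image hashing:
"The image hashing database feature allows the plugin to store a vector of image features in a database, so that it knows the image when it arrives a second time (and therefore does not need to scan it again). The special thing about this feature is that it also recognizes the image again if it has been changed slightly (which spammers do)."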
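Since almost everything in FuzzyOcr.cf is commented out, it is easy to lose track of which settings are actually active. The sketch below lists them and then lets SpamAssassin check its configuration with --lint. The function name active_focr_settings is made up for illustration, and the filter assumes that every FuzzyOCR directive starts with "focr_", as the ones shown in this tutorial do.

```shell
#!/bin/sh
# active_focr_settings FILE (hypothetical helper): print the FuzzyOCR
# directives in FILE that are not commented out, i.e. lines whose first
# non-blank text starts with "focr_".
active_focr_settings() {
    grep -E '^[[:space:]]*focr_' "$1"
}

CONF="${CONF:-/etc/mail/spamassassin/FuzzyOcr.cf}"
if [ -f "$CONF" ]; then
    active_focr_settings "$CONF"
fi

# Let SpamAssassin parse its configuration files and report problems.
if command -v spamassassin >/dev/null 2>&1; then
    if spamassassin --lint; then
        echo "configuration parsed without errors"
    else
        echo "spamassassin --lint reported problems"
    fi
fi
```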
If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that one. In the configuration file you can either replace all occurrences of /etc/mail/spamassassin with /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin, or leave the paths as shown above and create a symlink from /etc/mail/spamassassin to /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin like this:
mkdir -p /etc/mail/
ln -s /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/ /etc/mail/spamassassin
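If you prefer the first option (replacing every occurrence of /etc/mail/spamassassin in the configuration file), sed can do it in one pass. This is a sketch only: retarget_paths is a made-up helper name, and it assumes the new directory contains neither "#" nor "&", which would confuse the sed expression.

```shell
#!/bin/sh
# retarget_paths FILE NEW_DIR (hypothetical helper): replace every occurrence
# of /etc/mail/spamassassin in FILE with NEW_DIR, keeping the original
# content in FILE.bak.
retarget_paths() {
    file=$1
    new_dir=${2%/}    # drop a trailing slash, if any
    cp "$file" "$file.bak" &&
        sed "s#/etc/mail/spamassassin#$new_dir#g" "$file.bak" > "$file"
}

# Example call, left commented out so that nothing is modified by accident:
# retarget_paths \
#     /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf \
#     /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin
```

Run it only once per file. The ISPConfig path itself ends in /etc/mail/spamassassin, so a second run would rewrite the already corrected paths again.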
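That's it for the FuzzyOCR configuration. Now let's see whether it works as expected.
5 Testing FuzzyOCR
As mentioned earlier, FuzzyOCR ships with sample image spam mails (in the samples/ directory):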
ls -l /usr/src/FuzzyOcr-3.5.1/samples/
The output should look like this:
total 156
-rw-r--r-- 1 1000 users 13633 2007-01-07 12:55 ocr-animated.eml
-rw-r--r-- 1 1000 users 16108 2007-01-07 12:55 ocr-gif.eml
-rw-r--r-- 1 1000 users 27506 2007-01-07 12:55 ocr-jpg.eml
-rw-r--r-- 1 1000 users 27842 2007-01-07 12:59 ocr-multi.eml
-rw-r--r-- 1 1000 users 24657 2007-01-07 12:55 ocr-obfuscated.eml
-rw-r--r-- 1 1000 users 18236 2007-01-07 12:55 ocr-png.eml
-rw-r--r-- 1 1000 users 16113 2007-01-07 12:55 ocr-wrongext.eml
-rw-r--r-- 1 1000 users 3576 2007-01-07 12:55 README
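We can now feed these emails to SpamAssassin to see whether FuzzyOCR is hooked into SpamAssassin correctly. First find out where your spamassassin executable is located. Usually it is in your PATH; you can check by running
which spamassassin
If this prints a result, spamassassin is in your PATH and you do not need to specify its full path to run it.
If you do not know where spamassassin is, you can find it by running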
updatedb
locate spamassassin
If you use ISPConfig 2, spamassassin is located at /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin.
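The lookup can also be scripted. The sketch below prefers whatever is in PATH and falls back to the ISPConfig 2 location mentioned above; first_executable is a made-up helper name, and any further candidate paths you add are your own assumptions about your system.

```shell
#!/bin/sh
# first_executable PATH... (hypothetical helper): print the first argument
# that is an executable regular file; return non-zero if none is.
first_executable() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the copy in PATH, then fall back to the ISPConfig 2 location.
SPAMASSASSIN=$(command -v spamassassin 2>/dev/null) ||
    SPAMASSASSIN=$(first_executable \
        /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin) ||
    SPAMASSASSIN=""
echo "spamassassin executable: ${SPAMASSASSIN:-not found}"
```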
Now that you know where spamassassin is, you can feed one of the sample spam mails to it as follows:
/path/to/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null
For example:
/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null
Or, if spamassassin is in your PATH:
spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null
You should now see a lot of output; the end of it should look like this:
[...]
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.
As you can see, /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml has been classified as spam with a score of 15, so FuzzyOCR is working.
With the help of FuzzyOCR, your SpamAssassin can now recognize image spam.
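To run the same test against every sample instead of just ocr-gif.eml, a small loop is enough. This is a sketch under assumptions: extract_score is a made-up helper that only looks for the "Message is spam, score = ..." line shown in the output above, and spamassassin is expected to be in your PATH. Once image hashing has stored a sample, a repeated run may report the result through a different log line, which this sketch does not parse.

```shell
#!/bin/sh
# extract_score (hypothetical helper): read SpamAssassin debug output on stdin
# and print the score from FuzzyOCR's "Message is spam, score = N" line.
extract_score() {
    sed -n 's/.*FuzzyOcr: Message is spam, score = \([0-9.]*\).*/\1/p'
}

SAMPLES="${SAMPLES:-/usr/src/FuzzyOcr-3.5.1/samples}"

if command -v spamassassin >/dev/null 2>&1; then
    for mail in "$SAMPLES"/*.eml; do
        [ -f "$mail" ] || continue
        # Debug messages go to stderr, so send stderr into the pipe and
        # discard the rewritten message that arrives on stdout.
        score=$(spamassassin --debug FuzzyOcr < "$mail" 2>&1 >/dev/null |
            extract_score | head -n 1)
        echo "$(basename "$mail"): ${score:-no FuzzyOCR spam verdict found}"
    done
fi
```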
6 Links
- FuzzyOCR: http://www.fuzzyocr.net/
- SpamAssassin: http://spamassassin.apache.org/
- Fedora: http://fedoraproject.org/