Fighting Image Spam With FuzzyOCR And SpamAssassin On Fedora 12

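This tutorial describes how to scan emails for image spam with FuzzyOCR on a Fedora 12 server. FuzzyOCR is a plugin for SpamAssassin that targets unsolicited bulk mail which uses images as its main content carrier. It applies several methods to analyze the content and properties of images in order to distinguish normal mail (ham) from spam. FuzzyOCR tries to keep the system load low by scanning only mails that SpamAssassin has not already classified as spam, thus avoiding unnecessary work.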

I do not issue any guarantee that this will work for you!

1 Preliminary Note

In this article I am using Fedora 12 as the base system.

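I am assuming that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory is different (for example, with ISPConfig 2 installed it is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), that is no problem. I will point out what to change and where.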

Please make sure that your SpamAssassin version works with FuzzyOCR. For example, the FuzzyOCR version I am going to install here (fuzzyocr-3.5.1-devel.tar.gz) requires SpamAssassin 3.1.4 or newer.
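The version requirement above can be checked from the command line. The following is only a sketch: version_ge is a hypothetical helper (not part of SpamAssassin or FuzzyOCR), it assumes GNU sort with the -V option, and it assumes the version number appears on the first line of the output of spamassassin --version.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Assumes GNU sort, which supports version ordering via -V.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=3.1.4   # minimum version named for fuzzyocr-3.5.1-devel

# Take the first dotted number from the first line of the version output.
installed=$(spamassassin --version 2>/dev/null | head -n 1 \
    | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)

if [ -z "$installed" ]; then
    echo "Could not determine the SpamAssassin version (is spamassassin in your PATH?)"
elif version_ge "$installed" "$required"; then
    echo "SpamAssassin $installed is new enough (need $required or newer)"
else
    echo "SpamAssassin $installed is too old (need $required or newer)"
fi
```

If spamassassin is not in your PATH, replace the bare command name with its full path.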

2 Installing FuzzyOCR's Prerequisites

FuzzyOCR has some prerequisites, such as ocrad and gocr, which we can install like this:

yum install netpbm gifsicle giflib giflib-utils gocr ocrad ImageMagick tesseract perl-String-Approx perl-MLDBM perl-CPAN
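After the packages are installed, it can be useful to confirm that the helper programs are actually reachable. The sketch below uses a hypothetical function, missing_tools, that prints every name it cannot find in the PATH. The list of program names is an assumption based on the package names above and the helpers referenced later in FuzzyOcr.cf; adjust it to your setup.

```shell
#!/bin/sh
# Hypothetical helper: print each given program name that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Program names assumed from the packages and config used in this tutorial.
missing=$(missing_tools gocr ocrad tesseract gifsicle convert pnmnorm pnminvert ppmtopgm)

if [ -z "$missing" ]; then
    echo "All helper programs were found."
else
    echo "Not found in PATH:"
    printf '%s\n' "$missing"
fi
```

A program reported as missing may simply live in a directory that is not in your shell's PATH, so check the package contents before reinstalling.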

We also need to install the MLDBM::Sync Perl module, which is not available as an RPM package. Open a Perl shell...

perl -MCPAN -e shell

... and install the module as follows:

install MLDBM::Sync

Type

q

to leave the Perl shell afterwards.
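To confirm that the module installed correctly, you can ask Perl to load it. This is a minimal sketch; perl_module_ok is a hypothetical helper name, and it relies only on Perl's standard -M and -e switches.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if Perl can load the named module.
perl_module_ok() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if perl_module_ok MLDBM::Sync; then
    echo "MLDBM::Sync is available."
else
    echo "MLDBM::Sync could not be loaded; repeat the CPAN installation."
fi
```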

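3 Installing FuzzyOCR

Next we download and install the latest FuzzyOCR devel version from http://fuzzyocr.own-hero.net/wiki/Downloads. We download the devel version instead of the stable version because the FuzzyOCR developers say:

"The current recommendation is the development version, as the stable version lacks features and is very old."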

cd /usr/src/
wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz

Then we unpack FuzzyOCR and move all FuzzyOcr* files and the FuzzyOcr directory (they are all in the FuzzyOcr-3.5.1/ directory) to /etc/mail/spamassassin:

tar xvfz fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/
mv FuzzyOcr* /etc/mail/spamassassin/

If your SpamAssassin directory is different, for example /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, the last command should be replaced with

mv FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/

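Do not delete the /usr/src/FuzzyOcr-3.5.1/ directory. It also contains a directory with sample image spam (samples/) that we will need later to test whether FuzzyOCR works as expected.

FuzzyOCR is now installed; next we need to configure it.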

4 Configuring FuzzyOCR

FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. Almost everything in that file is commented out. We now open the file and make a few modifications:

vi /etc/mail/spamassassin/FuzzyOcr.cf

Put the following line into it to define the location of FuzzyOCR's spam word list:

[...]
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
[...]

/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that comes with FuzzyOCR. If you like, you can adjust it to your needs.

Next, change the first of the following two blocks into the second:

[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, pamthreshold, ppmtopgm, pamtopnm
focr_bin_helper tesseract
[...]

[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract
[...]

Finally, add or enable the following lines:

[...]
# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin

focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets

focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
[...]

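The last four lines enable image hashing. This is what the FuzzyOCR developers say about image hashing:

"The image hashing database feature allows the plugin to store a vector of image features in a database, so it knows this image when it arrives a second time (and therefore does not need to scan it again). The special thing about this function is that it also recognizes the image again if it was changed slightly (which is what spammers do)."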

If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that one. In the configuration file you can either replace all occurrences of /etc/mail/spamassassin with /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin, or leave the paths as shown above and create a symlink from /etc/mail/spamassassin to /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin like this:

mkdir -p /etc/mail/
ln -s /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/ /etc/mail/spamassassin
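Because almost everything in FuzzyOcr.cf is commented out, it is easy to lose track of which settings are actually active. The following sketch lists only the uncommented focr_ lines. The function name active_focr_settings is hypothetical, and the path assumes the default configuration directory used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) focr_ settings of a file.
active_focr_settings() {
    grep -E '^[[:space:]]*focr_' "$1"
}

# Assumed location; use your own path if your directory differs.
conf=/etc/mail/spamassassin/FuzzyOcr.cf

if [ -r "$conf" ]; then
    active_focr_settings "$conf"
else
    echo "Cannot read $conf" >&2
fi
```

Running spamassassin --lint afterwards makes SpamAssassin parse its configuration and report syntax problems.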

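That's it for the FuzzyOCR configuration. Now let's see whether it works as expected.

5 Testing FuzzyOCR

I mentioned earlier that FuzzyOCR comes with sample image spam (in the samples/ directory):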

ls -l /usr/src/FuzzyOcr-3.5.1/samples/

The output should look like this:

total 156
-rw-r--r-- 1 1000 users 13633 2007-01-07 12:55 ocr-animated.eml
-rw-r--r-- 1 1000 users 16108 2007-01-07 12:55 ocr-gif.eml
-rw-r--r-- 1 1000 users 27506 2007-01-07 12:55 ocr-jpg.eml
-rw-r--r-- 1 1000 users 27842 2007-01-07 12:59 ocr-multi.eml
-rw-r--r-- 1 1000 users 24657 2007-01-07 12:55 ocr-obfuscated.eml
-rw-r--r-- 1 1000 users 18236 2007-01-07 12:55 ocr-png.eml
-rw-r--r-- 1 1000 users 16113 2007-01-07 12:55 ocr-wrongext.eml
-rw-r--r-- 1 1000 users  3576 2007-01-07 12:55 README

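We can now feed these emails to SpamAssassin to see whether FuzzyOCR is correctly hooked into SpamAssassin. First find out where your spamassassin executable is. Usually it is in your PATH; you can check by running

which spamassassin

If this displays a result, spamassassin is in your PATH, and you do not need to specify its full path to run it.

If you do not know where spamassassin is, you can find it by running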

updatedb
locate spamassassin

If you use ISPConfig 2, spamassassin is located at /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin

Now that you know where spamassassin is, you can feed one of the sample spam mails to it as follows:

/path/to/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

For example:

/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

Or, if spamassassin is in your PATH:

spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

You should now see a lot of output; the end should look like this:

[...]
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.

As you can see, /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml has been classified as spam with a score of 15, so FuzzyOCR is working.

Your SpamAssassin is now able to recognize image spam, thanks to FuzzyOCR.
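To run the same test against every sample without reading the full debug log each time, the verdict line can be filtered out. This is only a sketch: focr_score is a hypothetical helper that matches the "Message is spam, score = ..." line exactly as it appears in the output above, so other messages (for instance when an image is recognized from the hash database) are not matched, and not every sample is guaranteed to produce a verdict on every system. The debug log is written to standard error, which is why it is redirected into the pipe.

```shell
#!/bin/sh
# Hypothetical helper: read a FuzzyOcr debug log on stdin and print the score
# from the "Message is spam, score = N" line, if there is one.
focr_score() {
    sed -n 's/.*FuzzyOcr: Message is spam, score = \([0-9.]*\).*/\1/p'
}

samples=/usr/src/FuzzyOcr-3.5.1/samples

for mail in "$samples"/*.eml; do
    [ -e "$mail" ] || continue   # skip if the directory has no .eml files
    # Send the debug log (stderr) into the pipe and discard the rewritten mail.
    score=$(spamassassin --debug FuzzyOcr < "$mail" 2>&1 >/dev/null | focr_score)
    printf '%s: %s\n' "${mail##*/}" "${score:-no FuzzyOcr spam verdict found}"
done
```

Replace spamassassin with the full path to the executable if it is not in your PATH.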

6 Links
