跳至主要内容

【转】贝叶斯算法在反垃圾邮件中的应用

一、 贝叶斯反垃圾邮件技术介绍

  贝叶斯是基于概率的一种算法,是Thomas Bayes:一位伟大的数学大师所创建的,目前此种算法用于过滤垃圾邮件得到了广泛地好评。贝叶斯过滤器是基于"自我学习"的智能技术,能够使自己适应垃圾邮件制造者的新把戏,同时为合法电子邮件提供保护。在智能邮件过滤技术中,贝叶斯(Bayesian)过滤技术取得了较大的成功,被越来越多地应用在反垃圾邮件的产品中。

二、 贝叶斯过滤算法的基本步骤


  1. 收集大量的垃圾邮件和非垃圾邮件,建立垃圾邮件集和非垃圾邮件集。

  2. 提取邮件主题和邮件体中的独立字符串,例如 ABC32,¥234等作为TOKEN串并统计提取出的TOKEN串出现的次数即字频。按照上述的方法分别处理垃圾邮件集和非垃圾邮件集中的所有邮件。

  3. 每一个邮件集对应一个哈希表,hashtable_good对应非垃圾邮件集而hashtable_bad对应垃圾邮件集。表中存储TOKEN串到字频的映射关系。

  4. 计算每个哈希表中TOKEN串出现的概率P=(某TOKEN串的字频)/(对应哈希表的长度)

  5. 综合考虑hashtable_good和hashtable_bad,推断出当新来的邮件中出现某个TOKEN串时,该新邮件为垃圾邮件的概率。数学表达式为:

  A 事件 ---- 邮件为垃圾邮件;

  t1,t2 …….tn 代表 TOKEN 串

  则 P ( A|ti )表示在邮件中出现 TOKEN 串 ti 时,该邮件为垃圾邮件的概率。

  设

  P1 ( ti ) = ( ti 在 hashtable_good 中的值)

  P2 ( ti ) = ( ti 在 hashtable_ bad 中的值)

  则 P ( A|ti ) =P2 ( ti ) /[ ( P1 ( ti ) +P2 ( ti ) ] ;


  6. 建立新的哈希表hashtable_probability存储TOKEN串ti到P(A|ti)的映射

  7. 至此,垃圾邮件集和非垃圾邮件集的学习过程结束。根据建立的哈希表 hashtable_probability可以估计一封新到的邮件为垃圾邮件的可能性。

  当新到一封邮件时,按照步骤2,生成TOKEN串。查询hashtable_probability得到该TOKEN 串的键值。

  假设由该邮件共得到N个TOKEN 串,t1,t2…….tn,hashtable_probability中对应的值为 P1 , P2 , ……PN , P(A|t1 ,t2, t3……tn) 表示在邮件中同时出现多个TOKEN串t1,t2……tn时,该邮件为垃圾邮件的概率。

  由复合概率公式可得
  P(A|t1 ,t2, t3……tn)=(P1*P2*……PN)/[P1*P2*……PN+(1-P1)*(1-P2)*……(1-PN)]

  当 P(A|t1 ,t2, t3……tn) 超过预定阈值时,就可以判断邮件为垃圾邮件。


三、 贝叶斯过滤算法举例

评论

此博客中的热门博文

【转】smb协议栈使用示例

/*  * * uncdownload.c  * *  * * Utility for downloading files from SMB shares using libsmbclient  * *  * * Copyright(C) 2006 Sophos Plc, Oxford, England.  * *  * * This program is free software; you can redistribute it and/or modify it under the terms of the  * * GNU General Public License Version 2 as published by the Free Software Foundation.  * *  * * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without  * * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  * * See the GNU General Public License for more details.  * *  * * You should have received a copy of the GNU General Public License along with this program; if not,  * * write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA  * *  * */ # include < sys / types . h > # include < sys / time . h > # include ...

【转】Ether Types

Ether Types (last updated 2008-09-09) NOTE: Please see [RFC5342] for current information and registration procedures. This registry will be revised soon and will be replaced with up-to-date information. Many of the networks of all classes are Ethernets (10Mb) or Experimental Ethernets (3Mb). These systems use a message "type" field in much the same way the ARPANET uses the "link" field. If you need an Ether Type, contact: IEEE Registration Authority IEEE Standards Department 445 Hoes Lane Piscataway, NJ 08854 Phone +1 732 562 3813 Fax: +1 732 562 1571 Email: <ieee-registration-authority& ieee.org > http://standards.ieee.org/regauth/index.html The following list of EtherTypes is contributed unverified information from various sources. Another list of EtherTypes is maintained by Michael A. Patton and is accessible at: <URL: http://www.cavebear.com/CaveBear/Ethernet/ > <URL: ftp://ftp.cavebear.com/pub/Ethernet-codes > Assign...

【转】tcphdr结构详解

位于:/usr/src/linux/include/linux/tcp.h struct tcphdr { __be16 source; __be16 dest; __be32 seq; __be32 ack_seq; #if defined(__LITTLE_ENDIAN_BITFIELD) __u16   res1:4, doff:4, fin:1, syn:1, rst:1, psh:1, ack:1, urg:1, ece:1, cwr:1; #elif defined(__BIG_ENDIAN_BITFIELD) __u16   doff:4, res1:4, cwr:1, ece:1, urg:1, ack:1, psh:1, rst:1, syn:1, fin:1; #else #error "Adjust your <asm/byteorder.h> defines"