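DFA Sensitive-Word Filtering Algorithm
This post uses the DFA (deterministic finite automaton) algorithm to detect and mask sensitive words.
First, initialize the sensitive-word lexicon. Suppose the lexicon contains: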
冰毒
白粉
大麻
大壞蛋
After initialization, the lexicon becomes the following nested map:
{冰={毒={isEnd=1}, isEnd=0}, 白={粉={isEnd=1}, isEnd=0}, 大={麻={isEnd=1}, isEnd=0, 壞={蛋={isEnd=1}, isEnd=0}}}
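The nested structure above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name TrieBuildSketch and the method build are hypothetical, and LinkedHashMap is used just to keep the printed order stable. The isEnd key and its "0"/"1" values follow the post.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original post's classes.
class TrieBuildSketch {
    static final String END = "isEnd";

    @SuppressWarnings("unchecked")
    static Map<Object, Object> build(List<String> words) {
        Map<Object, Object> root = new LinkedHashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (int i = 0; i < word.length(); i++) {
                // Reuse the child for this character, or create one marked "not the end".
                node = (Map<Object, Object>) node.computeIfAbsent(word.charAt(i), k -> {
                    Map<Object, Object> child = new LinkedHashMap<>();
                    child.put(END, "0");
                    return child;
                });
            }
            // The node reached by the last character marks the end of a word.
            node.put(END, "1");
        }
        return root;
    }

    public static void main(String[] args) {
        // Prints the same nesting as the structure above; key order inside each map may differ.
        System.out.println(build(List.of("冰毒", "白粉", "大麻", "大壞蛋")));
    }
}
```

Note that 大麻 and 大壞蛋 share the single 大 node, which is what lets the filter test many words with one pass over the text.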
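OK, call the initialized data A. Suppose the text to check is: 張三是個大壞蛋,他竟然吸食白粉和冰毒。
To check whether the text contains a sensitive word, iterate over the text and take each character with charAt. Start at A and look the character up in the current map:
If A.get(charAt) is null, no sensitive word continues with this character, for example "張", "三", "是", "個" ...
If A.get(charAt) is not null, the character lies on a path in the lexicon, so continue the lookup in the map it returned, for example "大", then "壞", then "蛋" ...
Suppose we have matched "大" and "壞". What happens next depends on the project requirement. If you only need to know whether the input box contains a sensitive word,
you can stop at the first node whose isEnd is 1: the text already contains a sensitive word.
If you need to replace every sensitive word with "*", keep matching downward until you reach the last character of the word (isEnd=1), so that you know how many characters to replace.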
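The two modes described above (stop at the first hit, or keep going to find the full word before masking) can be sketched as follows. This is an illustration under assumptions, not the post's implementation: MaskSketch, Node, matchLength, contains and mask are hypothetical names, and a small Node class stands in for the nested maps.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a Node class replaces the nested Map used in the post.
class MaskSketch {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // plays the role of isEnd=1
    }

    static Node build(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            if (w.isEmpty()) continue;
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.end = true;
        }
        return root;
    }

    // Length of the word starting at 'from', or 0 if none starts there.
    // longest=false stops at the first end node; longest=true keeps the last one seen.
    static int matchLength(Node root, String txt, int from, boolean longest) {
        Node n = root;
        int matched = 0;
        for (int i = from; i < txt.length(); i++) {
            n = n.next.get(txt.charAt(i));
            if (n == null) break;
            if (n.end) {
                matched = i - from + 1;
                if (!longest) break;
            }
        }
        return matched;
    }

    // Detection only: the first complete word is enough.
    static boolean contains(Node root, String txt) {
        for (int i = 0; i < txt.length(); i++) {
            if (matchLength(root, txt, i, false) > 0) return true;
        }
        return false;
    }

    // Masking: overwrite each matched span in place, then skip past it.
    static String mask(Node root, String txt, char star) {
        StringBuilder out = new StringBuilder(txt);
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(root, txt, i, true);
            if (len == 0) { i++; continue; }
            for (int j = i; j < i + len; j++) out.setCharAt(j, star);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = build(List.of("冰毒", "白粉", "大麻", "大壞蛋"));
        String txt = "張三是個大壞蛋,他竟然吸食白粉和冰毒。";
        System.out.println(contains(root, txt)); // true
        System.out.println(mask(root, txt, '*')); // 張三是個***,他竟然吸食**和**。
    }
}
```

Masking by position, rather than by a later string replace, means the result does not depend on how the matched words would be interpreted as patterns.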
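That is the basic idea. The code follows; leave me a comment if anything is unclear.
Tips:
When initializing the sensitive-word lexicon:
1. I added a Redis cache.
2. I keep the lexicon file on the server.
3. Mind the encoding: the encoding used in the code must match the encoding of your lexicon file, UTF-8 or GBK. (On Windows, "Save As" on the txt file shows the encoding; on Linux, open the file in vim and run :set fileencoding.)
To convert the file encoding on Linux, here GBK -> UTF-8 (gb18030 is a superset of GBK; writing to a separate output file avoids overwriting the input while it is still being read): iconv -f gb18030 -t utf-8 sensitiveword.txt -o sensitiveword.utf8.txt
When you test with the main method, comment out the cache and change the lexicon path to a path on your local machine.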
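As an illustration of tip 3, here is one way to read a lexicon file with an explicit charset, without Redis or Spring, which may also be convenient for local testing. It is a sketch under assumptions: LexiconReadSketch, parseLines and readWords are hypothetical names, and dropping a leading byte-order mark and blank lines is my addition, not something the post does.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original post's classes.
class LexiconReadSketch {

    // Cleans raw lines: drops a leading byte-order mark, surrounding whitespace and blank lines.
    static Set<String> parseLines(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String w = line.startsWith("\uFEFF") ? line.substring(1) : line;
            w = w.trim();
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // The charset passed here must be the one the file was saved in.
    // Files.readAllLines throws an IOException on bytes that are malformed for that charset.
    static Set<String> readWords(Path file, Charset charset) throws IOException {
        return parseLines(Files.readAllLines(file, charset));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sensitiveword", ".txt");
        Files.write(tmp, "冰毒\n\n白粉\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWords(tmp, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

For a GBK file, Charset.forName("GBK") could be passed instead, provided the runtime ships that charset.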
package com.cmcc.admin.common.sensitive;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * @Type SensitiveWordFilter.java
 * @since 1.8
 * @author whb
 */
public class SensitiveWordFilter {

    @SuppressWarnings("rawtypes")
    private Map sensitiveWordMap = null;

    public static final int minMatchType = 1; // minimum-match rule: stop at the first complete word
    public static final int maxMatchType = 2; // maximum-match rule: keep going for the longest word

    /**
     * Constructor; initializes the sensitive-word lexicon.
     * @throws Exception
     */
    public SensitiveWordFilter() throws Exception {
        sensitiveWordMap = new SensitiveWordInit().initKeyWord();
    }

    /**
     * Checks whether a sensitive word starts at beginIndex.
     * @param txt
     * @param beginIndex
     * @param matchType
     * @return the length of the sensitive word if one exists, otherwise 0
     */
    @SuppressWarnings("rawtypes")
    public int checkSensitiveWord(String txt, int beginIndex, int matchType) {
        Map nowMap = sensitiveWordMap;
        int matchFlag = 0;     // number of characters matched so far
        int matchedLength = 0; // length at the last node with isEnd=1; also covers one-character words
        for (int i = beginIndex; i < txt.length(); i++) {
            char word = txt.charAt(i);
            nowMap = (Map) nowMap.get(word); // look up the child node for this character
            if (nowMap == null) {
                break; // no path continues with this character
            }
            matchFlag++; // found the key, so one more character matched
            if (isEnd(nowMap)) {
                // Record the length only at word ends, so a trailing partial
                // prefix is never counted as part of the word.
                matchedLength = matchFlag;
                if (SensitiveWordFilter.minMatchType == matchType) {
                    break; // minimum rule returns here; maximum rule keeps looking
                }
            }
        }
        return matchedLength;
    }

    /**
     * Whether the text contains a sensitive word.
     * @param txt
     * @param matchType
     * @return true if it does, false otherwise
     */
    public boolean isContaintSensitiveWord(String txt, int matchType) {
        for (int i = 0; i < txt.length(); i++) {
            if (this.checkSensitiveWord(txt, i, matchType) > 0) {
                return true; // one hit is enough
            }
        }
        return false;
    }

    /**
     * Whether the text contains a sensitive word (default for the Chongqing project:
     * minimum-match rule, any sensitive word is enough).
     * If the lexicon is:
     *     中
     *     中國
     *     中國人
     * then after initialization it is: {中={isEnd=1, 國={人={isEnd=1}, isEnd=1}}}
     * With the test text: 我是一名中國人。
     * 1. Minimum rule: 中 is already the end of a word, so break immediately.
     * 2. Maximum rule: 中 is the end of a word, but matching continues with 國 and 人.
     * @param txt
     * @return true if it does, false otherwise
     */
    public boolean isSensitive(String txt) {
        return isContaintSensitiveWord(txt, minMatchType);
    }

    /**
     * Collects the sensitive words found in the text.
     * @param txt
     * @param matchType
     * @return
     */
    public Set<String> getSensitiveWord(String txt, int matchType) {
        Set<String> sensitiveWordList = new HashSet<String>();
        for (int i = 0; i < txt.length(); i++) {
            int length = checkSensitiveWord(txt, i, matchType);
            if (length > 0) { // found one, add it to the set
                sensitiveWordList.add(txt.substring(i, i + length));
                i = i + length - 1; // minus 1 because the for loop increments i
            }
        }
        return sensitiveWordList;
    }

    /**
     * Replaces the characters of every sensitive word.
     * @param txt
     * @param matchType
     * @param replaceChar
     * @return
     */
    public String replaceSensitiveWord(String txt, int matchType, String replaceChar) {
        String resultTxt = txt;
        for (String word : this.getSensitiveWord(txt, matchType)) { // all sensitive words in the text
            String replaceString = getReplaceChars(replaceChar, word.length());
            // String.replace is literal; replaceAll would treat the word as a regular expression.
            resultTxt = resultTxt.replace(word, replaceString);
        }
        return resultTxt;
    }

    /**
     * Builds the replacement string: replaceChar repeated length times.
     * @param replaceChar
     * @param length
     * @return
     */
    private String getReplaceChars(String replaceChar, int length) {
        StringBuilder resultReplace = new StringBuilder();
        for (int i = 0; i < length; i++) {
            resultReplace.append(replaceChar);
        }
        return resultReplace.toString();
    }

    /**
     * Whether this node is the last character of a word.
     * @param nowMap
     * @return
     */
    @SuppressWarnings("rawtypes")
    private boolean isEnd(Map nowMap) {
        return "1".equals(nowMap.get("isEnd"));
    }

    public static void main(String[] args) throws Exception {
        SensitiveWordFilter filter = new SensitiveWordFilter();
        // The size of the root map is the number of distinct first characters.
        System.out.println("Distinct first characters in the lexicon: " + filter.sensitiveWordMap.size());
        String string = "王弘博是個大壞蛋,他竟然吸食白粉和冰毒";
        System.out.println("Characters in the text to check: " + string.length());
        long beginTime = System.currentTimeMillis();
        Set<String> set = filter.getSensitiveWord(string, 1);
        String result = filter.replaceSensitiveWord(string, 1, "*");
        boolean flag = filter.isSensitive(string);
        System.out.println(flag);
        long endTime = System.currentTimeMillis();
        System.out.println("Sensitive words found: " + set.size() + ". They are: " + set);
        System.out.println("Text after masking: " + result);
        System.out.println("Total time (ms): " + (endTime - beginTime));
    }
}

The initialization code, SensitiveWordInit, is as follows:
package com.cmcc.admin.common.sensitive;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

import com.cmcc.aqb.cache.redis.RedisClient;

/**
 * @Type SensitiveWordInit.java
 * @Desc
 * @author whb
 * @date 2017-08-23 13:57:03
 * @since 1.8
 */
public class SensitiveWordInit {

    private static final String ENCODING = "utf-8"; // character encoding of the lexicon file

    @SuppressWarnings("rawtypes")
    public HashMap sensitiveWordMap;

    public SensitiveWordInit() {
        super();
    }

    static RedisClient redisClient = null;
    private static String SPILIT = "#";
    private static int EXPIRE_TIME = 3600; // seconds
    private static String SENSITIVE_WORD = SensitiveWordInit.class.getName();

    private String sensitiveWordKey(String type) {
        StringBuilder sb = new StringBuilder();
        sb.append(type).append(SPILIT).append("sensitiveWordInit");
        return sb.toString();
    }

    /**
     * Loads the lexicon from the Redis cache, or builds it from the file and caches it.
     * @return
     */
    @SuppressWarnings({ "rawtypes", "resource" })
    public Map initKeyWord() {
        try {
            ApplicationContext ac = new ClassPathXmlApplicationContext(
                    new String[] { "spring/datasource.xml", "spring/cache.xml" });
            redisClient = (RedisClient) ac.getBean("redisClient");
            String key = sensitiveWordKey(SENSITIVE_WORD);
            sensitiveWordMap = redisClient.get(key);
            if (sensitiveWordMap == null) {
                Set<String> set = readSensitiveWordFile();
                addSensitiveWordToHashMap(set);
                redisClient.put(key, sensitiveWordMap, EXPIRE_TIME);
            }
            ((AbstractApplicationContext) ac).registerShutdownHook();
            return sensitiveWordMap;
        } catch (Exception e) {
            // Keep the cause so the real failure is not lost.
            throw new RuntimeException("Failed to initialize the sensitive-word lexicon", e);
        }
    }

    /**
     * Reads the lexicon file and puts its contents into a set.
     * @return
     * @throws Exception
     */
    private Set<String> readSensitiveWordFile() throws Exception {
        File file = new File("/home/sensitiveword.txt");
        // Check before opening; FileInputStream would otherwise fail first.
        if (!file.isFile()) {
            throw new Exception("The sensitive-word lexicon file does not exist");
        }
        Set<String> set = new HashSet<String>();
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), ENCODING))) {
            String txt = null;
            while ((txt = bufferedReader.readLine()) != null) {
                txt = txt.trim();
                if (!txt.isEmpty()) { // skip blank lines
                    set.add(txt);
                }
            }
        }
        return set;
    }

    /**
     * Takes the set of sensitive words and builds the DFA model:
     * 中 = {
     *     isEnd = 0
     *     國 = {
     *         isEnd = 1
     *         人 = {
     *             isEnd = 0
     *             民 = {isEnd = 1}
     *         }
     *         男 = {
     *             isEnd = 0
     *             人 = {isEnd = 1}
     *         }
     *     }
     * }
     * 五 = {
     *     isEnd = 0
     *     星 = {
     *         isEnd = 0
     *         紅 = {
     *             isEnd = 0
     *             旗 = {isEnd = 1}
     *         }
     *     }
     * }
     * @param keyWordSet
     */
    @SuppressWarnings({ "rawtypes", "unchecked" })
    private void addSensitiveWordToHashMap(Set<String> keyWordSet) {
        sensitiveWordMap = new HashMap(keyWordSet.size()); // pre-size the container to reduce resizing
        String key = null;
        Map nowMap = null;
        Map newWorMap = null; // raw: holds both the "isEnd" flag and Character -> child map entries
        Iterator<String> iterator = keyWordSet.iterator();
        while (iterator.hasNext()) {
            key = iterator.next();
            nowMap = sensitiveWordMap;
            for (int i = 0; i < key.length(); i++) {
                char charKey = key.charAt(i); // take the character
                Object wordMap = nowMap.get(charKey);
                if (wordMap != null) {
                    nowMap = (Map) wordMap; // node already exists, descend into it
                } else {
                    // Not present: create a node with isEnd=0, because it is not yet known to be a last character.
                    newWorMap = new HashMap();
                    newWorMap.put("isEnd", "0");
                    nowMap.put(charKey, newWorMap);
                    nowMap = newWorMap;
                }
                if (i == key.length() - 1) { // last character of the word
                    nowMap.put("isEnd", "1");
                }
            }
        }
    }
}