Sensitive-Word Filtering and Anti-Spam Text: Related Knowledge (Worth Bookmarking)

Published: 2023/12/20 | Programming Q&A | 27 | 豆豆


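First, some sensitive-word lexicons:

1. funNLP (sensitive-word lexicon)

2. chat-censorship
Data from research into censorship in chat clients. The repository contains keyword blacklists and lists of other content, such as URLs or images, that trigger censorship in applications used in China (including Weibo, WeChat, Line, and Skype).

3. A sensitive-word lexicon compiled from the web, with a Java implementation

See GitHub.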

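Algorithms for sensitive-word filtering:

1. Use a sensitive-word filtering system.
Content review is carried out on a review platform. The site's moderation system is preloaded with a keyword lexicon whose terms are arranged and combined into phrases, and the lexicon is further classified by sensitivity. The system either blocks users from posting sensitive words or directly deletes posted content that contains them. Content containing words of lower sensitivity is not deleted immediately; it is passed to a human reviewer for a second review.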
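The tiered handling described above can be sketched roughly as follows. This is only an illustration, not code from any of the projects listed here: the `Action` levels, the `LEXICON` dictionary with its placeholder words, and the `moderate` function are all hypothetical names, and plain substring matching stands in for a real matcher such as the automata shown later.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"    # publish immediately
    REVIEW = "review"  # keep the post, but queue it for a human reviewer
    BLOCK = "block"    # reject or delete the post


# Hypothetical lexicon: each word is tagged with the action it triggers.
LEXICON = {
    "badword": Action.BLOCK,
    "borderline": Action.REVIEW,
}

# Severity ranking, used to pick the strictest action that was triggered.
SEVERITY = {Action.ALLOW: 0, Action.REVIEW: 1, Action.BLOCK: 2}


def moderate(text):
    """Return the strictest action triggered by any lexicon word in text."""
    lowered = text.lower()
    decision = Action.ALLOW
    for word, action in LEXICON.items():
        if word in lowered and SEVERITY[action] > SEVERITY[decision]:
            decision = action
    return decision


if __name__ == "__main__":
    for sample in ["hello world", "a borderline case", "Borderline BADWORD"]:
        print(sample, "->", moderate(sample).value)
```

Keeping the action next to each word means that reclassifying a word only requires a lexicon change, not a code change.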
AC automaton (Aho-Corasick) algorithm (principle)

# Python implementation
# -*- coding:utf-8 -*-
import time
from collections import deque

time1 = time.time()


# AC automaton (Aho-Corasick)
class node(object):
    def __init__(self):
        self.next = {}       # child nodes keyed by character
        self.fail = None     # failure pointer
        self.isWord = False  # True if a sensitive word ends at this node
        self.word = ""


class ac_automation(object):
    def __init__(self):
        self.root = node()

    # Add a sensitive word to the trie
    def addword(self, word):
        temp_root = self.root
        for char in word:
            if char not in temp_root.next:
                temp_root.next[char] = node()
            temp_root = temp_root.next[char]
        temp_root.isWord = True
        temp_root.word = word

    # Build the failure pointers breadth-first
    def make_fail(self):
        temp_que = deque([self.root])
        while temp_que:
            temp = temp_que.popleft()
            for key, child in temp.next.items():
                if temp is self.root:
                    child.fail = self.root
                else:
                    p = temp.fail
                    while p is not None:
                        if key in p.next:
                            # Point at the matching child, not at p.fail
                            child.fail = p.next[key]
                            break
                        p = p.fail
                    if p is None:
                        child.fail = self.root
                temp_que.append(child)

    # Find sensitive words in content (call make_fail() first)
    def search(self, content):
        p = self.root
        result = []
        for word in content:
            while word not in p.next and p is not self.root:
                p = p.fail
            if word in p.next:
                p = p.next[word]
            else:
                p = self.root
            # Follow failure pointers so that every word ending here is
            # reported, including overlapping and nested words
            q = p
            while q is not self.root:
                if q.isWord:
                    result.append(q.word)
                q = q.fail
        return result

    # Load the sensitive-word lexicon, one word per line
    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                keyword = keyword.strip()
                if keyword:
                    self.addword(keyword)
        self.make_fail()  # failure pointers must exist before searching

    # Replace sensitive words with asterisks
    def words_replace(self, text):
        """
        :param text: input text
        :return: text with the sensitive words masked
        """
        result = list(set(self.search(text)))
        for x in result:
            text = text.replace(x, '*' * len(x))
        return text


if __name__ == '__main__':
    ah = ac_automation()
    path = 'keywords.txt'
    ah.parse(path)
    text1 = input('Enter text: ')
    text2 = ah.words_replace(text1)
    print(text2)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

DFA algorithm (principle)

# Python implementation
# -*- coding:utf-8 -*-
import time

time1 = time.time()


# DFA algorithm
class DFAFilter():
    def __init__(self):
        self.keyword_chains = {}  # nested dicts, one level per character
        self.delimit = '\x00'     # marker key meaning "a keyword ends here"

    # Add one keyword to the nested-dict state machine
    def add(self, keyword):
        keyword = keyword.lower()
        chars = keyword.strip()
        if not chars:
            return
        level = self.keyword_chains
        for i in range(len(chars)):
            if chars[i] in level:
                level = level[chars[i]]
            else:
                if not isinstance(level, dict):
                    break
                # Create the remaining chain, then mark its last node as final
                for j in range(i, len(chars)):
                    level[chars[j]] = {}
                    last_level, last_char = level, chars[j]
                    level = level[chars[j]]
                last_level[last_char] = {self.delimit: 0}
                break
        # The keyword is a prefix of an existing chain: mark the end here
        if i == len(chars) - 1:
            level[self.delimit] = 0

    # Load the sensitive-word lexicon, one word per line
    def parse(self, path):
        with open(path, encoding='gbk') as f:
            for keyword in f:
                self.add(str(keyword).strip())

    # Mask keywords with repl; note that the returned text is lowercased
    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        # Shortest keyword starting at `start` matched
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                # Ran out of input in the middle of a chain: no match
                ret.append(message[start])
            start += 1
        return ''.join(ret)


if __name__ == "__main__":
    gfw = DFAFilter()
    path = "keywords.txt"
    gfw.parse(path)
    text = input("Enter text: ")
    result = gfw.filter(text)
    print(result)
    time2 = time.time()
    print('Total time: ' + str(time2 - time1) + 's')

3. TTMP, an algorithm devised by a netizen (principle, code)

Building an anti-spam mechanism:

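We often run into spam: junk mail in our inboxes, zombie followers on Sina Weibo, and the endless stream of ad posts on forums. Some people constantly probe a site's loopholes and rules and use bots to post such ads for profit. Anti-spam mainly means using technical means to filter and screen data: data judged to be unacceptable is removed, and information the system considers suspicious is flagged and classified. Anti-spam and manual review complement each other.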
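The split into "remove" and "flag as suspicious" can be illustrated with a small rule-based triage. This is a sketch under stated assumptions rather than how any of the systems mentioned in this article work: the three signals, the threshold values, and the names `spam_score` and `triage` are all hypothetical, and production systems usually learn such rules from labelled data.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical thresholds; real systems tune these on labelled data.
REJECT_SCORE = 3
SUSPICIOUS_SCORE = 1


def spam_score(text, text_counts):
    """Score one message with a few hand-written signals."""
    score = 0
    if len(URL_PATTERN.findall(text)) >= 2:
        score += 2  # message carries several links
    if text_counts[text] >= 3:
        score += 2  # identical text was posted at least three times
    if re.search(r"(.)\1{5,}", text):
        score += 1  # one character repeated six or more times in a row
    return score


def triage(messages):
    """Split messages into clean, suspicious, and rejected buckets."""
    counts = Counter(messages)
    buckets = {"clean": [], "suspicious": [], "rejected": []}
    for text in messages:
        score = spam_score(text, counts)
        if score >= REJECT_SCORE:
            buckets["rejected"].append(text)
        elif score >= SUSPICIOUS_SCORE:
            buckets["suspicious"].append(text)
        else:
            buckets["clean"].append(text)
    return buckets


if __name__ == "__main__":
    posts = ["nice post", "wow!!!!!!"]
    posts += ["buy http://a.example http://b.example"] * 3
    for name, items in triage(posts).items():
        print(name, len(items))
```

Messages in the suspicious bucket are the ones that would go to human reviewers, which is where anti-spam and manual review meet.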
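First, a few examples:

Facebook's anti-spam practice
Zhihu anti-cheating: spam text detection
Overview of text anti-spam at Huajiao Live
[NLP text classification] A collection of text classification algorithms, from beginner to expert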

GitHub projects related to sensitive words:

1.ToolGood.Words

2.text-antispam

3.textfilter

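A collection of quality Chinese NLP resources:
It includes language detection, location and carrier lookup for Chinese and international mobile and landline numbers, gender inference from names, mobile number extraction, ID number extraction, email extraction, BERT-related resources, and more.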
https://github.com/fighting41love/funNLP

Open it and you will find the treasure you need!
