Multilingual language identification in Python: language detection with Python

Published: 2024/7/23 python 30 豆豆

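I recently ran into this requirement, so here is a summary of ways to detect the language of a text in Python.

1. Detection using Unicode ranges

Chinese characters, Korean, Japanese and so on each occupy their own Unicode code point ranges, so a regular expression over those ranges is enough to pick them out.

Before checking, it is usually necessary to strip special characters such as Chinese and English punctuation. This can be done as follows: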

import re
import string

# Method 1: define the punctuation to remove yourself. Note that the outermost [] of this
# string is not the punctuation '[]' but the regex brackets that define a character class
remove_nota = u'[’·°–!"#$%&\'()*+,-./:;<=>?@，。?★、…【】()《》？“”‘’！[\\]^_`{|}~]+'
sentence = '測試。，[].?'
print(re.sub(remove_nota, '', sentence))
# Method 2: removes ASCII (English) punctuation only
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
print(sentence.translate(remove_punctuation_map))

Output:

測試

測試。，
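Method 1 relies on a hand-maintained character list, and Method 2 only knows ASCII punctuation. One possible alternative, which is not part of the original article, is to filter by Unicode general category with the standard `unicodedata` module. The helper name `strip_punctuation` and its `categories` parameter are illustrative assumptions. Treat this as a minimal sketch:

```python
import unicodedata

def strip_punctuation(text, categories=('P', 'S')):
    """Drop characters whose Unicode general category starts with one of `categories`.

    'P' covers punctuation (e.g. '。', '，', '[', '?'); 'S' covers symbols
    (e.g. '+', '=', '$', '★'). Both defaults are an assumption of this sketch.
    """
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(categories)
    )

print(strip_punctuation('測試。，[].?'))
```

This removes Chinese and English punctuation in one pass without listing each character, at the cost of less control over exactly which characters are dropped.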

Digits can be removed as well:

# Method 1
sentence = re.sub('[0-9]', '', sentence).strip()
# Method 2
remove_digits = str.maketrans('', '', string.digits)
sentence = sentence.translate(remove_digits)
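Both methods above only remove the ASCII digits 0-9. Mixed CJK text often contains full-width digits such as １２３, which they leave in place. A small sketch of one way to handle this, assuming Python 3 where `\d` on a `str` pattern matches any Unicode decimal digit (`strip_digits` is a hypothetical helper name):

```python
import re

def strip_digits(text):
    # In Python 3, \d applied to a str matches every Unicode decimal digit,
    # including full-width forms, not just ASCII 0-9.
    return re.sub(r'\d', '', text).strip()

mixed = 'abc１２３45'
print(strip_digits(mixed))           # removes both kinds of digits
print(re.sub('[0-9]', '', mixed))    # leaves the full-width digits in place
```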

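Now language detection can begin.

The idea is to match the characters of a given language in the sentence and substitute them away. If the string is empty after the substitution, the sentence is written purely in that language (with no other language mixed in).

A regular expression can also be used to find the characters in the sentence that belong to that language: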

s = "English Test"
re_words = re.compile(u"[a-zA-Z]")
res = re.findall(re_words, s)  # find all matching characters
print(res)
res2 = re.sub('[a-zA-Z]', '', s).strip()
print(res2)  # empty string
if len(res2) <= 0:
    print("This is English")

Output:

['E', 'n', 'g', 'l', 'i', 's', 'h', 'T', 'e', 's', 't']

This is English

English is matched with u"[a-zA-Z]"

Chinese with u"[\u4e00-\u9fa5]+"

Korean with u"[\uac00-\ud7ff]+"

Japanese with u"[\u30a0-\u30ff\u3040-\u309f]+" (covering both hiragana and katakana)
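The four patterns listed above can also be kept in one table and used to measure how much of a string falls into each range, instead of testing one language at a time. The following is only a sketch: `SCRIPT_PATTERNS` and `script_shares` are hypothetical names, and the ranges are copied from the list above without any claim that they cover each script completely.

```python
import re

# Character ranges taken from the list above; one character per match.
SCRIPT_PATTERNS = {
    'en': re.compile(r'[a-zA-Z]'),
    'zh': re.compile(r'[\u4e00-\u9fa5]'),
    'ko': re.compile(r'[\uac00-\ud7ff]'),
    'ja': re.compile(r'[\u30a0-\u30ff\u3040-\u309f]'),
}

def script_shares(text):
    """Return {code: share of non-space characters matched by that pattern}."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return {}
    shares = {}
    for code, pattern in SCRIPT_PATTERNS.items():
        hits = sum(1 for ch in chars if pattern.match(ch))
        if hits:
            shares[code] = hits / len(chars)
    return shares

print(script_shares('English Test'))
print(script_shares('ab漢字'))
```

A share of 1.0 for a single code corresponds to the "nothing left after substitution" check described earlier, while smaller shares give a rough picture of mixed text.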

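To keep only the content you need, for example Chinese, English and digits:

# Keep only Chinese characters, English letters and digits
# (characters specific to French, German, Korean, Japanese, etc. are removed)
rule = re.compile(u"[^a-zA-Z0-9\u4e00-\u9fa5]")
sentence = rule.sub('', sentence)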

Complete code:

import re
import string

remove_nota = u'[’·°–!"#$%&\'()*+,-./:;<=>?@，。?★、…【】()《》？“”‘’！[\\]^_`{|}~]+'
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def filter_str(sentence):
    # Strip the custom punctuation set, then any remaining ASCII punctuation
    sentence = re.sub(remove_nota, '', sentence)
    sentence = sentence.translate(remove_punctuation_map)
    return sentence.strip()

# Distinguish Chinese, Japanese, Korean and English
def judge_language(s):
    # s = unicode(s)  # Python 2 needs the string converted to unicode; Python 3 does not
    s = filter_str(s)
    result = []
    s = re.sub('[0-9]', '', s).strip()
    # unicode english
    re_words = re.compile(u"[a-zA-Z]")
    res = re.findall(re_words, s)  # find all matching characters
    res2 = re.sub('[a-zA-Z]', '', s).strip()
    if len(res) > 0:
        result.append('en')
        # Nothing left after removing the matches: the text is purely English.
        # Nested under the check above so an empty input is not reported as 'en'.
        if len(res2) <= 0:
            return 'en'
    # unicode chinese
    re_words = re.compile(u"[\u4e00-\u9fa5]+")
    res = re.findall(re_words, s)  # find all matching runs
    res2 = re.sub(u"[\u4e00-\u9fa5]+", '', s).strip()
    if len(res) > 0:
        result.append('zh')
        if len(res2) <= 0:
            return 'zh'
    # unicode korean
    re_words = re.compile(u"[\uac00-\ud7ff]+")
    res = re.findall(re_words, s)  # find all matching runs
    res2 = re.sub(u"[\uac00-\ud7ff]+", '', s).strip()
    if len(res) > 0:
        result.append('ko')
        if len(res2) <= 0:
            return 'ko'
    # unicode japanese katakana and unicode japanese hiragana
    re_words = re.compile(u"[\u30a0-\u30ff\u3040-\u309f]+")
    res = re.findall(re_words, s)  # find all matching runs
    res2 = re.sub(u"[\u30a0-\u30ff\u3040-\u309f]+", '', s).strip()
    if len(res) > 0:
        result.append('ja')
        if len(res2) <= 0:
            return 'ja'
    # Mixed text: report every language that matched, comma-separated
    return ','.join(result)

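The judge_language function takes a string and returns the language it is written in. If several languages are present, it returns all of them as a comma-separated string. It can only detect Chinese, Japanese, English and Korean.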

Let's test it:

s1 = "漢語是世界上最優美的語言，正則表達式是一個很有用的工具"
s2 = "正規表現は非常に役に立つツールテキストを操作することです"
s3 = "あアいイうウえエおオ"
s4 = "?? ???? ?? ??? ?? ???? ???? ????"
s5 = "Regular expression is a powerful tool for manipulating text."
s6 = "Regular expression 正則表達式あアいイうウえエおオ ?? ????"
print(judge_language(s1))
print(judge_language(s2))
print(judge_language(s3))
print(judge_language(s4))
print(judge_language(s5))
print(judge_language(s6))

Output:

zh,ja

en,zh,ko,ja

Because s2 contains Chinese characters (kanji), the output for it includes zh.
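Since Japanese text routinely contains kanji from the same Unicode range as Chinese, one common heuristic is to treat the presence of any kana as a sign of Japanese and only fall back to Chinese when Han characters appear without kana. This is not what judge_language does; it is a sketch of a possible refinement, and `han_or_japanese` is a hypothetical name.

```python
import re

HAN = re.compile(r'[\u4e00-\u9fa5]')                 # same range used for Chinese above
KANA = re.compile(r'[\u3040-\u309f\u30a0-\u30ff]')   # hiragana and katakana

def han_or_japanese(text):
    """Return 'ja' if any kana is present, 'zh' if only Han characters are found, else None."""
    if KANA.search(text):
        return 'ja'
    if HAN.search(text):
        return 'zh'
    return None

print(han_or_japanese("正規表現は非常に役に立つ"))
print(han_or_japanese("正則表達式"))
```

The heuristic has obvious limits: a Japanese phrase written only in kanji is reported as zh, and Chinese text that quotes a few kana is reported as ja.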

2. Detection using packages

(1) langdetect

from langdetect import detect
from langdetect import detect_langs

s1 = "漢語是世界上最優美的語言，正則表達式是一個很有用的工具"
s2 = "正規表現は非常に役に立つツールテキストを操作することです"
s3 = "あアいイうウえエおオ"
s4 = "?? ???? ?? ??? ?? ???? ???? ????"
s5 = "Regular expression is a powerful tool for manipulating text."
s6 = "Regular expression 正則表達式あアいイうウえエおオ ?? ????"
print(detect(s1))
print(detect(s2))
print(detect(s3))
print(detect(s4))
print(detect(s5))
print(detect(s6))  # detect() returns the detected language
print(detect_langs(s6))  # detect_langs() returns all detected languages with their probabilities

Output:

zh-cn

ca # Catalan

[ca:0.7142837837746273, ja:0.2857136751343887]

Hmm... the last sentence is not identified accurately.

(2) langid

import langid

s1 = "漢語是世界上最優美的語言，正則表達式是一個很有用的工具"
s2 = "正規表現は非常に役に立つツールテキストを操作することです"
s3 = "あアいイうウえエおオ"
s4 = "?? ???? ?? ??? ?? ???? ???? ????"
s5 = "Regular expression is a powerful tool for manipulating text."
s6 = "Regular expression 正則表達式あアいイうウえエおオ ?? ????"
print(langid.classify(s1))
print(langid.classify(s2))
print(langid.classify(s3))
print(langid.classify(s4))
print(langid.classify(s5))
print(langid.classify(s6))
# langid.classify(s6) returns the detected language and its confidence score;
# for how the confidence score is computed, see: https://jblevins.org/log/log-sum-exp

Output:

('zh', -370.64875650405884)

('ja', -668.9920794963837)

('ja', -213.35927987098694)

('ko', -494.80780935287476)

('en', -56.482327461242676)

('ja', -502.3459689617157)
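The large negative numbers above are log-domain scores rather than probabilities, which is why they are hard to compare across inputs. The page linked in the code comment describes the log-sum-exp trick for turning such scores into probabilities without numerical underflow. The sketch below illustrates that trick in plain Python. The function name and the per-language scores are hypothetical and are not real langid output, and langid's own normalization (its README describes a `norm_probs` option) may differ in detail.

```python
import math

def normalize_log_scores(log_scores):
    """Turn {label: log score} into {label: probability} using log-sum-exp."""
    top = max(log_scores.values())
    # Subtracting the maximum before exp() avoids underflow to 0.0
    # when every score is very negative.
    log_total = top + math.log(sum(math.exp(v - top) for v in log_scores.values()))
    return {label: math.exp(v - log_total) for label, v in log_scores.items()}

# Hypothetical scores for three candidate languages (assumption, for illustration only)
scores = {'en': -56.5, 'fr': -60.0, 'de': -75.0}
probs = normalize_log_scores(scores)
print(probs)
```

The resulting values sum to 1, so the best label can be read together with a confidence between 0 and 1.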

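Neither package handles the mixed-language last sentence well: in the runs shown, langdetect returned ca and langid returned ja. Both report ISO 639-1 language codes (langdetect additionally uses region-tagged codes such as zh-cn).

Now a few examples in other languages:

s = "ру́сский язы́к"  # Russian
print(detect(s))
print(langid.classify(s))
s = " "  # Arabic
print(detect(s))
print(langid.classify(s))
s = "bonjour"  # French
print(detect(s))
print(langid.classify(s))

Output: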

('ru', -194.25553131103516)

('ar', -72.63771915435791)

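hr # Croatian

('en', -22.992373943328857)

The French word was not identified by either package, and langdetect's result is again rather far off...

Feel free to play around with these two packages when you have time, O(∩_∩)O haha~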
