[Deep Learning] Reading and Classifying Scanned Documents with Deep Learning
Collecting Data
The first thing we need to do is create a simple dataset so we can test every part of our workflow. Ideally, the dataset would contain scanned documents of varying legibility and time periods, along with the high-level topic each document belongs to. I couldn't find a dataset with these exact specifications, so I set about building my own. The high-level topics I settled on were government, letters, smoking, and patents, chosen largely at random because each area offers a wide variety of scanned documents.
From each of these sources I picked roughly 20 reasonably sized documents and placed them into separate folders defined by topic.
After nearly a full day of searching for and cataloging all the images, I resized them all to 600x800 and converted them to PNG format.
The simple resize-and-convert script is below:
```python
import os
from PIL import Image

img_folder = r'F:\Data\Imagery\OCR'  # Folder containing topic folders (i.e. "News", "Letters", etc.)

for subfol in os.listdir(img_folder):              # For each of the topic folders
    sfpath = os.path.join(img_folder, subfol)
    for imgfile in os.listdir(sfpath):             # Get all images in the topic
        imgpath = os.path.join(sfpath, imgfile)
        img = Image.open(imgpath)                  # Read in the image with Pillow
        img = img.resize((600, 800))               # Resize the image
        newip = imgpath[0:-4] + ".png"             # Convert to PNG
        img.save(newip)                            # Save
```

Building the OCR Pipeline
Optical character recognition (OCR) is the process of extracting written text from images. This is usually done with machine learning models, most commonly with pipelines built around convolutional neural networks. While we could train a custom OCR model for our application, it would require far more training data and compute. Instead, we'll use the excellent Microsoft Computer Vision API, which includes a module specifically for OCR. The API call takes an image (as a PIL image) and returns several pieces of information, including the location and orientation of the text on the image as well as the text itself. The following function takes in a list of PIL images and outputs an equally sized list of extracted text:
```python
import io
import json
import http.client
import urllib.parse

def image_to_text(imglist, ndocs=10):
    '''Take in a list of PIL images and return a list of extracted text using OCR'''
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        # Request parameters
        'language': 'en',
        'detectOrientation': 'true',
    })
    outtext = []
    docnum = 0
    for cropped_image in imglist:
        print("Processing document -- ", str(docnum))
        # Cropped image must have both height and width > 50 px to run the Computer Vision API
        imgByteArr = io.BytesIO()
        cropped_image.save(imgByteArr, format='PNG')
        imgByteArr = imgByteArr.getvalue()
        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image:", e)
        outtext.append(' '.join(curr_text))
        docnum += 1
    return(outtext)
```

Post-Processing
Since in some cases we may want to end our workflow here, rather than just keeping the extracted text in memory as one giant list, we can also write it out to individual .txt files with the same names as the original input files. Microsoft's OCR technology, while good, occasionally makes mistakes. We can reduce some of these errors with the SpellChecker module. The following script accepts an input and an output folder, reads all the scanned documents in the input folder, runs them through our OCR script, spell-checks and corrects misspelled words, and finally exports the raw .txt files to the output directory.
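Before the full script, here is the correction idea in isolation. This is a minimal sketch that uses the standard library's `difflib` as a stand-in for the SpellChecker module; the word list is an invented mini-dictionary, not SpellChecker's real frequency list:

```python
import difflib

# Hypothetical mini-dictionary standing in for SpellChecker's word-frequency list
known_words = ["government", "letter", "smoking", "patent", "document"]

def correct_word(word, vocabulary):
    """Return the closest known word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

# Typical OCR garbling: dropped and transposed letters
ocr_output = "goverment patnet docament"
cleaned = ' '.join(correct_word(w, known_words) for w in ocr_output.split())
print(cleaned)  # -> government patent document
```

SpellChecker does the same job with a proper edit-distance model over a large English frequency list, which is why the real script below simply asks it for the single most likely correction per unknown word.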
```python
'''
Read in a list of scanned images (as .png files > 50x50px) and output a set of
.txt files containing the text content of these scans
'''
import os
from PIL import Image
from spellchecker import SpellChecker
from functions import preprocess, image_to_text

INPUT_FOLDER = r'F:\Data\Imagery\OCR2\Images'
OUTPUT_FOLDER = r'F:\Research\OCR\Outputs\AllDocuments'

## First, read in all the scanned document images into PIL images
scanned_docs_path = os.listdir(INPUT_FOLDER)
scanned_docs_path = [x for x in scanned_docs_path if x.endswith('.png')]
scanned_docs = [Image.open(os.path.join(INPUT_FOLDER, path)) for path in scanned_docs_path]

## Second, utilize the Microsoft CV API to extract text from these images using OCR
scanned_docs_text = image_to_text(scanned_docs)

## Third, correct mis-spellings that might have occurred from bad OCR readings
spell = SpellChecker()
for i in range(len(scanned_docs_text)):
    clean = scanned_docs_text[i].split(" ")
    misspelled = spell.unknown(clean)        # unknown() expects a list of words
    for word in range(len(clean)):
        if clean[word] in misspelled:
            clean[word] = spell.correction(clean[word])  # Take the single most likely correction
    scanned_docs_text[i] = ' '.join(clean)

## Fourth, write the extracted text to individual .txt files with the same name as the input files
for k in range(len(scanned_docs_text)):      # For each scanned document
    text = scanned_docs_text[k]
    path = scanned_docs_path[k]              # Get the corresponding input filename
    text_file_path = os.path.join(OUTPUT_FOLDER, path[:-4] + ".txt")
    with open(text_file_path, "wt") as text_file:
        text_file.write(text)                # Write the text to the output text file

print("Done")
```

Preparing Text for Modeling
If our collection of scanned documents is large enough, writing them all into one big folder makes them hard to sort through, and the documents probably already have some implicit grouping. If we roughly know how many different "kinds" of documents, or document topics, we have, we can use topic modeling to help identify them automatically. This gives us the infrastructure to split the text recognized by the OCR into separate folders based on document content; the topic model we'll use is called LDA (Latent Dirichlet Allocation). Running this model requires some more preprocessing and organization of our data, so to keep our scripts from getting long and crowded, we'll assume the scanned documents have already been read and converted to .txt files using the workflow above. The topic model will then read in these .txt files, classify them into however many topics we specify, and place them into the appropriate folders.
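LDA doesn't consume raw strings; it consumes bag-of-words vectors, i.e. (token-id, count) pairs over a fixed vocabulary. A minimal pure-Python sketch of that representation, mirroring what gensim's `Dictionary.doc2bow` produces (the two documents here are invented):

```python
from collections import Counter

docs = [["patent", "claim", "invention", "claim"],
        ["letter", "dear", "sincerely"]]

# Build a vocabulary: token -> integer id (gensim's Dictionary does this)
vocab = {}
for doc in docs:
    for token in doc:
        vocab.setdefault(token, len(vocab))

def doc2bow(doc):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc)
    return sorted(counts.items())

print(doc2bow(docs[0]))  # -> [(0, 1), (1, 2), (2, 1)]
```

Everything the topic model sees is in this form, which is why the preprocessing below matters so much: any junk token that survives becomes a vocabulary entry.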
We'll start with a simple function that reads all the output .txt files in a folder into a list of (filename, text) tuples.
```python
import os

def read_and_return(foldername, fileext='.txt'):
    '''
    Read all text files with fileext from foldername, and place them into a
    list of tuples as [(filename, text), ..., (filename, text)]
    '''
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
    return(alltext)  # Returns list of tuples [(filename, text), ..., (filename, text)]
```

Next, we need to make sure that all the useless words (those that don't help us distinguish the topic of a particular document) are removed. We'll do that with three different methods:
Stopword removal
Removing tags, punctuation, digits, and repeated whitespace
TF-IDF filtering
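The intent behind the TF-IDF step: words that appear in nearly every document carry little topical signal and get weighted down, while words concentrated in a few documents get weighted up. A small pure-Python sketch of the raw computation over an invented three-document corpus (gensim's `TfidfModel` uses a smoothed variant of the same idea):

```python
import math

docs = [["smoking", "health", "report"],
        ["smoking", "ban", "law"],
        ["patent", "law", "claim"]]

def tf_idf(term, doc, corpus):
    """Term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)                 # How often the term appears in this doc
    df = sum(1 for d in corpus if term in d)        # How many docs contain the term
    return tf * math.log(len(corpus) / df)

# "law" appears in 2 of 3 docs -> low weight; "patent" in only 1 -> higher weight
print(round(tf_idf("law", docs[2], docs), 3))     # -> 0.135
print(round(tf_idf("patent", docs[2], docs), 3))  # -> 0.366
```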
To implement all of these (as well as our topic model), we'll use the Gensim package. The script below runs the necessary preprocessing steps on a list of text (the output of the function above) and trains an LDA model.
```python
from gensim import corpora, models
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string

def preprocess(document):
    clean = remove_stopwords(document)
    clean = preprocess_string(clean)   # Strip tags, punctuation, digits and whitespace
    return(clean)

def run_lda(textlist, num_topics=10, preprocess_docs=True):
    '''Train and return an LDA model against a list of documents'''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics,
                                           id2word=dictionary)
    return(lda, dictionary)
```

Classifying Documents with the Model
Once we've trained our LDA model, we can use it to sort our set of training documents (and any future documents that might come in) into topics and then place them in the appropriate folders.
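Under the hood, "classifying" a document with LDA just means taking the topic with the highest probability in the distribution the model returns. A sketch with a hypothetical distribution in the (topic_id, probability) form that gensim produces:

```python
# Hypothetical output of lda[doc_bow] for one document
topic_distribution = [(0, 0.05), (1, 0.72), (2, 0.13), (3, 0.10)]

# Sort descending by probability and keep the top entry
top_topic = sorted(topic_distribution, key=lambda x: x[1], reverse=True)[0]
print(top_topic)  # -> (1, 0.72)
```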
Using a trained LDA model on a new string of text takes a bit of fiddling; all of the complexity is wrapped in the function below:
```python
from gensim.utils import tokenize

def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query (document in the test file), tokenize the query and create a
    feature vector just like how it was done while training, building text_corpus
    '''
    text_corpus = []
    for query in textlist:
        temp_doc = list(tokenize(query.strip()))
        text_corpus.append(temp_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic distribution,
    which can be sorted in descending order to take the very first topic
    '''
    tops = []
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)
```

Finally, we need one more method to get the actual name of a topic from its topic index:
```python
def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = dict(alltopics[topicnum][1])
    return(max(topic, key=lambda key: topic[key]))  # Highest-weight word in the topic
```

Now we can glue all of the functions written above into a single script that accepts an input folder, an output folder, and a topic count. The script reads all the scanned document images in the input folder, writes them out to .txt files, builds an LDA model to find the high-level topics in the documents, and sorts the output .txt files into folders by document topic.
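As a quick illustration before the full script: `topic_label`'s selection logic reduces to a max over word weights. The (word, weight) pairs here are invented, standing in for one entry of `show_topics(formatted=False)`:

```python
# Hypothetical (word, weight) pairs for one topic, as found inside
# ldamodel.show_topics(formatted=False)
topic_words = {"tobacco": 0.041, "smoking": 0.087, "health": 0.033}

# The topic's label is simply its highest-weight word
label = max(topic_words, key=lambda w: topic_words[w])
print(label)  # -> smoking
```

Using the top word as a folder name is a pragmatic choice; a human-curated label per topic index would work just as well.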
```python
#################################################################
# This script takes in an input folder of scanned documents,    #
# reads these documents, separates them into topics, and        #
# outputs raw .txt files into the output folder, separated      #
# by topic                                                      #
#################################################################
import os
import io
import json
import shutil
import http.client
import urllib.parse
import requests
import tqdm
from PIL import Image
from spellchecker import SpellChecker
from gensim import corpora, models
from gensim.utils import tokenize
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string


def filter_for_english(text):
    dict_url = 'https://raw.githubusercontent.com/first20hours/' \
               'google-10000-english/master/20k.txt'
    dict_words = set(requests.get(dict_url).text.splitlines())
    english_words = tokenize(text)
    english_words = [w for w in english_words if w in dict_words]
    english_words = [w for w in english_words if (len(w) > 1 or w.lower() == 'i')]
    return(' '.join(english_words))


def preprocess(document):
    clean = filter_for_english(document)   # Remove non-English words
    clean = remove_stopwords(clean)
    clean = preprocess_string(clean)
    return(clean)


def read_and_return(foldername, fileext='.txt', delete_after_read=False):
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
        if delete_after_read:
            os.remove(filename)
    return(alltext)  # Returns list of tuples [(filename, text), ..., (filename, text)]


def image_to_text(imglist, ndocs=10):
    '''Take in a list of PIL images and return a list of extracted text'''
    headers = {
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        'language': 'en',
        'detectOrientation': 'true',
    })
    outtext = []
    for cropped_image in tqdm.tqdm(imglist, total=len(imglist)):
        # Cropped image must have both height and width > 50 px to run the Computer Vision API
        imgByteArr = io.BytesIO()
        cropped_image.save(imgByteArr, format='PNG')
        imgByteArr = imgByteArr.getvalue()
        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image:", e)
        outtext.append(' '.join(curr_text))
    return(outtext)


def run_lda(textlist, num_topics=10, preprocess_docs=True):
    '''Train and return an LDA model against a list of documents'''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics,
                                           id2word=dictionary)
    return(lda, dictionary)


def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query, tokenize it and create a feature vector just like during
    training; lda[doc_bow] then gives the topic distribution, which is sorted
    in descending order to take the top topic
    '''
    tops = []
    for query in textlist:
        text = list(tokenize(query.strip()))
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)


def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = dict(alltopics[topicnum][1])
    return(max(topic, key=lambda key: topic[key]))


INPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocuments'
OUTPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocumentsByTopic'
TOPICS = 4

if __name__ == '__main__':

    print("Reading scanned documents")
    ## First, read in all the scanned document images into PIL images
    scanned_docs_path = [os.path.join(INPUT_FOLDER, p) for p in os.listdir(INPUT_FOLDER)]
    scanned_docs_path = [p for p in scanned_docs_path if p.endswith('.png')]
    scanned_docs = [Image.open(x) for x in scanned_docs_path]

    ## Second, utilize the Microsoft CV API to extract text from these images using OCR
    scanned_docs_text = image_to_text(scanned_docs)

    print("Post-processing extracted text")
    ## Third, correct mis-spellings that might have occurred from bad OCR readings
    spell = SpellChecker()
    for i in range(len(scanned_docs_text)):
        clean = scanned_docs_text[i].split(" ")
        misspelled = spell.unknown(clean)
        for word in range(len(clean)):
            if clean[word] in misspelled:
                clean[word] = spell.correction(clean[word])  # Take the single most likely correction
        scanned_docs_text[i] = ' '.join(clean)

    print("Writing read text into files")
    ## Fourth, write the extracted text to individual .txt files with the same name as the input files
    for k in range(len(scanned_docs_text)):
        text = filter_for_english(scanned_docs_text[k])
        path = os.path.basename(scanned_docs_path[k])   # Get the corresponding input filename
        text_file_path = os.path.join(OUTPUT_FOLDER, path[:-4] + ".txt")
        with open(text_file_path, "wt") as text_file:
            text_file.write(text)

    print("Reading files")
    ## Fifth, read all the output .txt files back in
    texts = read_and_return(OUTPUT_FOLDER)

    print("Building LDA topic model")
    ## Sixth, train the LDA model (pre-processing is done internally)
    textlist = [t[1] for t in texts]
    ldamodel, dictionary = run_lda(textlist, num_topics=TOPICS)

    print("Extracting topics")
    ## Seventh, extract the top topic for each document
    topics = []
    for t in texts:
        topics.append((t[0], find_topic([t[1]], dictionary, ldamodel)))
    # Convert topic indices to topic names -> [(filename, [topic]), ...]
    for i in range(len(topics)):
        topnum = topics[i][1][0][0]
        topics[i][1][0] = topic_label(ldamodel, topnum)

    print("Copying documents into topic folders")
    ## Eighth, create a folder for each topic found
    foundtopics = set()
    for t in topics:
        foundtopics.update(t[1])
    topicfolders = {os.path.join(OUTPUT_FOLDER, f) for f in foundtopics}
    for m in topicfolders:
        os.makedirs(m, exist_ok=True)

    ## Finally, move each file into its topic folder
    for filename, topic in topics:
        dest = os.path.join(OUTPUT_FOLDER, topic[0], os.path.basename(filename))
        shutil.copyfile(filename, dest)
        os.remove(filename)

    print("Done")
```

The full code for this article is on GitHub:
https://github.com/ShairozS/Scan2Topic