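Tesseract 3.02 OCR Text Recognition: Investigation Notes
- Installation and usage:
Tesseract download page:
https://code.google.com/p/tesseract-ocr/
The latest version at the time of writing is 3.02.
For the Windows build: download and unzip the package, open a command prompt, change to the extracted directory, and run the program from there.
Command format: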
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
Example command:
F:\Tesseract-OCR>tesseract.exe 2013-09-05_154628.jpg eng -l eng -psm 6
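The same command line can also be assembled from a script. The following is a minimal Python sketch, assuming Tesseract 3.02 is installed and `tesseract` can be found on the PATH. The helper names `build_tesseract_command` and `run_ocr` are hypothetical and are not part of Tesseract; only the `-l` and `-psm` options from the usage text above are used.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None, exe="tesseract"):
    """Build the argument list: tesseract imagename outputbase [-l lang] [-psm N]."""
    if psm is not None and not 0 <= psm <= 10:
        raise ValueError("pagesegmode must be between 0 and 10")
    cmd = [exe, image, outputbase]
    # -l and -psm are placed right after outputbase, before any config file names.
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


def run_ocr(image, outputbase, lang="eng", psm=3):
    """Run Tesseract (assumed to be installed); text goes to <outputbase>.txt."""
    subprocess.check_call(build_tesseract_command(image, outputbase, lang, psm))
    return outputbase + ".txt"
```

Calling `build_tesseract_command("2013-09-05_154628.jpg", "eng", lang="eng", psm=6)` reproduces the arguments of the example command above.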
Related commands:
| Function | Command |
| --- | --- |
|  | ambiguous_words.exe |
|  | classifier_tester.exe |
|  | cntraining.exe |
| Combines training files | combine_tessdata.exe |
|  | dawg2wordlist.exe |
|  | mftraining.exe |
|  | shapeclustering.exe |
| Recognition program | tesseract.exe |
|  | unicharset_extractor.exe |
|  | wordlist2dawg.exe |
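- Training the character data
For the data files that are required, refer to the source code:
tesseract-ocr\ccutil\tessdatamanager.h
Format requirements for the configuration files used in training: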
ASCII or UTF-8 encoding without BOM
Unix end-of-line marker ('\n')
The last character must be an end-of-line marker ('\n'). Some text editors show this as an empty line at the end of the file. If you omit it, you will get an error message containing "last_char == '\n':Error:Assert failed..."
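The three format requirements above can be checked before a file is handed to the training tools. The following is a minimal Python sketch; the function names `check_config_bytes` and `check_config_file` are hypothetical helpers for illustration, not part of Tesseract.

```python
def check_config_bytes(data):
    """Return a list of format problems found in the raw bytes of a config file."""
    problems = []
    # Requirement 1: ASCII or UTF-8 without a byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid ASCII/UTF-8")
    # Requirement 2: Unix line endings only, so no carriage returns.
    if b"\r" in data:
        problems.append("file contains non-Unix line endings")
    # Requirement 3: the last character must be '\n'.
    if not data.endswith(b"\n"):
        problems.append("last character is not '\\n'")
    return problems


def check_config_file(path):
    """Read a file in binary mode and run the checks on its contents."""
    with open(path, "rb") as f:
        return check_config_bytes(f.read())
```

An empty list means the file passes all three checks; reading in binary mode matters here, because text mode could silently translate line endings or strip the BOM.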
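Steps:
1. Generate the training images
Some guidelines:
Make sure each character appears about 10 times in general, about 20 times for common characters, and about 5 times for rare ones;
Do not group all the special characters together; use combinations that are closer to real usage;
Very important: keep some spacing between characters and between lines, otherwise training may fail. (This may have been fixed in versions after 3.0.)
Training data must be grouped by font: text in the same font goes into the same tiff file (multi-page tiff files are supported);
Unless the font is very small (less than 15px high), there is no need to train at different sizes;
Never mix multiple fonts in the same image file.
(See the boxtiff sample files on the download page.)
Next print and scan (or use some electronic rendering method) to create an image of your training page. Up to 32 training files can be used (of multiple pages). It is best to create a mix of fonts and styles (but in separate files), including italic and bold.
Generate the tiff file.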
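The frequency guideline in step 1 can be checked on the training text before it is rendered to an image. The following is a minimal Python sketch; the function name `rare_characters` is hypothetical, and the default threshold of 5 is taken from the lowest figure in the guidelines above.

```python
from collections import Counter


def rare_characters(training_text, minimum=5):
    """Return non-whitespace characters that occur fewer than `minimum` times."""
    counts = Counter(ch for ch in training_text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < minimum}
```

Any character reported here appears too rarely in the training text and would need more occurrences added before the tiff file is generated.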
2. Create the box file
Command to generate the box file:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
Example:
tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
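The `[lang].[fontname].exp[num]` naming pattern is easy to get wrong by hand when several fonts are involved. The following is a minimal Python sketch that builds the file names and the makebox argument list; `training_basename` and `makebox_command` are hypothetical helper names, and the sketch only constructs the command without running it.

```python
def training_basename(lang, fontname, num):
    """Return the shared base name, e.g. eng.timesitalic.exp0."""
    return "{}.{}.exp{}".format(lang, fontname, num)


def makebox_command(lang, fontname, num, exe="tesseract"):
    """Build: tesseract <base>.tif <base> batch.nochop makebox"""
    base = training_basename(lang, fontname, num)
    return [exe, base + ".tif", base, "batch.nochop", "makebox"]
```

For example, `makebox_command("eng", "timesitalic", 0)` yields the same arguments as the example command above, and the list can be passed to `subprocess.check_call` if Tesseract is installed.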
3. Obtain a new character set
- Other
Reference documents:
The doc directory of the extracted package contains the API documentation.
--end--
Reposted from: https://www.cnblogs.com/rakuhin/p/3303720.html