OpenVINO Series 15: OpenVINO OCR
This article explains how to use the OpenVINO OCR models for text detection and text recognition. Overall, after trying it out, the OCR module provided by OpenVINO is only mediocre: it can only recognize digits and Latin letters, special characters hurt recognition accuracy, and it is also fairly sensitive to text angle and image resolution.
- Text detection model: horizontal-text-detection-0001.
- Text recognition model: text-recognition-0014.
Environment:
- Runtime environment for this demo: Windows 10, 10th-gen Intel Core i5 laptop
- IDE: VSCode
- OpenVINO version: 2022.1
- Code link: 11-OCR
Table of contents
- OpenVINO Series 15: OpenVINO OCR
- 1. About the models
- 1.1 Pre-trained text detection models
- 1.2 A quick review of FCOS
- 1.3 A quick review of PixelLink
- 1.4 Pre-trained text recognition models
- 1.5 Final choice
- 2. Code
- 2.1 Downloading the models
- 2.2 Text detection model
- 2.3 Text recognition model
- 3. Results
1. About the models
OpenVINO's Model Zoo provides many pre-trained models.
1.1 Pre-trained text detection models
For text detection, the Model Zoo provides the following models:
- horizontal-text-detection-0001
- text-detection-0003
- text-detection-0004
| | horizontal-text-detection-0001 | text-detection-0003 | text-detection-0004 |
| --- | --- | --- | --- |
| Description | based on the FCOS architecture with a MobileNetV2-like backbone | based on the PixelLink architecture with a MobileNetV2-like backbone | based on the PixelLink architecture with a MobileNetV2 (depth_multiplier=1.4) backbone |
| Input | [1,3,704,704], i.e. [1,C,H,W] | [1,768,1280,3], i.e. [B,H,W,C] | [1,768,1280,3], i.e. [B,H,W,C] |
| Output 1 | boxes: [N,5], where N is the number of detected bounding boxes; each box is in the format [x_min, y_min, x_max, y_max, conf] | model/link_logits_/add: [1,192,320,16], logits for the links between each pixel and its neighbors | model/link_logits_/add: [1,192,320,16], logits for the links between each pixel and its neighbors |
| Output 2 | labels: [N], where N is the number of detected bounding boxes; for text detection every detected box has the label 0 | model/segm_logits/add: [1,192,320,2], logits for the text/no-text classification of each pixel | model/segm_logits/add: [1,192,320,2], logits for the text/no-text classification of each pixel |
B - batch size; C - number of channels; H - image height; W - image width.
1.2 A quick review of FCOS
The horizontal-text-detection-0001 model is trained with FCOS, so let us briefly review FCOS (Fully Convolutional One-Stage Object Detection).
FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. As shown in the network structure figure below, it consists of three parts: a backbone CNN, an FPN-style feature pyramid, and the detection heads.
Following FPN, objects of different sizes are detected on feature maps at different levels. Specifically, five levels of feature maps are used, denoted $\{P_3, P_4, P_5, P_6, P_7\}$. $P_3$, $P_4$, $P_5$ are obtained from the backbone feature maps $C_3$, $C_4$, $C_5$ through 1x1 lateral convolutions. $P_6$ and $P_7$ are obtained by applying a stride-2 convolution to $P_5$ and $P_6$, respectively. As a result, $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ correspond to strides 8, 16, 32, 64, and 128.
The Head on the right is the key part of FCOS. Each feature level is split into two branches: the upper branch does classification and the lower branch regresses the box position. The classification branch also carries a Center-ness branch that predicts how close a location is to the object center. Unlike the traditional center-plus-width/height or corner-coordinate encodings, FCOS predicts the box from each location via a 4D vector $(l, t, r, b)$: the distances from that location to the left, top, right, and bottom borders of the box.
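To make the $(l, t, r, b)$ encoding concrete, here is a minimal NumPy sketch (not the model's actual code) of how the regression target and the center-ness score could be computed for a single feature-map location; the helper name and the example numbers are made up for illustration.

```python
import numpy as np

def fcos_ltrb_target(point, gt_box):
    """point: (x, y) in image coordinates; gt_box: (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = gt_box
    # Distances from the location to the four borders of the ground-truth box
    l, t = x - x0, y - y0
    r, b = x1 - x, y1 - y
    target = np.array([l, t, r, b], dtype=np.float32)
    # The location is a positive sample only if it falls inside the box,
    # i.e. all four distances are positive.
    is_positive = bool((target > 0).all())
    # Center-ness down-weights locations far from the box center.
    centerness = (
        np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b))) if is_positive else 0.0
    )
    return target, is_positive, centerness

print(fcos_ltrb_target((100, 80), (60, 40, 200, 160)))
```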
Finally, note that in FCOS any feature-map location that falls inside a ground-truth bbox counts as a positive sample, so the number of positive samples used for training is very large.
We will not go into the cost function here; this is just a quick recap of the overall logic of FCOS.
1.3 A quick review of PixelLink
The algorithm behind text-detection-0003 and text-detection-0004 is based on PixelLink: Detecting Scene Text via Instance Segmentation. Here is a brief review of PixelLink.
A typical deep-learning text detector has two jobs: decide whether a region is text, and output the position and angle of the text box, as shown below:
The FCOS model from the previous section is not a dedicated text detector, but the overall logic is similar: at the end there is one regression head and one classification head.
PixelLink has two main components: pixels and links. Built on a CNN, PixelLink predicts, for every pixel, a text/non-text classification, and for each of the pixel's 8 neighboring directions, whether a link exists (the eight heatmaps inside the dashed box in the figure above, one per direction).
The backbone of PixelLink is VGG16 used as a feature extractor, with the final fully connected layers fc6 and fc7 replaced by convolutions. Feature fusion and pixel-wise prediction follow the FPN idea (feature pyramid network): the feature maps halve in size layer by layer while the number of kernels doubles. The model has two independent heads, one for text/non-text prediction and one for link prediction; both use softmax and output 1x2=2 channels (text/non-text classification) and 8x2=16 channels (link/no-link classification for the 8 neighboring directions), respectively.
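To make that output layout concrete, below is a hedged sketch of how the link and text/no-text logits described in the table above could be turned into per-pixel probabilities. The reshape into (8 neighbors × 2 classes) is an assumption based on the PixelLink formulation rather than code taken from OpenVINO, and the tensors here are random placeholders standing in for real model outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Placeholder outputs with the shapes from the Model Zoo table
link_logits = np.random.randn(1, 192, 320, 16).astype(np.float32)
segm_logits = np.random.randn(1, 192, 320, 2).astype(np.float32)

# Text/non-text probability per pixel; channel 1 is assumed to be "text"
text_prob = softmax(segm_logits)[0, :, :, 1]                                # (192, 320)

# Link probability per pixel and neighbor: 16 channels = 8 neighbors x 2 classes
link_prob = softmax(link_logits.reshape(1, 192, 320, 8, 2))[0, :, :, :, 1]  # (192, 320, 8)

# PixelLink then thresholds both maps and unions linked text pixels into
# connected components; each component becomes one text instance.
text_mask = text_prob > 0.8
link_mask = link_prob > 0.8
print(text_mask.shape, link_mask.shape)
```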
1.4 Pre-trained text recognition models
For text recognition, the Model Zoo provides the following models:
- text-recognition-0012
- text-recognition-0014
- text-recognition-resnet-fc
| | text-recognition-0012 | text-recognition-0014 | text-recognition-resnet-fc |
| --- | --- | --- | --- |
| Description | VGG16-like backbone and bidirectional LSTM encoder-decoder | ResNext101-like backbone (stage 1-2) and bidirectional LSTM encoder-decoder | model based on ResNet with a fully connected text recognition head |
| Accuracy on the ICDAR13 dataset | 88.18% | 88.87% | 92.96% |
| Input | [1,32,120,1], i.e. [B,H,W,C] | [1,1,32,128], i.e. [B,C,H,W] | [1,1,32,100], i.e. [B,C,H,W] |
| Note | the source image should be a tightly aligned crop of the detected text, converted to grayscale | the source image should be a tightly aligned crop of the detected text, converted to grayscale | the source image should be a tightly aligned crop of the detected text, converted to grayscale; mean values: [127.5, 127.5, 127.5], scale factor for each channel: 127.5 |
| Output | [30,1,37], i.e. [W,B,L]; order of L: 0123456789abcdefghijklmnopqrstuvwxyz# | [16,1,37], i.e. [W,B,L]; order of L: #0123456789abcdefghijklmnopqrstuvwxyz | [1,26,37], i.e. [B,W,L]; order of L: [s]0123456789abcdefghijklmnopqrstuvwxyz |
B - batch size; C - number of channels; H - image height; W - image width (for inputs) or output sequence length (for outputs); L - confidence distribution across alphanumeric symbols.
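As a rough illustration of what the [W, B, L] output means, the sketch below runs a generic greedy, CTC-style decode over a placeholder output. The alphabet string and the blank handling here are assumptions that should be checked against each model's Model Zoo documentation; the decoding actually used in this article appears in section 2.3.

```python
import numpy as np

def greedy_decode(logits, alphabet="#0123456789abcdefghijklmnopqrstuvwxyz", blank="#"):
    # logits: (W, B, L); take batch 0 and pick the most likely symbol per time step
    indices = logits[:, 0, :].argmax(axis=-1)
    chars, prev = [], None
    for idx in indices:
        symbol = alphabet[idx]
        if symbol != blank and symbol != prev:  # collapse repeats, drop blanks
            chars.append(symbol)
        prev = symbol
    return "".join(chars)

dummy = np.random.randn(16, 1, 37).astype(np.float32)  # placeholder for a real model output
print(greedy_decode(dummy))
```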
1.5 Final choice
In the end we choose:
- Text detection model: horizontal-text-detection-0001.
- Text recognition model: text-recognition-0014.
2. Code
2.1 Downloading the models
First, as with the other models, we download the models.
```python
import shutil
import sys
from pathlib import Path

import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display
from PIL import Image, ImageOps
from openvino.runtime import Core
from yaspin import yaspin

ie = Core()

model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"
# base_model_dir = Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
# omz_cache_dir = Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()
model_dir.mkdir(exist_ok=True)

'''
Download the models
'''
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: "
      "text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")
ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")

if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                       f"--name {detection_model},{recognition_model} " \
                       f"--output_dir {base_model_dir} " \
                       f"--cache_dir {omz_cache_dir} " \
                       f"--precision {precision}"
    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")
```
2.2 Text detection model
The steps are:
- load the detection model horizontal-text-detection-0001;
- load the image and resize it to match the model's input size;
- run inference and return the detection results.
First, we load the detection model and take a look at its inputs and outputs:
print("2 - Load detection Model: horizontal-text-detection-0001")detection_model = ie.read_model(model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin") ) detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU")detection_input_layer = detection_compiled_model.input(0) detection_output_layer_box = detection_compiled_model.output('boxes') detection_output_layer_label = detection_compiled_model.output('labels')print("- Input of detection model shape: {}".format(detection_input_layer)) print("- Output `box` of detection model shape: {}".format(detection_output_layer_box)) print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))Terminal打印:
```
2 - Load detection Model.
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>
```
Next, we load the image and resize it to match the model's input shape.
print("3 - Load Image and resize into model input shape.")# Read the image image = cv2.imread("data/label4.png") print("- Input image size: {}".format(image.shape)) # N,C,H,W = batch size, number of channels, height, width N, C, H, W = detection_input_layer.shape# Resize image to meet network expected input sizes resized_image = cv2.resize(image, (W, H))# Reshape to network input shape input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0) print("- Input image is resized (with padding) into: {}".format(input_image.shape))plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));Terminal打印:
```
3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized (with padding) into: (1, 3, 704, 704)
```
The inference code is as follows:
```python
'''
### Model inference
Text boxes are detected in the image and returned as a blob of shape `[100, 5]`.
Each detection is described in the format `[x_min, y_min, x_max, y_max, conf]`.
'''
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]

# Remove zero-only boxes
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))
```
Terminal output:
```
4 - Detection model inference.
- Detect 4 boxes.
```
2.3 Text recognition model
Loading and running the text recognition model follows the same steps as the text detection model, so here is the code:
```python
def multiply_by_ratio(ratio_x, ratio_y, box):
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]

def run_preprocesing_on_crop(crop, net_shape):
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img

def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}

    # Fetch image shapes to calculate the ratio
    (real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

    # Convert the base image from BGR to RGB format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick the confidence factor from the last place in the array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply the position of each box by the x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))

            # Draw the box; parameters in the rectangle function are:
            # image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)

            # Add text to the image; parameters in the putText function are:
            # image, text, bottomleft_corner_textfield, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create a background box based on the annotation length
                (text_w, text_h), _ = cv2.getTextSize(f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1)
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Add a weighted image copy with white boxes under the text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )
    return rgb_image

print("5 - Load Recognition Model: text-recognition-0014")

recognition_model = ie.read_model(
    model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)
recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")

recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)

# Get the height and width of the input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape

print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))

'''
Model inference
'''
# Calculate the scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

# Convert the image to grayscale for the text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Symbol set used to decode the output, based on the model documentation
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"

# Prepare empty lists for annotations and crops
annotations = list()
cropped_images = list()
# fig, ax = plt.subplots(len(boxes), 1, figsize=(5,15), sharex=True, sharey=True)
# For each crop, based on the boxes given by the detection model, we want to get an annotation
for i, crop in enumerate(boxes):
    # Get the coordinates of the corners of the crop
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))

    # Run inference with the recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]

    # Squeeze the output to remove unnecessary dimensions
    recognition_results_test = np.squeeze(result)

    # Read the annotation based on probabilities from the output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]
        # Detected digits come out off by one for this model, so shift them (10 wraps back to 0)
        if parsed_letter.isnumeric():
            parsed_letter = int(parsed_letter)
            parsed_letter = parsed_letter + 1
            if parsed_letter == 10:
                parsed_letter = 0
            parsed_letter = str(parsed_letter)
        # Returning index 0 from argmax signals the end of the string
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)

boxes_with_annotations = list(zip(boxes, annotations))
```
3. Results
I tried a few images and, honestly, the results are mediocre, not as good as Tesseract. See the image below:
總結
以上是生活随笔為你收集整理的openvino系列 15. OpenVINO OCR的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 用python判断你是青少年还是老年人
- 下一篇: 考研联系导师全攻略!