A “Visual Turing Test” for Modern AI Systems
Visual Question Answering (VQA) is a fascinating research field at the intersection of computer vision and language understanding.
In this post we will take a look at existing datasets, examine potential approaches and applications, and present a prototype in which the user can choose an image the algorithm has not seen before and ask questions about it.
What is VQA?
Visual Question Answering approaches are designed to handle the following task: Given an image and a natural language question about the image, the VQA model needs to provide an accurate natural language answer.
This is by nature a multi-discipline research problem. It consists of the following sub-tasks:
· Computer Vision (CV)
· Natural Language Processing (NLP)
· Knowledge Representation & Reasoning
That’s why some authors refer to Visual Question Answering as a “Visual Turing Test” for modern AI systems.
This screenshot from my prototype illustrates how a VQA system works. Note that the user has chosen an image the algorithm has not seen during training and asks questions about it.
Prototype screenshot
Datasets
Most of the existing datasets contain triples made of an image, a question and its correct answer. Some publicly available datasets, on the other hand, provide extra information such as image captions, image regions represented as bounding boxes, or multiple-choice candidate answers.
The available VQA datasets can be categorized based on three factors:
· type of images (natural, clip-art, synthetic)
· question–answer format (open-ended, multiple-choice)
· use of external knowledge
The following table shows an overview of the available datasets:
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)
For our prototype we make use of the VQA dataset with natural images and open-ended questions. It is one of the most popular datasets and is also used for the annual VQA competition. The dataset we use consists of 443,757 image–question pairs for training and 214,354 pairs for validation. It can be downloaded here.
Example of an annotated image–question pair
One special characteristic of the VQA dataset is that the annotations, i.e. the answers provided for a specific image–question pair, are not unique. The answers have been collected via Amazon Mechanical Turk, and for each image–question pair ten answers are supplied, which may all be identical but can also differ. The screenshot on the left shows an example.
Approaches & Architectures
The basic architecture as shown below consists of three main elements:
· Image feature extraction
· Question feature extraction
· Fusion model + classifier to merge the features
Source: https://arxiv.org/abs/1610.01465
Image feature extraction
Image feature extraction describes the method of transforming an image into a numerical vector to enable further computational processing.
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)
Convolutional neural networks (CNNs) have established themselves as the state-of-the-art approach. VQA architectures generally use already pre-trained CNN models by applying transfer learning. The chart shows the utilization rates of different architectures across several VQA research papers.
In the prototype we use the VGG16 architecture, which takes 224 × 224 pixel images as input and outputs a 4096-dimensional vector, as sketched below.
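The following is a minimal sketch of this image-feature path, assuming a Keras/TensorFlow setup; the cut at the "fc2" layer and the preprocessing follow the standard pre-trained VGG16, but the exact code in the prototype may differ:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Load VGG16 pre-trained on ImageNet and cut it at the second fully
# connected layer ("fc2"), which outputs the 4096-dimensional vector.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_image_features(path):
    img = image.load_img(path, target_size=(224, 224))   # VGG16 input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]                       # shape: (4096,)
```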
Question feature extraction
Source: Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, published in Artificial Intelligence Review (2020)
To extract question features, multiple approaches have been developed, ranging from count-based methods such as one-hot encoding and bag-of-words to text embedding methods such as long short-term memory (LSTM) or gated recurrent units (GRU). The diagram below illustrates the utilization rate of these approaches in the research.
For our prototype we use the most popular approach, an LSTM fed with Word2Vec representations of the individual words. The LSTM model outputs a 512-dimensional vector, as sketched below.
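A hedged sketch of this question path, again in Keras: each question becomes a sequence of pre-trained Word2Vec vectors and is run through an LSTM whose final state serves as the 512-dimensional question feature. The maximum question length and the 300-dimensional embedding size (the common Google News Word2Vec size) are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

MAX_QUESTION_LEN = 30  # assumed upper bound on tokens per question
W2V_DIM = 300          # assumed Word2Vec embedding size

# Input: a padded sequence of Word2Vec word vectors; the LSTM's final
# hidden state is the fixed-size question representation.
q_input = Input(shape=(MAX_QUESTION_LEN, W2V_DIM))
q_features = LSTM(512)(q_input)  # 512-dimensional question vector
question_encoder = Model(q_input, q_features)
```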
Fusion model + classifier
To fuse the two feature vectors, several basic approaches exist, including point-wise multiplication or addition, and concatenation. More advanced architectures use Canonical Correlation Analysis (CCA) or end-to-end models with a Multimodal Compact Bilinear Pooling (MCB) layer.
Coverage of questions by most frequent answers
In our prototype we use simple concatenation followed by a softmax classifier over the 1,000 most common answers, as sketched below. This approach is suitable because more than 95% of the questions have at least one annotation that is covered by the 1,000 most common answers (see graph on the left).
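A minimal sketch of this fusion step: the 4096-dimensional image vector and the 512-dimensional question vector are concatenated and fed to a softmax over the 1,000 most frequent answers. The intermediate dense layer size is an illustrative assumption, not necessarily what the prototype uses:

```python
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

img_in = Input(shape=(4096,))   # VGG16 fc2 features
q_in = Input(shape=(512,))      # LSTM question features

merged = Concatenate()([img_in, q_in])              # simple concatenation
hidden = Dense(1024, activation="relu")(merged)     # assumed hidden size
answer = Dense(1000, activation="softmax")(hidden)  # top-1,000 answers

vqa_model = Model([img_in, q_in], answer)
vqa_model.compile(optimizer="adam", loss="categorical_crossentropy")
```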
More advanced approaches
In the recent past, more sophisticated architectures have been developed, with attention-based approaches being the most popular. Here, the idea is to focus the algorithm on the most relevant parts of the input. For example, if the question is “What is the color of the ball?”, the region of the image containing the ball is more relevant than the others; likewise, “color” and “ball” are more informative than the rest of the words in the question.
The most common choice in VQA is to use spatial attention to generate region-specific features for training the convolutional neural network.
Two common methods to obtain spatial attention are to either project a grid over the image and determine the relevance of each region from the specific question, or to automatically generate bounding boxes in the image and use the question to determine the relevance of the features in each box.
The use of an attention-based approach goes beyond the scope of our prototype, but the toy sketch below illustrates the grid-based idea.
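To make the grid variant concrete, here is a toy NumPy sketch (not part of the prototype; all shapes and the single bilinear scoring matrix are illustrative assumptions): the convolutional feature map is treated as a 14 × 14 grid of region vectors, each region is scored against the question vector, and the softmax-weighted sum becomes the attended image feature:

```python
import numpy as np

def grid_attention(conv_features, question_vec, W):
    """conv_features: (14, 14, 512) grid; question_vec: (512,); W: (512, 512)."""
    regions = conv_features.reshape(-1, 512)   # 196 region vectors
    scores = regions @ W @ question_vec        # relevance of each region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the 196 regions
    return weights @ regions                   # (512,) attended image feature
```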
Evaluation
Due to the variety of datasets, it is not surprising that multiple approaches to evaluating the performance of the algorithms exist. In a multiple-choice setting, there is just a single right answer for every question, so the assessment can easily be quantified by the mean accuracy over test questions. In an open-ended setting, though, several answers to a particular question could be correct due to synonyms and paraphrasing.
In such cases, metrics that measure how much a predicted answer differs from the ground truth based on their semantic meaning can be used. The Wu-Palmer Similarity (WUPS) is one such example.
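Wu-Palmer similarity can be computed from the WordNet hierarchy, e.g. via NLTK; a WUPS-style metric would score predicted against ground-truth answers this way rather than demanding an exact string match (the word pair below is just an illustration):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

# Compare a predicted answer with a ground-truth answer semantically.
pred, truth = wn.synsets("kitten")[0], wn.synsets("cat")[0]
print(pred.wup_similarity(truth))  # close to 1.0 for semantically similar answers
```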
As the VQA dataset works with very short answers, a consensus metric defined as Accuracy_VQA = min(n/3, 1) is used, where n is the number of the ten annotated answers that match the predicted one; i.e. 100% accuracy is achieved when the predicted answer matches at least 3 out of the 10 annotations.
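Written out as code, the consensus metric is a direct transcription of the formula above:

```python
def vqa_accuracy(predicted, human_answers):
    """human_answers: the 10 annotations collected for an image-question pair."""
    n = sum(answer == predicted for answer in human_answers)
    return min(n / 3, 1.0)

# e.g. vqa_accuracy("yes", ["yes"] * 4 + ["no"] * 6) -> 1.0
```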
The diagram shows the accuracy as defined above for the different question types:
Evaluation results on the validation set
Potential applications of VQA
VQA systems offer a vast number of potential applications. One of the most socially relevant and direct applications is helping blind and visually impaired users communicate with pictures. Furthermore, VQA can be integrated into image retrieval systems, which can be used commercially on e-commerce sites to attract customers by giving more exact results for their search queries. Incorporating VQA may also increase the popularity of online educational services by allowing learners to interact with images. Another application of VQA is in the field of data analysis, where it can help analysts summarize the available visual data.
Closing thoughts
VQA is a research field that requires the understanding of both text and vision. The current performance of these systems still lags behind human decisions, but since deep learning techniques are improving significantly in both Natural Language Processing and Computer Vision, we can reasonably expect VQA to achieve higher and higher accuracy. Progress will be further driven by contests such as the VQA challenge hosted on visualqa.org.
If you would like to dive deeper into this topic, you can find the code of the prototype in my GitHub repo here. Any feedback on the approach or the code is highly appreciated.
Further recommended readings include:
· Visual question answering: a state-of-the-art review, Sruthy Manmadhan & Binsu C. Kovoor, Artificial Intelligence Review (2020)
· VQA: Visual Question Answering, Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Original article: https://medium.com/@frank.merwerth/a-visual-turing-test-for-modern-ai-systems-de7530416e57