A Few Thoughts on Parsing Text
Yesterday I wrote about three course modules in Oslo, and the fact that most of the presentation material is online. Today I am writing about one lesson in that curriculum, on ‘Parsing’. First I will share a few general thoughts. Treat this as a set of learning notes based on that presentation.
In a discussion on Stack Overflow there were several interesting answers to the question of what parsing is:
“I’d explain parsing as the process of turning some kind of data into another kind of data. In practice, for me this is almost always turning a string, or binary data, into a data structure inside my Program. For example, turning:
":Nick!User@Host PRIVMSG #channel :Hello!"
into (C)
struct irc_line {
    char *nick;
    char *user;
    char *host;
    char *command;
    char **arguments;
    char *message;
} sample = { "Nick", "User", "Host", "PRIVMSG", { "#channel" }, "Hello!" }”
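To make the quoted idea concrete, here is a minimal Python sketch that turns the same IRC line into a data structure. The names `IrcLine` and `parse_irc_line` are my own, and the function only handles the simple `:nick!user@host COMMAND args :message` shape from the quote, not the full IRC protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IrcLine:
    """Hypothetical Python counterpart of the quoted C struct."""
    nick: Optional[str]
    user: Optional[str]
    host: Optional[str]
    command: str
    arguments: List[str] = field(default_factory=list)
    message: Optional[str] = None


def parse_irc_line(line: str) -> IrcLine:
    """Parse a line shaped like ':nick!user@host COMMAND arg... :message'."""
    nick = user = host = None
    rest = line.strip()
    if rest.startswith(":"):
        # The prefix runs up to the first space: nick!user@host
        prefix, rest = rest[1:].split(" ", 1)
        nick, _, userhost = prefix.partition("!")
        user, _, host = userhost.partition("@")
        user, host = user or None, host or None
    # Everything after the first ' :' is the free-text message.
    head, sep, message = rest.partition(" :")
    parts = head.split()
    return IrcLine(nick, user, host, parts[0], parts[1:], message if sep else None)


sample = parse_irc_line(":Nick!User@Host PRIVMSG #channel :Hello!")
print(sample)
```

The string goes in and a structured record comes out, which is the "data into another kind of data" view of parsing in the quote.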
Another user explained it as:
“Parsing is the process of analyzing text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar. The parser then builds a data structure based on the tokens. This data structure can then be used by a compiler, interpreter or translator to create an executable program or library.”
Even providing a model:
[Figure: model of the parsing process, source: wikimedia.org]
Slightly more complicated, but perhaps accurate:
“In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language’s grammar). It is an informal name for the syntactic analysis process.”
The same user also argued what parsing is not:
Parsing is not transform one thing into another. Transforming A into B, is, in essence, what a compiler does. Compiling takes several steps, parsing is only one of them.
Parsing is not extracting meaning from a text. That is semantic analysis, a step of the compiling process.
Wikipedia defines it as:
“Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).”
So we have several general explanations that work. However, we might want to consider parsing more closely.
There is also a video accompanying the lesson, although it is in Norwegian.
Dependency parsing
There are relationships between words, i.e., dependency relations.
A dependency structure can be defined as a labeled, directed graph G.
[Figure: slide from the IN2110 presentation]
The principles outlined in the IN2110 presentation are as follows:
“Syntactic structure is complete (Connectedness).
Syntactic structure is hierarchical (Acyclicity).
Every word has at most one syntactic head (Single-Head).”
Connectedness can be enforced by adding a special root node (node 0).
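The three principles can be checked mechanically. The sketch below is my own illustration, not code from the IN2110 material: it assumes tokens are numbered 1..n, the special root is node 0, and arcs are given as (head, dependent) pairs. The function name `check_dependency_graph` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def check_dependency_graph(n_tokens: int, arcs: List[Tuple[int, int]]) -> Dict[str, bool]:
    """Check connectedness, acyclicity and single-head for (head, dependent) arcs."""
    nodes = range(n_tokens + 1)  # node 0 is the artificial root
    children = defaultdict(list)
    neighbours = defaultdict(set)
    head_count = defaultdict(int)
    for head, dep in arcs:
        children[head].append(dep)
        neighbours[head].add(dep)
        neighbours[dep].add(head)
        head_count[dep] += 1

    # Single-head: no node has more than one incoming arc.
    single_head = all(head_count[n] <= 1 for n in nodes)

    # Connectedness: every node is reachable from the root when
    # arc direction is ignored (weak connectivity).
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    connected = len(seen) == n_tokens + 1

    # Acyclicity: repeatedly remove nodes without incoming arcs
    # (Kahn's algorithm); nodes on a cycle can never be removed.
    indegree = {n: head_count[n] for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    acyclic = removed == n_tokens + 1

    return {"connected": connected, "acyclic": acyclic, "single_head": single_head}


# "She reads books": root -> reads, reads -> She, reads -> books
print(check_dependency_graph(3, [(0, 2), (2, 1), (2, 3)]))
```

A graph that passes all three checks with the root included is a tree rooted at node 0.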
Treebanks: Universal Dependencies
“Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing more than 150 treebanks in 90 languages. If you’re new to UD, you should start by reading the first part of the Short Introduction and then browsing the annotation guidelines.”
[Figure: slide from the IN2110 presentation]
(Degrees of) Cross-Linguistic Consistency
On this topic there is an interesting paper from Google Research that may be worth checking out.
Sentences across certain languages could all, for example, start with a capital letter and end with punctuation.
[Figure: slide from the IN2110 presentation]
Back in the days (90s)
How was parsing different in the 1990s?
- Parsers assigned linguistically detailed syntactic structures (based on linguistic theories).
- Grammar-driven parsing: possible trees defined by the grammar.
- Problems with coverage.
- Only around 70% of all sentences were assigned an analysis.
- Most sentences were assigned very many analyses by a grammar, and there was no way of choosing between them.
Enter data-driven (statistical) parsing
Compared to this, what is modern parsing like in 2020?
- Today data-driven/statistical parsing is available for a range of languages and syntactic frameworks.
- Data-driven approaches: possible trees defined by the treebank (may also involve a grammar).
- Produce one analysis (hopefully the most likely one) for any sentence and get most of them correct.
- Still an active field of research, improvements are still possible.
Further to this, what is data-driven dependency parsing?
Data-driven dependency parsing
- M is defined by formal conditions on dependency graphs (labeled directed graphs that are): connected; acyclic; single-head; (projective).
- Other aspects may be defined in different ways: the parsing method (deterministic, non-deterministic), the machine learning algorithm, and the feature representations.
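The last condition in the list, projectivity, is the least obvious of the four. As a sketch under my own assumptions (a tree stored as a list `heads`, where `heads[i]` is the head of token i and index 0 is the artificial root with head `None`), an arc is projective when every token strictly between head and dependent is dominated by the head. The function name `is_projective` is hypothetical, and the input is assumed to already be an acyclic, single-head tree.

```python
from typing import List, Optional


def is_projective(heads: List[Optional[int]]) -> bool:
    """Return True if every arc in the tree is projective.

    heads[i] is the head of token i; heads[0] is None (artificial root).
    Assumes the input is a well-formed tree (acyclic, single-head).
    """
    def dominated_by(node: Optional[int], ancestor: int) -> bool:
        # Walk up the head chain from node until the ancestor or the top is reached.
        while node is not None:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = sorted((head, dep))
        # Every token strictly between head and dependent must descend from head.
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


print(is_projective([None, 2, 0, 2]))     # arcs 0->2, 2->1, 2->3
print(is_projective([None, 3, 0, 2, 2]))  # arc 3->1 spans token 2, which 3 does not dominate
```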
Two main approaches:
Transition-based models.
The IN2110 lecture focuses on transition-based approaches.
Transition-based approaches.
Basic idea: define a transition system for mapping a sentence to its dependency graph.
Learning: induce a model for predicting the next state transition, given the transition history.
Parsing: Construct the optimal transition sequence, given the induced model.
An Adaptation of Shift–Reduce Parsing
- Originally developed for non-ambiguous languages: deterministic.
- Shift (‘read’) tokens from input buffer, one at a time, left-to-right.
- Compare top n symbols on stack against rule RHS: reduce to LHS.
- Dependencies: create arcs between top of stack and front of buffer.
Architecture: Stack and Buffer Configurations.
[Figure: slide from the IN2110 presentation]
So this is the workspace one has to navigate when parsing:
- Transition system ensures formal wellformedness of dependency trees;
- The arc-eager system can generate all projective trees (and only those);
- A specific sequence of transitions determines the final parsing result.
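To illustrate how a sequence of transitions determines the result, here is a minimal sketch of the arc-eager transition system as it is commonly described (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE over a stack and a buffer). It is not taken from the IN2110 slides: the function name `run_arc_eager` and the tuple encoding of transitions are my own assumptions, and the transition sequence is supplied by hand rather than predicted by a trained model.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, str, int]  # (head, label, dependent)


def run_arc_eager(n_tokens: int,
                  transitions: List[Tuple[str, Optional[str]]]) -> List[Arc]:
    """Apply arc-eager transitions to tokens 1..n and return the arcs created."""
    stack = [0]                               # node 0 is the artificial root
    buffer = list(range(1, n_tokens + 1))
    arcs: List[Arc] = []
    has_head = set()
    for action, label in transitions:
        if action == "SHIFT":
            # Move the front of the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # Front of buffer becomes head of stack top; stack top is popped.
            dep = stack[-1]
            if dep == 0 or dep in has_head:
                raise ValueError("LEFT-ARC not permitted in this configuration")
            arcs.append((buffer[0], label, dep))
            has_head.add(dep)
            stack.pop()
        elif action == "RIGHT-ARC":
            # Stack top becomes head of buffer front, which is then pushed.
            dep = buffer.pop(0)
            arcs.append((stack[-1], label, dep))
            has_head.add(dep)
            stack.append(dep)
        elif action == "REDUCE":
            # Pop a stack top that has already been given a head.
            if stack[-1] not in has_head:
                raise ValueError("REDUCE requires the stack top to have a head")
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {action}")
    if buffer:
        raise ValueError("transition sequence did not consume the whole input")
    return arcs


# "She reads books" with tokens 1, 2, 3
sequence = [("SHIFT", None), ("LEFT-ARC", "nsubj"),
            ("RIGHT-ARC", "root"), ("RIGHT-ARC", "obj")]
print(run_arc_eager(3, sequence))
```

In a data-driven parser the hand-written `sequence` would be replaced by a classifier that picks the next transition from the current stack and buffer configuration.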
Towards a Parsing Algorithm:
- Abstract goal: Find transition sequence that yields the ‘correct’ tree.
- Learn from treebanks: output dependency tree with high probability.
- Probability distributions over transition sequences (rather than trees).
Architecture Summary
[Figure: slide from the IN2110 presentation]
The data in the test set is labeled, and the parser attempts predictions on it.
Data-driven dependency parsers
There are a number of freely available dependency parsers:
- Pre-trained models and trainable for any language (given available training data)
There does, however, need to be evaluation.
I wrote previously about how evaluation might need to change in NLP.
However, as things stand, the metrics are often what counts.
UAS: Unlabeled Attachment Score. For each token, does it have the correct head (source of the incoming edge)?
LAS: Labeled Attachment Score. In addition to the head, is the dependency type (edge label) correct?
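Both scores are simple per-token accuracies. A minimal sketch, assuming each analysis is stored as one (head, label) pair per token and that gold and predicted analyses cover the same tokens (the function name `attachment_scores` is my own):

```python
from typing import List, Tuple

Analysis = List[Tuple[int, str]]  # one (head, label) pair per token


def attachment_scores(gold: Analysis, predicted: Analysis) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions of tokens."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    # UAS: the predicted head matches the gold head.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label match.
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_heads / n, correct_both / n


gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Here every head is right but one label is wrong, so LAS comes out lower than UAS. LAS can never exceed UAS, since a token only counts for LAS if its head is already correct.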
In Conclusion
Data-Driven Dependency Parsing:
- No notion of grammaticality (no rules): more or less probable trees.
- Much room for experimentation: Feature models and types of classifiers.
- Decent results with Maximum Entropy or Support Vector Machines.
- In recent years, further advances with deep neural network classifiers.
Variants on Data-Driven Dependency Parsing:
- Other transition systems (e.g. arc-standard; like ‘classic’ shift-reduce).
- Different techniques for non-projective trees; e.g. swap transitions.
- Can relax transition system further, to output general, non-tree graphs.
- Beam search: exploring the top-n transitions out of each configuration.
I currently need to work with a corpus of documents and thought it was interesting to consider parsing as a problem.
This is #500daysofAI and you are reading article 433. I am writing one new article about or related to artificial intelligence every day for 500 days.
Translated from: https://medium.com/swlh/a-few-thoughts-on-parsing-text-b496a0f99dde