Generate Summaries Using Google's Pegasus Library
PEGASUS stands for Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. It uses the self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on arXiv. In this article, we will focus only on generating state-of-the-art abstractive summaries using Google's Pegasus library.
As of now, there is no easy way to generate summaries using the Pegasus library. However, Hugging Face is already working on implementing this and expects to release it around September 2020. In the meantime, we can follow the steps mentioned in the Pegasus GitHub repository and explore Pegasus. So let's get started.
This step will clone the library from GitHub, create the /content/pegasus folder, and install the requirements.
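The commands for this step were embedded as a gist in the original article; a minimal sketch, assuming the public google-research/pegasus repository, looks like this:

```shell
# Work from the Colab content directory
cd /content
# Clone the Pegasus repository from GitHub; this creates /content/pegasus
git clone https://github.com/google-research/pegasus.git
# Install the Python dependencies the repository lists
pip3 install -r pegasus/requirements.txt
```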
Next, follow the instructions to install gsutil. The steps below worked well for me in Colab.
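A sketch of the gsutil installation, following Google Cloud's apt-based instructions for Debian-based systems such as Colab (package and keyring names per Google's documentation):

```shell
# Register the Google Cloud SDK apt repository
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" \
  | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
# Import Google's package signing key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
  | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
# Install the SDK, which includes gsutil
apt-get update && apt-get install -y google-cloud-sdk
```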
This will create a folder named ckpt under /content/pegasus/ and then download all the necessary files (fine-tuned models, vocab, etc.) from Google Cloud to /content/pegasus/ckpt.
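The download itself uses gsutil against the public checkpoint bucket named in the Pegasus README (gs://pegasus_ckpt); a sketch:

```shell
# Create the target folder and recursively copy the pre-trained checkpoint,
# vocab model, and fine-tuned models for all downstream datasets
mkdir -p /content/pegasus/ckpt
gsutil -m cp -r gs://pegasus_ckpt/ /content/pegasus/ckpt/
```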
If all the above steps completed successfully, we will see the folder structure below in Google Colab. Under each downstream dataset's folder, we can see the fine-tuned models that we can use for generating extractive/abstractive summaries.
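The folder structure was shown as a screenshot in the original article; abridged, the downloaded tree looks roughly like this (folder names as in the public bucket; the dataset list is truncated here):

```
/content/pegasus/ckpt/pegasus_ckpt/
├── c4.unigram.newline.10pct.96000.model   (vocab)
├── model.ckpt-1500000.*                   (pre-trained checkpoint)
├── aeslc/
├── cnn_dailymail/
├── reddit_tifu/
└── ... (one folder per downstream dataset)
```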
Though it's not mentioned in the Pegasus GitHub repository's README instructions, the pegasus installation step below is necessary; otherwise you will run into errors. Also, make sure you are in the root folder /content before executing this step.
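A sketch of this installation step, assuming the repository ships a setup.py (run it from /content):

```shell
cd /content
# Install the cloned package itself so that `from pegasus import ...` works
pip3 install -e pegasus
```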
Now, let us try to understand Pegasus's pre-training corpora and downstream datasets. Pegasus is pre-trained on the C4 and HugeNews corpora and is then fine-tuned on 12 downstream datasets. The evaluation results on the downstream datasets are reported on GitHub and in the paper. Some of these datasets are extractive and some are abstractive, so the choice of dataset depends on whether we are looking for extractive or abstractive summaries.
Once all the above steps are taken care of, we could jump straight to the evaluate.py step mentioned below, but it would take a long time to complete, as it would make predictions on all the data in the evaluation set of the respective fine-tuned dataset being used. Since we are interested in summaries of custom or sample text, we need to make minor changes to the public_params.py file found under /content/pegasus/pegasus/params/public_params.py, as shown below.
Here I am making changes to reddit_tifu, as I am trying to use the reddit_tifu dataset for generating an abstractive summary. If you are experimenting with aeslc or another downstream dataset, you should make similar changes there.
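The edit to public_params.py is roughly of the following shape — a hedged sketch, since the exact registry names and dictionary keys depend on the version of the file in your checkout. The idea is to point the dataset's dev/test patterns at our own single-example TFRecord so evaluate.py predicts only on our text:

```python
# In /content/pegasus/pegasus/params/public_params.py, inside the
# reddit_tifu params function, replace the dev/test patterns so that
# evaluation runs on the custom TFRecord instead of the full test split.
# The key names and file path below are illustrative.
"dev_pattern": "tfrecord:/content/pegasus/test.tfrecord",
"test_pattern": "tfrecord:/content/pegasus/test.tfrecord",
```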
Here we pass the text from this news article as inp, which is then copied to inputs. Note that an empty string is passed to targets, as this is what we are going to predict. Then both inputs and targets are used to create the tfrecord that Pegasus expects.
inp = '''replace this with text from the above article'''
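A self-contained sketch of the tfrecord creation step (assuming TensorFlow 2.x; the output filename test.tfrecord and the feature keys inputs/targets follow the convention described above):

```python
import tensorflow as tf

# Text to summarize; in practice, replace with the news article's text
inp = "replace this with text from the above article"

inputs = inp
targets = ""  # empty string: this is what the model will predict

# Serialize inputs/targets into the TFRecord format Pegasus expects
with tf.io.TFRecordWriter("test.tfrecord") as writer:
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "inputs": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[inputs.encode("utf-8")])),
                "targets": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[targets.encode("utf-8")])),
            }))
    writer.write(example.SerializeToString())
```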
As the final step, when evaluate.py is run, the model makes a prediction, i.e. generates a summary of the above news article's text. This will generate 4 output files in the respective downstream dataset's folder. In this case, the input, output, prediction, and text_metric text files will be created under the reddit_tifu folder.
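The invocation follows the pattern given in the Pegasus README; the flag values here are illustrative — the vocab filename and model_dir match the files downloaded earlier, but verify the registered params name in your own checkout:

```shell
cd /content/pegasus
# The registered params name for reddit_tifu is defined in public_params.py;
# adjust it if your checkout uses a different name.
python3 pegasus/bin/evaluate.py \
  --params=reddit_tifu_long_transformer \
  --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5 \
  --model_dir=ckpt/pegasus_ckpt/reddit_tifu
```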
Abstractive summary (prediction):“India and Afghanistan on Monday discussed the evolving security situation in the region against the backdrop of a spike in terrorist violence in the country.”
This looks like a very well generated abstractive summary when we compare it with the news article we passed as input. By using different downstream datasets, we can generate extractive or abstractive summaries. We can also play around with different parameter values and see how they change the summaries.
Translated from: https://towardsdatascience.com/generate-summaries-using-googles-pegasus-library-772633a161c2