Dense Video Captioning Using PyTorch
Intro
The deep learning task of Video Captioning has been quite popular at the intersection of Computer Vision and Natural Language Processing for the last few years. In particular, Dense Video Captioning, one of its subfields, has been gaining traction among researchers. Dense Video Captioning is the task of localizing interesting events in an untrimmed video and producing an individual textual description for each event.
The Model
Dense Video Captioning is challenging because it requires a strong contextual representation of the video as well as the ability to detect localized events. Most models tackle this problem by decomposing it into two steps: detecting event proposals from the video and then generating a sentence for each event. The current state-of-the-art algorithm, the Bi-modal Transformer with Proposal Generator (BMT), proposes combining two input channels, visual and audio information, to perform dense video captioning. It achieves state-of-the-art performance on the ActivityNet Captions dataset, which consists of thousands of videos paired with captions associated with specific timeframes.
The BMT architecture consists of three main components: the Bi-modal Encoder, the Bi-modal Decoder, and finally the Proposal Generator.
First, the audio and visual streams of the video are encoded using VGGish and I3D, respectively. After feature extraction, the VGGish and I3D features are passed to the bi-modal encoder layers, where the audio and visual features are encoded into what the paper calls audio-attended visual and video-attended audio features. These features are then passed to the proposal generator, which takes in information from both modalities and generates event proposals.
After the event proposals are generated, the video is trimmed according to them. Each short clip is then passed through the entire pipeline again, starting from feature extraction. Audio and visual features are extracted, passed into the bi-modal encoder, and then to the bi-modal decoder layers. Here, the decoder layers take two inputs: the outputs of the last bi-modal encoder layer and the GloVe embeddings of the caption sequence generated so far. Finally, the decoder decodes the internal representation and produces the next word from the resulting probability distribution, which is appended to the previous caption sequence.
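To make the idea of audio-attended visual and video-attended audio features more concrete, here is a minimal PyTorch sketch of cross-modal attention, where each modality attends over the other. It is only an illustration of the mechanism under assumed feature dimensions, not the paper's actual encoder layer (which also stacks such blocks with feed-forward layers and normalization):

import torch
import torch.nn as nn

# Toy cross-modal attention: each modality attends to the other one.
# Shapes follow PyTorch's default (sequence, batch, embedding) layout;
# d_model = 1024 is an assumed projection size, not the paper's exact value.
d_model, n_heads = 1024, 4
vis_attends_aud = nn.MultiheadAttention(d_model, n_heads)  # produces "audio-attended visual"
aud_attends_vis = nn.MultiheadAttention(d_model, n_heads)  # produces "video-attended audio"

visual = torch.randn(30, 1, d_model)  # stand-in for projected I3D features (30 timesteps)
audio = torch.randn(50, 1, d_model)   # stand-in for projected VGGish features (50 timesteps)

# Visual queries attend over audio keys/values, and vice versa.
audio_attended_visual, _ = vis_attends_aud(visual, audio, audio)
video_attended_audio, _ = aud_attends_vis(audio, visual, visual)
print(audio_attended_visual.shape, video_attended_audio.shape)
# torch.Size([30, 1, 1024]) torch.Size([50, 1, 1024])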
Implementation
In this article, we will show you how to use a pre-trained BMT model to perform dense video captioning on a given video. No model training is needed.
Step 1: Download Repo and Set Up Environment
Download the paper’s official repository via:
git clone --recursive https://github.com/v-iashin/BMT.git
cd BMT/
Download the VGG and I3D models as well as the GloVe embeddings. The script will save them in the ./data and ./.vector_cache folders.
bash ./download_data.sh
Set up a conda environment with all the required libraries and dependencies:
conda env create -f ./conda_env.yml
conda activate bmt
python -m spacy download en
Step 2: Download Video
Now you can pick the video that you want. As an example, I will use a short documentary about the recent global coronavirus pandemic from YouTube. I got this one:
You can download YouTube videos using online downloaders, but please use them carefully and at your own risk! After downloading the video, you can save it wherever you like. I created a test folder under the BMT project folder and copied the downloaded video into it.
mkdir test
# copied video to the test directory
After Downloading Video (Image by Author)
Step 3: Feature Extraction (I3D and VGGish)
After getting the video, it is time to extract the I3D features by first creating the corresponding conda environment and then running the Python script:
cd ./submodules/video_features
conda env create -f conda_env_i3d.yml
conda activate i3d
python main.py \
--feature_type i3d \
--on_extraction save_numpy \
--device_ids 0 \
--extraction_fps 25 \
--video_paths ../../test/pandemic.mp4 \
--output_path ../../test/
Extract VGGish features using a similar procedure:
conda env create -f conda_env_vggish.yml
conda activate vggish
wget https://storage.googleapis.com/audioset/vggish_model.ckpt -P ./models/vggish/checkpoints
python main.py \
--feature_type vggish \
--on_extraction save_numpy \
--device_ids 0 \
--video_paths ../../test/pandemic.mp4 \
--output_path ../../test/
After running the above scripts, the I3D and VGGish features will be saved in the test directory. The saved features include RGB visual features (pandemic_rgb.npy), optical flow features (pandemic_flow.npy), and audio features (pandemic_vggish.npy).
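Before moving on, you can quickly sanity-check the extracted features with a few lines of Python (the file names match the paths used in this walkthrough; the printed shapes will depend on your video and extraction settings):

import numpy as np

# File names follow the paths used in this walkthrough; the exact shapes
# depend on the video length and the extraction settings.
for name in ["pandemic_rgb", "pandemic_flow", "pandemic_vggish"]:
    feats = np.load(f"./test/{name}.npy")
    print(name, feats.shape, feats.dtype)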
After Feature Extraction (Image by Author)
Step 4: Run Dense Video Captioning on the Video
Navigate back to the main project folder and activate the bmt environment that was set up previously. Finally, we can run video captioning with the command below:
cd ../../
conda activate bmt
python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./test/pandemic_vggish.npy \
--rgb_features_path ./test/pandemic_rgb.npy \
--flow_features_path ./test/pandemic_flow.npy \
--duration_in_secs 99 \
--device_id 0 \
--max_prop_per_vid 100 \
--nms_tiou_thresh 0.4
The prop_generator_model_path and pretrained_cap_model_path arguments specify the proposal generator model path and the captioning model path. Because we are using both pre-trained models, we can point these arguments directly to the paths where the pre-trained models were saved earlier. vggish_features_path, rgb_features_path, and flow_features_path are the paths where the corresponding features were saved. duration_in_secs is the duration of the video in seconds, device_id is the GPU to use, max_prop_per_vid is the maximum number of proposals to search for (a hyperparameter), and nms_tiou_thresh is the non-maximum suppression threshold parameter.
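Note that the duration_in_secs value (99 in the command above) has to match your own video. One possible way to obtain it, assuming opencv-python is installed (this snippet is my own addition, not part of the repository; any media tool such as ffprobe works equally well), is:

import cv2

# Compute the video duration from the frame count and the frame rate.
cap = cv2.VideoCapture("./test/pandemic.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)               # frames per second
n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # total number of frames
cap.release()
print(f"duration_in_secs: {n_frames / fps:.1f}")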
Results
After running dense video captioning, here is the result that is printed:
[]
The result is a list of start and end times, along with a sentence describing the content of the video within each time span. After watching the video and comparing it with the output, I must say that the model performs quite well in terms of understanding the video and taking the audio into account!
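As a small usage example, if you load the printed proposals into Python as a list of dictionaries with start, end, and sentence fields (these key names are my assumption based on the repository's output format, so double-check them against what your run actually prints), you can turn them into a standard .srt subtitle file:

def to_srt(proposals, path="captions.srt"):
    # proposals: list of {"start": seconds, "end": seconds, "sentence": text}
    # (assumed structure -- adjust the keys to match the actual printed output)
    def ts(sec):
        h, rem = divmod(int(sec), 3600)
        m, s = divmod(rem, 60)
        ms = int((sec - int(sec)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(path, "w") as f:
        for i, p in enumerate(sorted(proposals, key=lambda x: x["start"]), start=1):
            f.write(f"{i}\n{ts(p['start'])} --> {ts(p['end'])}\n{p['sentence'].strip()}\n\n")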
This can be applied to any video, so feel free to try it out! This article is based on the official repository, so make sure to check it out as well.
Conclusion
This article has shown how you can generate multiple captions for different timeframes within a single video using a pre-trained model. The model performs quite well, and there are further improvements that can likely be applied to the model and to the dense video captioning task in the future.
For more tutorials regarding deep learning, please feel free to check out:
Translated from: https://towardsdatascience.com/dense-video-captioning-using-pytorch-392ca0d6971a