TensorFlow C3D for Video Action Recognition
This article presents a simple implementation of C3D, a classic network for video action recognition, and can serve as an introduction to the field. The paper is "Learning Spatiotemporal Features with 3D Convolutional Networks" (ICCV 2015).
Framework: TensorFlow 1.6 + Python 2.7 + slim
Dataset: UCF101 (Center for Research in Computer Vision at the University of Central Florida)
Code: 2012013382/C3D-Tensorflow-slim
The basic concepts of 3D convolution are covered extensively elsewhere and are not repeated here. This section focuses on how the input frames (images) change shape as they pass through the network.
The basic C3D network structure is shown in Figure 1:
Figure 1: C3D network architecture
Details:
1) The input clip has shape [batch_size, frame_length, crop_size, crop_size, channel_num], where frame_length is 16 (each sample consists of 16 frames), crop_size is 112, and channel_num is 3, i.e., every frame is normalized to size [112, 112, 3].
2) Every convolution kernel has size [3, 3, 3]: the first dimension is the temporal dimension, and the last two are the spatial kernel size within a frame. All convolutions use stride [1, 1, 1] and padding='SAME'.
3) All pooling layers are 3D max pooling. Only the first pooling layer uses kernel size and stride [1, 2, 2]; all others use [2, 2, 2] (dimensions ordered as in 1)), with padding='SAME'. The authors use 1 in the temporal dimension of the first pooling layer to avoid shrinking the temporal dimension to 1 too early.
The shape of an input clip changes through the network as follows.
Assume batch_size is 10.
Input shape:[10, 16, 112, 112, 3]
After conv1:[10, 16, 112, 112, 64]
After pool1:[10, 16, 56, 56, 64]
After conv2a:[10, 16, 56, 56, 128]
After pool2:[10, 8, 28, 28, 128]
After conv3a:[10, 8, 28, 28, 256]
After conv3b:[10, 8, 28, 28, 256]
After pool3:[10, 4, 14, 14, 256]
After conv4a:[10, 4, 14, 14, 512]
After conv4b:[10, 4, 14, 14, 512]
After pool4:[10, 2, 7, 7, 512]
After conv5a:[10, 2, 7, 7, 512]
After conv5b:[10, 2, 7, 7, 512]
After pool5:[10, 1, 4, 4, 512]
After fc6:[10, 4096]
After fc7:[10, 4096]
out:[10, num_classes] (num_classes is 101 for UCF101)
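As a sanity check, the temporal and spatial sizes above can be reproduced from the pooling strides alone: with padding='SAME', each dimension shrinks to ceil(size / stride). A minimal sketch (not part of the original repo) that traces the five pooling layers:

import math

def pool_out(size, stride):
    # With padding='SAME', the output size is ceil(input_size / stride).
    return int(math.ceil(float(size) / stride))

t, s = 16, 112  # temporal length and spatial size going into pool1
for name, (stride_t, stride_s) in [('pool1', (1, 2)), ('pool2', (2, 2)),
                                   ('pool3', (2, 2)), ('pool4', (2, 2)),
                                   ('pool5', (2, 2))]:
    t, s = pool_out(t, stride_t), pool_out(s, stride_s)
    print('%s: temporal=%d, spatial=%d' % (name, t, s))
# pool5 prints temporal=1, spatial=4, matching the [10, 1, 4, 4, 512] shape above.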
Data Preprocessing
When working with video, data preprocessing is relatively involved. Because video datasets are usually large, we first convert the videos into image frames and then read one batch of data from disk at a time. After downloading the UCF101 dataset, extract it into the project root directory. Create a file convert_video_to_images.sh with the following content:

for folder in $1/*
do
    for file in "$folder"/*.avi
    do
        if [[ ! -d "${file[@]%.avi}" ]]; then
            mkdir -p "${file[@]%.avi}"
        fi
        ffmpeg -i "$file" -vf fps=$2 "${file[@]%.avi}"/%05d.jpg
        rm "$file"
    done
done

Then run:

sudo ./convert_video_to_images.sh UCF101/ 5

which extracts 5 frames per second from each video (note that the script deletes the original .avi files after conversion).
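If ffmpeg is not available, the frames can also be extracted with OpenCV, which the project already depends on. A rough, hypothetical sketch (extract_frames is not part of the original repo and assumes an OpenCV 3.x-style API):

import cv2
import os

def extract_frames(video_path, out_dir, target_fps=5):
    # Save roughly target_fps frames per second of the video as numbered JPEGs.
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(video_fps / target_fps)), 1)
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    index, saved = 0, 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if index % step == 0:
            saved += 1
            cv2.imwrite(os.path.join(out_dir, '%05d.jpg' % saved), frame)
        index += 1
    cap.release()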
Next, generate the training and test lists. Create a file convert_images_to_list.sh with the following content:

> train.list
> test.list
COUNT=-1
for folder in $1/*
do
    COUNT=$[$COUNT + 1]
    for imagesFolder in "$folder"/*
    do
        if (( $(jot -r 1 1 $2) > 1 )); then
            echo "$imagesFolder" $COUNT >> train.list
        else
            echo "$imagesFolder" $COUNT >> test.list
        fi
    done
done

Then run:

./convert_images_to_list.sh UCF101/ 4

which assigns roughly 1/4 of the data to the test set and the rest to the training set. (jot is a BSD/macOS utility; on Linux it may need to be installed separately or replaced with an equivalent such as shuf.)
For both training and testing, a batch of batch_size samples is read from disk each time, as follows.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import PIL.Image as Image
import random
import numpy as np
import os
import time
import cv2

CLIP_LENGTH = 16
VALIDATION_PRO = 0.2

np_mean = np.load('crop_mean.npy').reshape([CLIP_LENGTH, 112, 112, 3])

def get_test_num(filename):
    lines = open(filename, 'r')
    return len(list(lines))

def get_video_indices(filename):
    lines = open(filename, 'r')
    # Shuffle data
    lines = list(lines)
    video_indices = list(range(len(lines)))
    random.seed(time.time())
    random.shuffle(video_indices)
    validation_video_indices = video_indices[:int(len(video_indices) * 0.2)]
    train_video_indices = video_indices[int(len(video_indices) * 0.2):]
    return train_video_indices, validation_video_indices

def frame_process(clip, clip_length=CLIP_LENGTH, crop_size=112, channel_num=3):
    frames_num = len(clip)
    croped_frames = np.zeros([frames_num, crop_size, crop_size, channel_num]).astype(np.float32)
    # Crop every frame into shape [crop_size, crop_size, channel_num]
    for i in range(frames_num):
        img = Image.fromarray(clip[i].astype(np.uint8))
        if img.width > img.height:
            # Resize so the shorter side equals crop_size, keeping the aspect ratio
            scale = float(crop_size) / float(img.height)
            img = np.array(cv2.resize(np.array(img), (int(img.width * scale + 1), crop_size))).astype(np.float32)
        else:
            scale = float(crop_size) / float(img.width)
            img = np.array(cv2.resize(np.array(img), (crop_size, int(img.height * scale + 1)))).astype(np.float32)
        # Center-crop to [crop_size, crop_size] and subtract the per-frame mean
        crop_x = int((img.shape[0] - crop_size) / 2)
        crop_y = int((img.shape[1] - crop_size) / 2)
        img = img[crop_x: crop_x + crop_size, crop_y: crop_y + crop_size, :]
        croped_frames[i, :, :, :] = img - np_mean[i]
    return croped_frames

def convert_images_to_clip(filename, clip_length=CLIP_LENGTH, crop_size=112, channel_num=3):
    clip = []
    for parent, dirnames, filenames in os.walk(filename):
        filenames = sorted(filenames)
        if len(filenames) < clip_length:
            # Too few frames: take them all, then pad by repeating the last frame
            for i in range(0, len(filenames)):
                image_name = str(filename) + '/' + str(filenames[i])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
            for i in range(clip_length - len(filenames)):
                image_name = str(filename) + '/' + str(filenames[len(filenames) - 1])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
        else:
            # Enough frames: pick a random contiguous segment of clip_length frames
            s_index = random.randint(0, len(filenames) - clip_length)
            for i in range(s_index, s_index + clip_length):
                image_name = str(filename) + '/' + str(filenames[i])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
    if len(clip) == 0:
        print(filename)
    clip = frame_process(clip, clip_length, crop_size, channel_num)
    return clip  # shape [clip_length, crop_size, crop_size, channel_num]

def get_batches(filename, num_classes, batch_index, video_indices, batch_size=10, crop_size=112, channel_num=3):
    lines = open(filename, 'r')
    clips = []
    labels = []
    lines = list(lines)
    for i in video_indices[batch_index: batch_index + batch_size]:
        line = lines[i].strip('\n').split()
        dirname = line[0]
        label = line[1]
        i_clip = convert_images_to_clip(dirname, CLIP_LENGTH, crop_size, channel_num)
        clips.append(i_clip)
        labels.append(int(label))
    # Convert to numpy and one-hot encode the labels
    clips = np.array(clips).astype(np.float32)
    labels = np.array(labels).astype(np.int64)
    oh_labels = np.zeros([len(labels), num_classes]).astype(np.int64)
    for i in range(len(labels)):
        oh_labels[i, labels[i]] = 1
    batch_index = batch_index + batch_size
    batch_data = {'clips': clips, 'labels': oh_labels}
    return batch_data, batch_index

Note that, for simplicity, one contiguous 16-frame segment is randomly sampled from each video to form a clip, which serves as a single sample. With a batch_size of 10, this means 10 videos are drawn and 16 frames are taken from each, giving 10 clips as one input to the network.
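A quick usage sketch of this module (hypothetical, assuming train.list and crop_mean.npy exist in the working directory):

import data_processing

train_idx, val_idx = data_processing.get_video_indices('train.list')
batch, next_index = data_processing.get_batches('train.list', 101, 0, train_idx, batch_size=10)
print(batch['clips'].shape)   # expected: (10, 16, 112, 112, 3)
print(batch['labels'].shape)  # expected: (10, 101)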
Model
The model is implemented with slim, since the implementation is concise and easy to read.

import tensorflow as tf
import tensorflow.contrib.slim as slim

def C3D(input, num_classes, keep_pro=0.5):
    with tf.variable_scope('C3D'):
        with slim.arg_scope([slim.conv3d],
                            padding='SAME',
                            weights_regularizer=slim.l2_regularizer(0.0005),
                            activation_fn=tf.nn.relu,
                            kernel_size=[3, 3, 3],
                            stride=[1, 1, 1]):
            net = slim.conv3d(input, 64, scope='conv1')
            net = slim.max_pool3d(net, kernel_size=[1, 2, 2], stride=[1, 2, 2], padding='SAME', scope='max_pool1')
            net = slim.conv3d(net, 128, scope='conv2')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool2')
            net = slim.repeat(net, 2, slim.conv3d, 256, scope='conv3')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool3')
            net = slim.repeat(net, 2, slim.conv3d, 512, scope='conv4')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool4')
            net = slim.repeat(net, 2, slim.conv3d, 512, scope='conv5')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool5')
            # Flatten the [1, 4, 4, 512] feature map before the fully connected layers
            net = tf.reshape(net, [-1, 512 * 4 * 4])
            net = slim.fully_connected(net, 4096, weights_regularizer=slim.l2_regularizer(0.0005), scope='fc6')
            net = slim.dropout(net, keep_pro, scope='dropout1')
            net = slim.fully_connected(net, 4096, weights_regularizer=slim.l2_regularizer(0.0005), scope='fc7')
            net = slim.dropout(net, keep_pro, scope='dropout2')
            out = slim.fully_connected(net, num_classes,
                                       weights_regularizer=slim.l2_regularizer(0.0005),
                                       activation_fn=None, scope='out')
    return out
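A minimal, hypothetical sanity check of the model definition (not part of the original repo): build the graph with a dummy placeholder and confirm the logits shape.

import tensorflow as tf
import C3D_model

with tf.Graph().as_default():
    clips = tf.placeholder(tf.float32, [10, 16, 112, 112, 3])
    logits = C3D_model.C3D(clips, num_classes=101, keep_pro=1.0)
    print(logits.get_shape().as_list())  # expected: [10, 101]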
Training

import tensorflow as tf
import numpy as np
import C3D_model
import time
import data_processing
import os
import os.path
from os.path import join

TRAIN_LOG_DIR = os.path.join('Log/train/', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
TRAIN_CHECK_POINT = 'check_point/'
TRAIN_LIST_PATH = 'train.list'
TEST_LIST_PATH = 'test.list'
BATCH_SIZE = 10
NUM_CLASSES = 101
CROP_SZIE = 112
CHANNEL_NUM = 3
CLIP_LENGTH = 16
EPOCH_NUM = 50
INITIAL_LEARNING_RATE = 1e-4
LR_DECAY_FACTOR = 0.5
EPOCHS_PER_LR_DECAY = 2
MOVING_AV_DECAY = 0.9999

# Get shuffled train/validation indices
train_video_indices, validation_video_indices = data_processing.get_video_indices(TRAIN_LIST_PATH)

with tf.Graph().as_default():
    batch_clips = tf.placeholder(tf.float32, [BATCH_SIZE, CLIP_LENGTH, CROP_SZIE, CROP_SZIE, CHANNEL_NUM], name='X')
    batch_labels = tf.placeholder(tf.int32, [BATCH_SIZE, NUM_CLASSES], name='Y')
    keep_prob = tf.placeholder(tf.float32)

    logits = C3D_model.C3D(batch_clips, NUM_CLASSES, keep_prob)
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=batch_labels))
        tf.summary.scalar('entropy_loss', loss)
    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(batch_labels, 1)), np.float32))
        tf.summary.scalar('accuracy', accuracy)

    #global_step = tf.Variable(0, name='global_step', trainable=False)
    #decay_step = EPOCHS_PER_LR_DECAY * len(train_video_indices) // BATCH_SIZE
    learning_rate = 1e-4  #tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step, decay_step, LR_DECAY_FACTOR, staircase=True)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)  #, global_step=global_step)

    saver = tf.train.Saver()
    summary_op = tf.summary.merge_all()

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        train_summary_writer = tf.summary.FileWriter(TRAIN_LOG_DIR, sess.graph)
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        step = 0
        for epoch in range(EPOCH_NUM):
            accuracy_epoch = 0
            loss_epoch = 0
            batch_index = 0
            # Training pass over the shuffled training indices
            for i in range(len(train_video_indices) // BATCH_SIZE):
                step += 1
                batch_data, batch_index = data_processing.get_batches(TRAIN_LIST_PATH, NUM_CLASSES, batch_index,
                                                                      train_video_indices, BATCH_SIZE)
                _, loss_out, accuracy_out, summary = sess.run([optimizer, loss, accuracy, summary_op],
                                                              feed_dict={batch_clips: batch_data['clips'],
                                                                         batch_labels: batch_data['labels'],
                                                                         keep_prob: 0.5})
                loss_epoch += loss_out
                accuracy_epoch += accuracy_out

                if i % 10 == 0:
                    print('Epoch %d, Batch %d: Loss is %.5f; Accuracy is %.5f' % (epoch + 1, i, loss_out, accuracy_out))
                    train_summary_writer.add_summary(summary, step)

            print('Epoch %d: Average loss is: %.5f; Average accuracy is: %.5f' % (epoch + 1,
                  loss_epoch / (len(train_video_indices) // BATCH_SIZE),
                  accuracy_epoch / (len(train_video_indices) // BATCH_SIZE)))

            # Validation pass (dropout disabled via keep_prob = 1.0)
            accuracy_epoch = 0
            loss_epoch = 0
            batch_index = 0
            for i in range(len(validation_video_indices) // BATCH_SIZE):
                batch_data, batch_index = data_processing.get_batches(TRAIN_LIST_PATH, NUM_CLASSES, batch_index,
                                                                      validation_video_indices, BATCH_SIZE)
                loss_out, accuracy_out = sess.run([loss, accuracy],
                                                  feed_dict={batch_clips: batch_data['clips'],
                                                             batch_labels: batch_data['labels'],
                                                             keep_prob: 1.0})
                loss_epoch += loss_out
                accuracy_epoch += accuracy_out

            print('Validation loss is %.5f; Accuracy is %.5f' % (loss_epoch / (len(validation_video_indices) // BATCH_SIZE),
                                                                 accuracy_epoch / (len(validation_video_indices) // BATCH_SIZE)))

            saver.save(sess, TRAIN_CHECK_POINT + 'train.ckpt', global_step=epoch)

Here, 20% of the training set is held out as a validation set; after every epoch on the training set, the model is evaluated once on the validation set.
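Since the script writes summaries to TRAIN_LOG_DIR via tf.summary.FileWriter, the loss and accuracy curves can be inspected during training by pointing TensorBoard at that directory (for example, tensorboard --logdir Log/train).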
Test
import tensorflow as tf
import numpy as np
import C3D_model
import data_processing

TRAIN_LOG_DIR = 'Log/train/'
TRAIN_CHECK_POINT = 'check_point/train.ckpt-36'
TEST_LIST_PATH = 'test.list'
BATCH_SIZE = 10
NUM_CLASSES = 101
CROP_SZIE = 112
CHANNEL_NUM = 3
CLIP_LENGTH = 16
EPOCH_NUM = 50

test_num = data_processing.get_test_num(TEST_LIST_PATH)
test_video_indices = range(test_num)

with tf.Graph().as_default():
    batch_clips = tf.placeholder(tf.float32, [BATCH_SIZE, CLIP_LENGTH, CROP_SZIE, CROP_SZIE, CHANNEL_NUM], name='X')
    batch_labels = tf.placeholder(tf.int32, [BATCH_SIZE, NUM_CLASSES], name='Y')
    keep_prob = tf.placeholder(tf.float32)

    logits = C3D_model.C3D(batch_clips, NUM_CLASSES, keep_prob)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(batch_labels, 1)), np.float32))

    restorer = tf.train.Saver()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # Restore the trained weights from the checkpoint
        restorer.restore(sess, TRAIN_CHECK_POINT)
        accuracy_epoch = 0
        batch_index = 0
        for i in range(test_num // BATCH_SIZE):
            if i % 10 == 0:
                print('Testing %d of %d' % (i + 1, test_num // BATCH_SIZE))
            batch_data, batch_index = data_processing.get_batches(TEST_LIST_PATH, NUM_CLASSES, batch_index,
                                                                  test_video_indices, BATCH_SIZE)
            accuracy_out = sess.run(accuracy,
                                    feed_dict={batch_clips: batch_data['clips'],
                                               batch_labels: batch_data['labels'],
                                               keep_prob: 1.0})
            accuracy_epoch += accuracy_out

        print('Test accuracy is %.5f' % (accuracy_epoch / (test_num // BATCH_SIZE)))

Experimental Results
I trained for 36 epochs on the training set; the final model reaches roughly 72% accuracy on the test set.
References
hx173149/C3D-tensorflow (github.com)