3D-Inflated ResNet-50 Trained on Kinetics 400 Data

Identify the main action in a video

This model uses 3D inflation to bootstrap the kernels of a 3D convolutional network from the 2D ResNet-50 architecture, so that years of progress on image-domain architectures carry over directly to video applications. The weights of the 3D convolutional filters were initialized by replicating the 2D filters of ResNet-50 along the time dimension, which can be seen as implicit pre-training on a video dataset made of static ImageNet images repeated across time.
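The initialization described above can be sketched on a toy kernel. This is an illustration only, not the code used to build the model: the helper name inflate, the kernel dimensions and the temporal depth are made up for the example. The division by the temporal depth follows the rescaling described in the referenced paper, so that a video of identical frames gives the same filter response as the 2D filter on a single frame.

```wolfram
(* Toy 2D kernel with dimensions {outputChannels, inputChannels, height, width} *)
kernel2D = RandomInteger[{-5, 5}, {4, 3, 3, 3}];

(* Hypothetical helper: repeat each height x width slice t times along a new
   time axis and divide by t, giving {out, in, t, height, width} *)
inflate[kernel_, t_Integer?Positive] := Map[ConstantArray[#/t, t] &, kernel, {2}];

kernel3D = inflate[kernel2D, 5];
Dimensions[kernel3D]

(* Summing over the time axis should recover the original 2D kernel *)
Total[kernel3D, {3}] == kernel2D
```

Integer entries are used so that the final comparison is exact rather than subject to floating-point rounding.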

Training Set Information

Kinetics-400 human action video data, consisting of four hundred human action classes, with at least four hundred video clips for each action. Each clip lasts around 10 seconds and is taken from a different YouTube video.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=
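A minimal sketch of this step, assuming the model is registered under the title of this page:

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"]
```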

Basic usage

Classify a video:

In[2]:=

Out[2]=

In[3]:=

Out[3]=

Obtain the probabilities predicted by the net:

In[4]:=

Out[4]=
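The basic usage steps might look as follows. Several things are assumed here: the resource name matches the page title, the net accepts a Video object directly and ends in a class decoder, and the sample video (borrowed from the transfer-learning example on this page) merely stands in for whatever clip is being classified.

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"];

(* Stand-in input; any Video object showing a human action would do *)
video = ResourceData["Sample Video: Wild Ducks in the Park"];

(* Most likely class *)
net[video]

(* Five most likely classes with their probabilities *)
net[video, {"TopProbabilities", 5}]

(* Probabilities for all classes *)
net[video, "Probabilities"]
```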

Feature extraction

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[5]:=

Out[5]=

Get a set of videos:

In[6]:=

Visualize the features of a set of videos:

In[7]:=

FeatureSpacePlot[videos, FeatureExtractor -> (extractor[{#, #}] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 50, ImageSize -> 500, Method -> "TSNE"]

Out[7]=

Transfer learning

Use the pre-trained model to build a classifier that tells apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[8]:=

videos = <|
  VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park",
  VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"
|>;

In[9]:=

frameRate = 8;
sampleRate = 8;
maxFrameNumber = frameRate*sampleRate;

In[10]:=

dataset = Join @@ KeyValueMap[
   Table[
     VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2,
     {i, 1, Information[#1, "FrameCount"][[1]] - Mod[Information[#1, "FrameCount"][[1]], maxFrameNumber], Round[maxFrameNumber/4]}
   ] &,
   videos];
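The iterator in the Table above can be checked in isolation. A small sketch, using a hypothetical clip of 300 frames (10 seconds at 30 frames per second; the real sample videos may have a different frame count):

```wolfram
maxFrameNumber = 8*8;   (* 64 frames per training clip, as defined above *)
frameCount = 300;       (* hypothetical value, for illustration only *)

(* Start frames: from 1 up to the largest multiple of 64 that fits, in steps of 64/4 = 16 *)
starts = Range[1, frameCount - Mod[frameCount, maxFrameNumber], Round[maxFrameNumber/4]]

(* Matching end frames; consecutive windows share 48 frames *)
ends = starts + maxFrameNumber - 1
```

With these numbers the last window starts at frame 241 and would end at frame 304, past the end of the clip, so the result depends on how VideoTrim treats an out-of-range end frame. That behavior is not verified here.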

In[11]:=
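One way to obtain the train and test sets is a shuffled 80/20 split. This is a sketch, not necessarily the split used originally: the ratio is an arbitrary choice, the name shuffled is introduced only for this example, and dataset is assumed to be the list of clip -> label rules built in the previous step.

```wolfram
(* Shuffle so that both classes are likely to appear in both parts *)
shuffled = RandomSample[dataset];

(* First 80% for training, the rest for testing; the ratio is an assumption *)
{train, test} = TakeDrop[shuffled, Round[0.8*Length[shuffled]]];

Length /@ {train, test}
```

Because neighbouring clips overlap in time, a random split like this lets the test set share frames with the training set, which makes the test accuracy optimistic.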

Remove the last two layers from the pre-trained net:

In[12]:=

Out[12]=
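A sketch of this step, assuming the resource name matches the page title and that the pre-trained net is a NetChain whose last two layers form the classification head. Under that assumption, keeping everything up to the third-to-last layer leaves a net that outputs the 2048-dimensional feature vector expected by the linear layer in the next step.

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"];

(* Assumption: the net is a NetChain, so layers can be addressed by position;
   {1, -3} keeps every layer except the last two *)
tempNet = NetTake[net, {1, -3}]
```

If the net turns out to be a NetGraph with named layers, the layer specification would need to use the corresponding layer names instead of positions.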

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[13]:=

newNet = NetJoin[tempNet, NetChain[{"Linear" -> LinearLayer[2, "Input" -> {2048}], "SoftMax" -> SoftmaxLayer[]}], "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]];

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[14]:=

trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2]

Out[14]=

Perfect accuracy is obtained on the test set:

In[15]:=

Out[15]=
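A sketch of the evaluation, assuming trainedNet is the result of the training step and test is a held-out list of clip -> label rules:

```wolfram
(* Fraction of test clips whose predicted class matches the label *)
NetMeasurements[trainedNet, test, "Accuracy"]
```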

Net information

Inspect the number of parameters of all arrays in the net:

In[16]:=

Out[16]=

Obtain the total number of parameters:

In[17]:=

Out[17]=

Obtain the layer type counts:

In[18]:=

Out[18]=

Display the summary graphic:

In[19]:=

Out[19]=
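The four queries in this section can be sketched with Information, assuming the resource name matches the page title:

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"];

Information[net, "ArraysElementCounts"]      (* number of parameters in each array *)
Information[net, "ArraysTotalElementCount"]  (* total number of parameters *)
Information[net, "LayerTypeCounts"]          (* how many layers of each type *)
Information[net, "SummaryGraphic"]           (* schematic of the net *)
```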

Resource History

Date Created: 4 January 2023

Reference

J. Carreira, A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," arXiv:1705.07750v3 (2018)
Available from: https://github.com/facebookresearch/pytorchvideo
Rights: Apache License 2.0