Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data

Identify the main action in a video

Released in 2018, this family of models is obtained by splitting the 3D convolutional filters into distinct spatial and temporal components, yielding a significant increase in accuracy.

Number of models: 3

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:=
NetModel["Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "ParametersInformation"]
Out[3]=

Pick a non-default net by specifying the parameters:

In[5]:=
NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}]
Out[5]=

Pick a non-default uninitialized net:

In[7]:=
NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}, "UninitializedEvaluationNet"]
Out[7]=
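
An uninitialized net can be given random initial weights, e.g. in preparation for training from scratch (a minimal sketch using NetInitialize):

NetInitialize[
 NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}, "UninitializedEvaluationNet"]]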

Basic usage

Get a video:

In[9]:=
video = ResourceData["Sample Video: Barbecuing"];

Show some of the video frames:

In[10]:=
VideoFrameList[video, 3]
Out[10]=

Identify the main action in a video:

In[11]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video]
Out[11]=

Obtain the probabilities of the 10 most likely actions predicted by the net:

In[13]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video, {"TopProbabilities", 10}]
Out[13]=
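
The full probability distribution over all classes is available through the "Probabilities" property of the net's class decoder; for instance, take its three largest entries (a minimal sketch; the variable probs is introduced here for illustration):

probs = NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video, "Probabilities"];
TakeLargest[probs, 3]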

An activity outside the list of the Kinetics-400 classes will be misidentified:

In[15]:=
ducks = ResourceData["Sample Video: Wild Ducks in the Park"];
VideoFrameList[ducks, 3]
Out[15]=
In[17]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][ducks]
Out[17]=
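
Inspecting the top probabilities shows how the net distributes its confidence on such out-of-domain input (same "TopProbabilities" mechanism as above):

NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][ducks, {"TopProbabilities", 3}]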

Obtain the list of names of all available classes:

In[19]:=
NetExtract[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"Output", "Labels"}]
Out[19]=
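
As a quick sanity check, the label set should contain 400 classes, matching the Kinetics-400 name (the variable classes is introduced here for illustration):

classes = NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"Output", "Labels"}];
Length[classes]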

Identify the main action in the video over a moving window of frames:

In[21]:=
VideoMapList[
 RandomChoice[#Image] -> NetModel[
     "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][#Image] &, video, Quantity[16, "Frames"], Quantity[16, "Frames"]]
Out[21]=

Visualize convolutional weights

Extract the weights of the first convolutional layer in the trained net:

In[25]:=
weights = NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"stem", "conv1", "Weights"}];

Show the dimensions of the weights:

In[26]:=
Dimensions[weights]
Out[26]=

The first convolution of the "(2+1)D" stem is purely spatial, so its kernels have a singleton temporal dimension; drop it to obtain the 2D spatial kernels:

In[27]:=
nweights = weights[[All, All, 1, All, All]]
Out[27]=

Visualize the weights as a list of 45 images of size 7×7:

In[28]:=
ImageAdjust[Image[#, Interleaving -> False]] & /@ Normal[nweights]
Out[28]=

Network architecture

3D convolutional neural networks (3DCNNs) preserve temporal information by convolving over both time and space. In the "3D" architecture, every convolutional block is fully 3D. A 3D kernel has size L×H×W, where L denotes the temporal extent of the filter and H×W are its height and width. Extract the kernel size of the first convolutional layer from each of the convolutional blocks:

In[29]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}], {blockName, "conv1", "KernelSize"}],
 {i, 1, 8}
 ]
Out[29]=

"Mixed" 3D convolutional neural networks (3DCNNs) use 3D convolutions for early layers and 2D convolutions for later layers. In the later blocks, L=1, which implies that different frames are processed independently:

In[31]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "Mixed"}], {blockName, "conv1", "KernelSize"}],
 {i, 1, 8}
 ]
Out[31]=

Another approach is to replace each 3D kernel of size L×H×W with a "(2+1)D" block consisting of a spatial 2D convolution with filters of size 1×H×W followed by a temporal convolution with filters of size L×1×1. Extract the kernel sizes of the first two convolutions from each block to see the alternating spatial and temporal convolutions:

In[33]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {{blockName, "conv1", "KernelSize"}, {blockName, "conv1", "KernelSize"}}],
 {i, 8}
 ]
Out[33]=
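
For illustration, a single "(2+1)D" convolution can be assembled by hand. The following is a minimal sketch rather than the exact block used in this model; the helper r2plus1dConv is hypothetical, and the intermediate channel count m follows the parameter-matching rule from the R(2+1)D paper, so that the factorized pair has roughly as many parameters as the full L×H×W convolution it replaces:

(* hypothetical helper: factorize an L x H x W convolution into a 1 x H x W
   spatial convolution followed by an L x 1 x 1 temporal convolution *)
r2plus1dConv[nIn_, nOut_, {l_, h_, w_}] := Module[{m},
  (* choose m so that m (h w nIn + l nOut) ≈ l h w nIn nOut *)
  m = Floor[(l*h*w*nIn*nOut)/(h*w*nIn + l*nOut)];
  NetChain[{
    ConvolutionLayer[m, {1, h, w}, "PaddingSize" -> {0, Floor[h/2], Floor[w/2]}],
    BatchNormalizationLayer[],
    Ramp,
    ConvolutionLayer[nOut, {l, 1, 1}, "PaddingSize" -> {Floor[l/2], 0, 0}]}]];
r2plus1dConv[64, 64, {3, 3, 3}]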

The summary graphics for the "(2+1)D", "3D" and "Mixed" spatiotemporal ResNet-18 architectures are shown below:

Out[35]=

Advanced usage

Evaluation of the network with the recommended (default) settings is time-consuming:

In[36]:=
AbsoluteTiming@
 NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video]
Out[36]=

Evaluation time scales almost linearly with "TargetLength", so the encoder can be tuned to trade accuracy for speed:

In[38]:=
enc[nframes_] := NetEncoder[{"VideoFrames", {112, 112},
    "MeanImage" -> {0.43216, 0.394666, 0.37645},
    "VarianceImage" -> {0.22803, 0.22145, 0.216989},
    "ColorSpace" -> "RGB", "TargetLength" -> nframes, FrameRate -> Inherited}];
newEncoder = enc[5]
Out[38]=

Replace the encoder in the original NetModel:

In[40]:=
newNet = NetReplacePart[
  NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "Input" -> newEncoder]
Out[40]=

Evaluate on the original video: the evaluation time is reduced to about a third, while the prediction remains correct:

In[42]:=
AbsoluteTiming@newNet[ResourceData["Sample Video: Barbecuing"]]
Out[42]=
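
The trade-off can be mapped by timing the evaluation for several values of "TargetLength" (a minimal sketch reusing enc and video from above; timings vary by machine):

Table[nframes -> First@AbsoluteTiming[
     NetReplacePart[
       NetModel[
        "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "Input" -> enc[nframes]][video]],
 {nframes, {4, 8, 16, 32}}]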

Feature extraction

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[43]:=
extractor = NetTake[NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {1, -2}]
Out[43]=
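
Check the shape of the feature vector produced by the extractor (for a ResNet-18-style backbone, one would expect a 512-dimensional output; this is an expectation, not a measured result):

Information[extractor, "OutputPorts"]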

Get a set of videos:

In[44]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[45]:=
FeatureSpacePlot[videos, FeatureExtractor -> extractor,
 LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &),
 LabelingSize -> 100, ImageSize -> 600]
Out[45]=
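
The same features support retrieval; for example, find the video whose features are nearest to those of the first one (a minimal sketch reusing the videos and extractor defined above; position 1 is the query itself):

features = extractor /@ videos;
Nearest[features -> "Index", First[features], 2]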

Net information

Inspect the number of parameters of all arrays in the net:

In[46]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[46]=

Obtain the total number of parameters:

In[48]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[48]=

Obtain the layer type counts:

In[50]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[50]=

Display the summary graphic:

In[52]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[52]=

Export to ONNX

Export the net to the ONNX format:

In[54]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]]
Out[54]=

Get the size of the ONNX file:

In[56]:=
FileByteCount[onnxFile]
Out[56]=

The size is similar to the byte count of the resource object:

In[57]:=
ResourceObject[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]["ByteCount"]["EvaluationNet:(2+1)D"]
Out[57]=

Check some metadata of the ONNX model:

In[59]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[59]=

Import the model back into the Wolfram Language; the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[60]:=
Import[onnxFile]
Out[60]=
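
The original pre- and post-processing can be reattached by hand. This sketch assumes the imported net exposes ports named "Input" and "Output"; the actual port names can be checked with Information:

net = NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"];
NetReplacePart[Import[onnxFile],
 {"Input" -> NetExtract[net, "Input"], "Output" -> NetExtract[net, "Output"]}]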
