Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data

Identify the main action in a video

Released in 2018, this family of models is obtained by splitting the 3D convolutional filters into distinct spatial and temporal components, yielding a significant increase in accuracy.

Number of models: 3

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:=
NetModel["Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "ParametersInformation"]
Out[3]=

Pick a non-default net by specifying the parameters:

In[5]:=
NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}]
Out[5]=

Pick a non-default uninitialized net:

In[7]:=
NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}, "UninitializedEvaluationNet"]
Out[7]=
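
An uninitialized net can be given random initial weights, e.g. in preparation for training from scratch (a minimal sketch using NetInitialize):

NetInitialize[
 NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}, "UninitializedEvaluationNet"]]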

Basic usage

Get a video:

In[9]:=
video = ResourceData["Sample Video: Barbecuing"];

Show some of the video frames:

In[10]:=
VideoFrameList[video, 3]
Out[10]=

Identify the main action in a video:

In[11]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video]
Out[11]=

Obtain the probabilities of the 10 most likely actions predicted by the net:

In[13]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video, {"TopProbabilities", 10}]
Out[13]=
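
The full probability distribution over all classes is available through the "Probabilities" property of the net's class decoder; for instance, take its three largest entries (a minimal sketch; the variable probs is introduced here for illustration):

probs = NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video, "Probabilities"];
TakeLargest[probs, 3]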

An activity outside the list of the Kinetics-400 classes will be misidentified:

In[15]:=
ducks = ResourceData["Sample Video: Wild Ducks in the Park"];
VideoFrameList[ducks, 3]
Out[15]=
In[17]:=
NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][ducks]
Out[17]=
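
Inspecting the top probabilities shows how the net distributes its confidence on such out-of-domain input (same "TopProbabilities" mechanism as above):

NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][ducks, {"TopProbabilities", 3}]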

Obtain the list of names of all available classes:

In[19]:=
NetExtract[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"Output", "Labels"}]
Out[19]=
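
As a quick sanity check, the label set should contain 400 classes, matching the Kinetics-400 name (the variable classes is introduced here for illustration):

classes = NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"Output", "Labels"}];
Length[classes]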

Identify the main action in the video over a moving window of frames:

In[21]:=
VideoMapList[
 RandomChoice[#Image] -> NetModel[
     "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][#Image] &, video, Quantity[16, "Frames"], Quantity[16, "Frames"]]
Out[21]=

Visualize convolutional weights

Extract the weights of the first convolutional layer in the trained net:

In[25]:=
weights = NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {"stem", "conv1", "Weights"}];

Show the dimensions of the weights:

In[26]:=
Dimensions[weights]
Out[26]=

The first convolution of the "(2+1)D" stem is purely spatial, so its kernels have a singleton temporal dimension; drop it to obtain the 2D spatial kernels:

In[27]:=
nweights = weights[[All, All, 1, All, All]]
Out[27]=

Visualize the weights as a list of 45 images of size 7×7:

In[28]:=
ImageAdjust[Image[#, Interleaving -> False]] & /@ Normal[nweights]
Out[28]=

Network architecture

3D convolutional neural networks (3DCNNs) preserve temporal information by convolving over both time and space. In the "3D" architecture, every convolutional block is fully 3D. A 3D kernel has size L×H×W, where L denotes the temporal extent of the filter and H×W are its height and width. Extract the kernel size of the first convolutional layer from each of the convolutional blocks:

In[29]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "3D"}], {blockName, "conv1", "KernelSize"}],
 {i, 1, 8}
 ]
Out[29]=

"Mixed" 3D convolutional neural networks (3DCNNs) use 3D convolutions for early layers and 2D convolutions for later layers. In the later blocks, L=1, which implies that different frames are processed independently:

In[31]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[{"Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data", "Convolution" -> "Mixed"}], {blockName, "conv1", "KernelSize"}],
 {i, 1, 8}
 ]
Out[31]=

Another approach is to replace each 3D kernel of size L×H×W with a "(2+1)D" block consisting of a spatial 2D convolution with filters of size 1×H×W followed by a temporal convolution with filters of size L×1×1. Extract the kernel sizes of the first two convolutions from each block to see the alternating spatial and temporal convolutions:

In[33]:=
Table[blockName = StringJoin["block", ToString[Ceiling[i/2]], If[OddQ[i], "a", "b"]];
 blockName -> NetExtract[
   NetModel[
    "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {{blockName, "conv1", "KernelSize"}, {blockName, "conv1", "KernelSize"}}],
 {i, 8}
 ]
Out[33]=
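
For illustration, a single "(2+1)D" convolution can be assembled by hand. The following is a minimal sketch rather than the exact block used in this model; the helper r2plus1dConv is hypothetical, and the intermediate channel count m follows the parameter-matching rule from the R(2+1)D paper, so that the factorized pair has roughly as many parameters as the full L×H×W convolution it replaces:

(* hypothetical helper: factorize an L x H x W convolution into a 1 x H x W
   spatial convolution followed by an L x 1 x 1 temporal convolution *)
r2plus1dConv[nIn_, nOut_, {l_, h_, w_}] := Module[{m},
  (* choose m so that m (h w nIn + l nOut) ≈ l h w nIn nOut *)
  m = Floor[(l*h*w*nIn*nOut)/(h*w*nIn + l*nOut)];
  NetChain[{
    ConvolutionLayer[m, {1, h, w}, "PaddingSize" -> {0, Floor[h/2], Floor[w/2]}],
    BatchNormalizationLayer[],
    Ramp,
    ConvolutionLayer[nOut, {l, 1, 1}, "PaddingSize" -> {Floor[l/2], 0, 0}]}]];
r2plus1dConv[64, 64, {3, 3, 3}]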

The summary graphics for the "(2+1)D", "3D" and "Mixed" spatiotemporal ResNet-18 architectures are shown below:

Out[35]=

Advanced usage

Evaluation of the network with the recommended (default) settings is time-consuming:

In[36]:=
AbsoluteTiming@
 NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"][video]
Out[36]=

Evaluation time scales almost linearly with "TargetLength", so the encoder can be tuned to trade accuracy for speed:

In[38]:=
enc[nframes_] := NetEncoder[{"VideoFrames", {112, 112},
    "MeanImage" -> {0.43216, 0.394666, 0.37645},
    "VarianceImage" -> {0.22803, 0.22145, 0.216989},
    "ColorSpace" -> "RGB", "TargetLength" -> nframes, FrameRate -> Inherited}];
newEncoder = enc[5]
Out[38]=

Replace the encoder in the original NetModel:

In[40]:=
newNet = NetReplacePart[
  NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "Input" -> newEncoder]
Out[40]=

Evaluate on the original video: the evaluation time is reduced to about a third, while the prediction remains correct:

In[42]:=
AbsoluteTiming@newNet[ResourceData["Sample Video: Barbecuing"]]
Out[42]=
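
The trade-off can be mapped by timing the evaluation for several values of "TargetLength" (a minimal sketch reusing enc and video from above; timings vary by machine):

Table[nframes -> First@AbsoluteTiming[
     NetReplacePart[
       NetModel[
        "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "Input" -> enc[nframes]][video]],
 {nframes, {4, 8, 16, 32}}]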

Feature extraction

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[43]:=
extractor = NetTake[NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], {1, -2}]
Out[43]=
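
Check the shape of the feature vector produced by the extractor (for a ResNet-18-style backbone, one would expect a 512-dimensional output; this is an expectation, not a measured result):

Information[extractor, "OutputPorts"]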

Get a set of videos:

In[44]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[45]:=
FeatureSpacePlot[videos, FeatureExtractor -> extractor,
 LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &),
 LabelingSize -> 100, ImageSize -> 600]
Out[45]=
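
The same features support retrieval; for example, find the video whose features are nearest to those of the first one (a minimal sketch reusing the videos and extractor defined above; position 1 is the query itself):

features = extractor /@ videos;
Nearest[features -> "Index", First[features], 2]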

Net information

Inspect the number of parameters of all arrays in the net:

In[46]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[46]=

Obtain the total number of parameters:

In[48]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[48]=

Obtain the layer type counts:

In[50]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[50]=

Display the summary graphic:

In[52]:=
Information[
 NetModel[
  "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[52]=

Export to ONNX

Export the net to the ONNX format:

In[54]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]]
Out[54]=

Get the size of the ONNX file:

In[56]:=
FileByteCount[onnxFile]
Out[56]=

The size is similar to the byte count of the resource object:

In[57]:=
ResourceObject[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"]["ByteCount"]["EvaluationNet:(2+1)D"]
Out[57]=

Check some metadata of the ONNX model:

In[59]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[59]=

Import the model back into the Wolfram Language; the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[60]:=
Import[onnxFile]
Out[60]=
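
The original pre- and post-processing can be reattached by hand. This sketch assumes the imported net exposes ports named "Input" and "Output"; the actual port names can be checked with Information:

net = NetModel[
   "Spatiotemporal ResNet-18 for Action Recognition Trained on Kinetics-400 Data"];
NetReplacePart[Import[onnxFile],
 {"Input" -> NetExtract[net, "Input"], "Output" -> NetExtract[net, "Output"]}]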
