# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Identify the main action in a video

Released in 2018, this family of models is obtained by splitting the 3D convolutional filters into distinct spatial and temporal components, yielding a significant increase in accuracy.

Number of models: 3

- Kinetics-400 Human Action Video Data, consisting of 400 human action classes with at least 400 video clips per class. Each clip lasts around 10 seconds and is taken from a distinct YouTube video.

The models achieve the following 5-crop accuracies for clip length 16 (16×112×112) on the Kinetics-400 dataset.

Get the pre-trained net:

In[1]:= |

Out[2]= |
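The original input cells are not reproduced on this page; a minimal Wolfram Language sketch of this step follows. The resource name is a placeholder — the repository page lists the exact name:

```wl
(* Load the default pre-trained member of the family;
   the resource name below is illustrative, not the exact one *)
net = NetModel["Video Action Recognition Net Trained on Kinetics-400 Data"]
```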

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[3]:= |

Out[4]= |
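A sketch of the parameter inspection, using the same placeholder resource name; `NetModel`'s "ParametersInformation" property lists the allowed settings:

```wl
(* Query the family's parameters and their allowed values *)
NetModel["Video Action Recognition Net Trained on Kinetics-400 Data",
  "ParametersInformation"]
```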

Pick a non-default net by specifying the parameters:

In[5]:= |

Out[6]= |
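A hedged sketch of the selection step; the parameter name "Architecture" and its values ("3D", "Mixed", "(2+1)D") are assumptions based on the variants described later on this page:

```wl
(* Select a specific (non-default) member of the family;
   "Architecture" is an assumed parameter name *)
NetModel[{"Video Action Recognition Net Trained on Kinetics-400 Data",
  "Architecture" -> "3D"}]
```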

Pick a non-default uninitialized net:

In[7]:= |

Out[8]= |
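The same parametrized request can be combined with the "UninitializedEvaluationNet" property to get the architecture without trained weights (parameter names again assumed):

```wl
(* Same selection, but without the trained weights *)
NetModel[{"Video Action Recognition Net Trained on Kinetics-400 Data",
  "Architecture" -> "Mixed"}, "UninitializedEvaluationNet"]
```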

Get a video:

In[9]:= |

Show some of the video frames:

In[10]:= |

Out[10]= |
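A sketch of these two steps, assuming a local clip stands in for the original sample video:

```wl
(* Any short action clip will do here; the path is a placeholder *)
video = Video["path/to/clip.mp4"];

(* Grab a handful of equally spaced frames for inspection *)
VideoFrameList[video, 4]
```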

Identify the main action in a video:

In[11]:= |

Out[12]= |
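Assuming `net` and `video` are the objects loaded above, the classification step itself is a single application:

```wl
(* The output NetDecoder maps the class scores to an action label *)
net[video]
```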

Obtain the probabilities of the 10 most likely entities predicted by the net:

In[13]:= |

Out[14]= |
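With the standard class decoder used by repository models, the top predictions can be requested like this:

```wl
(* Ask the decoder for the ten highest-probability classes *)
net[video, {"TopProbabilities", 10}]
```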

An activity outside the list of the Kinetics-400 classes will be misidentified:

In[15]:= |

Out[16]= |

In[17]:= |

Out[18]= |

Obtain the list of names of all available classes:

In[19]:= |

Out[20]= |
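One way to recover the label list, assuming the standard class-decoder layout used by repository models:

```wl
(* The class names are stored in the output NetDecoder *)
NetExtract[net, "Output"][["Labels"]]
```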

Identify the main action of the video over a moving window of frames:

In[21]:= |

Out[22]= |
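A rough sketch of a sliding-window evaluation; the window and stride sizes, and the assumption that the net accepts a list of frames directly, are illustrative only:

```wl
(* Sample frames, split them into overlapping 16-frame windows
   and classify each window separately *)
frames = VideoFrameList[video, 64];
windows = Partition[frames, 16, 8];
net /@ windows
```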

Extract the weights of the first convolutional layer in the trained net:

In[23]:= |

Out[24]= |

In[25]:= |

Show the dimensions of the weights:

In[26]:= |

Out[26]= |

Extract the kernels corresponding to the receptive fields:

In[27]:= |

Out[27]= |

Visualize the weights as a list of 45 images of size 7×7:

In[28]:= |

Out[28]= |
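The weight-inspection steps above can be sketched as follows; the layer name "conv1" and the kernel shapes are assumptions that depend on the chosen variant:

```wl
(* Pull the weight tensor of the first convolutional layer;
   the layer name "conv1" is assumed *)
weights = NetExtract[net, {"conv1", "Weights"}];

(* For the first spatial convolution of a "(2+1)D" net this might be
   {45, 3, 1, 7, 7}: 45 output channels, 3 input channels, 1x7x7 kernels *)
Dimensions[weights]

(* Drop the singleton temporal dimension and render each kernel *)
ImageAdjust[Image[#[[All, 1]], Interleaving -> False]] & /@ Normal[weights]
```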

3D convolutional neural networks (3DCNNs) preserve temporal information by applying filters over both time and space. In the "3D" architecture, every convolutional block is a 3D convolution. 3D kernels have size *L*×*H*×*W*, where *L* denotes the temporal extent of the filter and *H*×*W* its height and width. Extract the kernel size of the first convolutional layer from each of the convolutional blocks:

In[29]:= |

Out[30]= |
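A sketch of this extraction, assuming the "3D" variant is bound to `net3d`, the blocks are named "conv2" through "conv5" and each begins with a ConvolutionLayer:

```wl
(* Read the LxHxW kernel size of the first convolution in each block;
   block names and positions are assumptions *)
NetExtract[net3d, {#, 1, "KernelSize"}] & /@ {"conv2", "conv3", "conv4", "conv5"}
```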

"Mixed" 3D convolutional neural networks (3DCNNs) use 3D convolutions for early layers and 2D convolutions for later layers. In the later blocks, *L*=1, which implies that different frames are processed independently:

In[31]:= |

Out[32]= |

Another way to approach this problem is to replace each 3D kernel of size *L*×*H*×*W* with a "(2+1)D" block consisting of a spatial 2D convolutional filter of size 1×*H*×*W* followed by a temporal convolutional filter of size *L*×1×1. Extract the first two convolution kernels from each block to explore the alternating spatial and temporal convolutions:

In[33]:= |

Out[34]= |
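A hedged sketch, assuming the "(2+1)D" variant is bound to `net2p1d` and that the first two layers of each named block are the spatial and temporal convolutions:

```wl
(* Alternating spatial (1xHxW) and temporal (Lx1x1) kernel sizes;
   block names and layer positions are assumptions *)
Table[NetExtract[net2p1d, {block, i, "KernelSize"}],
  {block, {"conv2", "conv3", "conv4", "conv5"}}, {i, 2}]
```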

The summary graphs for spatiotemporal ResNet-18 architectures of "(2+1)D", "3D" and "Mixed" models are presented below:

Out[35]= |

Evaluating the network with the recommended settings is time-consuming:

In[36]:= |

Out[37]= |

Evaluation speed scales almost linearly with the encoder's "TargetLength" setting, which can be tuned to trade accuracy for speed:

In[38]:= |

Out[39]= |

Replace the encoder in the original NetModel:

In[40]:= |

Out[41]= |
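These encoder steps might look as follows; the "VideoFrames" encoder name and its options are assumptions to be checked against the net's actual input encoder:

```wl
(* Build a cheaper encoder that samples fewer frames;
   encoder name and options are assumed *)
fastEncoder = NetEncoder[{"VideoFrames",
  "TargetLength" -> 8, "ImageSize" -> {112, 112}}];

(* Swap it into the trained net *)
fastNet = NetReplacePart[net, "Input" -> fastEncoder];
```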

Evaluate on the original video. The evaluation time drops to roughly a third, and the prediction remains correct:

In[42]:= |

Out[42]= |

Remove the last two layers of the trained net so that it produces a vector representation of a video:

In[43]:= |

Out[44]= |

Get a set of videos:

In[45]:= |

Visualize the features of a set of videos:

In[46]:= |

Out[46]= |
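A sketch of the feature-extraction workflow; `NetDrop` removes the final classification layers, and the video list is a placeholder:

```wl
(* Truncate the classifier so it emits a feature vector *)
extractor = NetDrop[net, -2];

(* Embed a few clips and plot them in feature space;
   the file names are placeholders *)
videos = {Video["clip1.mp4"], Video["clip2.mp4"], Video["clip3.mp4"]};
FeatureSpacePlot[videos, FeatureExtractor -> extractor]
```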

Inspect the number of parameters of all arrays in the net:

In[47]:= |

Out[48]= |

Obtain the total number of parameters:

In[49]:= |

Out[50]= |

Obtain the layer type counts:

In[51]:= |

Out[52]= |

Display the summary graphic:

In[53]:= |

Out[54]= |
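The four property queries above all go through `NetInformation`; these property names are part of its documented API:

```wl
NetInformation[net, "ArraysElementCounts"]      (* parameters per array *)
NetInformation[net, "ArraysTotalElementCount"]  (* total parameter count *)
NetInformation[net, "LayerTypeCounts"]          (* layers by type *)
NetInformation[net, "SummaryGraphic"]           (* architecture summary *)
```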

Export the net to the ONNX format:

In[55]:= |

Out[56]= |

Get the size of the ONNX file:

In[57]:= |

Out[57]= |

The size is similar to the byte count of the resource object:

In[58]:= |

Out[59]= |

Check some metadata of the ONNX model:

In[60]:= |

Out[60]= |

Import the model back into the Wolfram Language. The NetEncoder and NetDecoder will be absent, because they are not supported by ONNX:

In[61]:= |

Out[61]= |
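A sketch of the ONNX round trip described above; the file path is illustrative:

```wl
(* Export to ONNX and check the file size *)
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], net];
FileByteCount[onnxFile]

(* Re-import; the NetEncoder/NetDecoder must be re-attached manually *)
imported = Import[onnxFile]
```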

- D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, "A Closer Look at Spatiotemporal Convolutions for Action Recognition," arXiv:1711.11248 (2018)
- Available from:
- Rights: BSD 3-Clause License