Resource retrieval
Get the pre-trained net:
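A minimal sketch of loading the model (the exact resource name is an assumption; substitute this model's name as it appears in the Wolfram Neural Net Repository):

```wl
(* load the default pre-trained net from the Neural Net Repository *)
net = NetModel["R2Plus1D Trained on Kinetics-400 Data"]
```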
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
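For example (resource name assumed as above):

```wl
(* list the available parameters and their possible values *)
NetModel["R2Plus1D Trained on Kinetics-400 Data", "ParametersInformation"]
```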
Pick a non-default net by specifying the parameters:
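A sketch, assuming the parameter is named "Architecture" as in the discussion below:

```wl
(* request the variant built entirely from 3D convolutions *)
NetModel[{"R2Plus1D Trained on Kinetics-400 Data", "Architecture" -> "3D"}]
```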
Pick a non-default uninitialized net:
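Under the same assumptions, the uninitialized architecture can be requested via the standard "UninitializedEvaluationNet" property:

```wl
(* return the architecture without trained weights *)
NetModel[{"R2Plus1D Trained on Kinetics-400 Data", "Architecture" -> "Mixed"},
 "UninitializedEvaluationNet"]
```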
Basic usage
Get a video:
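One way to obtain a Video object, assuming a hypothetical local file:

```wl
(* "path/to/video.mp4" is a placeholder for an actual video file *)
video = Video["path/to/video.mp4"]
```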
Show some of the video frames:
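For example:

```wl
(* extract four frames, uniformly spaced over the duration of the video *)
VideoFrameList[video, 4]
```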
Identify the main action in a video:
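Applying the net directly to the video returns the decoded class:

```wl
(* the attached NetDecoder returns the most probable Kinetics-400 class *)
net[video]
```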
Obtain the probabilities of the 10 most likely entities predicted by the net:
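For example:

```wl
(* top 10 classes with their probabilities *)
net[video, {"TopProbabilities", 10}]
```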
An activity outside the list of the Kinetics-400 classes will be misidentified:
Obtain the list of names of all available classes:
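A sketch, assuming the class labels live in the decoder attached to the "Output" port:

```wl
(* extract the label list from the net's decoder *)
NetExtract[net, "Output"][["Labels"]]
```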
Identify the main action of the video using a moving window of frames:
Visualize convolutional weights
Extract the weights of the first convolutional layer in the trained net:
Show the dimensions of the weights:
Extract the kernels corresponding to the receptive fields:
Visualize the weights as a list of 45 images of size 7×7:
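A sketch of the whole extraction pipeline; the layer name "conv1" is hypothetical, and the dimension ordering {outputChannels, inputChannels, time, height, width} is an assumption:

```wl
(* extract the raw weight array of the first convolutional layer *)
weights = Normal@NetExtract[net, {"conv1", "Weights"}];
Dimensions[weights]
(* average over input channels, then drop the singleton time dimension,
   leaving one 7x7 spatial kernel per output channel *)
kernels = Map[Mean, weights][[All, 1]];
ImageAdjust[Image[#]] & /@ kernels
```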
Network architecture
3D convolutional neural networks (3DCNNs) preserve temporal information by convolving over both time and space. In the "3D" architecture, all the convolutional blocks are 3D. A 3D kernel has size L×H×W, where L denotes the temporal extent of the filter and H×W its height and width. Extract the kernel size of the first convolutional layer from each of the convolutional blocks:
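One way to gather the kernel sizes without assuming internal layer names is to query every convolution layer in the net:

```wl
(* collect the kernel size of each ConvolutionLayer in the "3D" variant;
   the resource name is an assumption *)
net3D = NetModel[{"R2Plus1D Trained on Kinetics-400 Data", "Architecture" -> "3D"}];
NetExtract[#, "KernelSize"] & /@
 Select[Information[net3D, "Layers"], Head[#] === ConvolutionLayer &]
```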
"Mixed" 3D convolutional neural networks (3DCNNs) use 3D convolutions for early layers and 2D convolutions for later layers. In the later blocks, L=1, which implies that different frames are processed independently:
Another way to approach this problem is to replace 3D kernels of size L×H×W with a "(2+1)D" block consisting of spatial 2D convolutional filters of size 1×H×W followed by temporal convolutional filters of size L×1×1. Extract the first two convolution kernels from each block to explore the alternating spatial and temporal convolutions:
The summary graphs for spatiotemporal ResNet-18 architectures of "(2+1)D", "3D" and "Mixed" models are presented below:
Advanced usage
The recommended evaluation of the network is time-consuming:
Evaluation speed scales almost linearly with "TargetLength", so the encoder can be tuned to trade speed against accuracy:
Replace the encoder in the original NetModel:
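A sketch of the replacement, where fastEncoder stands for an encoder rebuilt with a smaller "TargetLength" (its exact specification depends on the original encoder and is not shown here):

```wl
(* inspect the current encoder before replacing it *)
NetExtract[net, "Input"]
(* attach the faster encoder to the input port *)
fastNet = NetReplacePart[net, "Input" -> fastEncoder]
```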
Evaluate on the original video. The evaluation time is reduced to a third, while the prediction remains correct:
Feature extraction
Remove the last two layers of the trained net so that the net produces a vector representation of a video:
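A sketch, assuming the top level of the net is a NetChain:

```wl
(* NetDrop removes the last two layers, exposing the penultimate feature vector *)
extractor = NetDrop[net, -2]
```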
Get a set of videos:
Visualize the features of a set of videos:
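For example, using the truncated net as a feature extractor (videos is assumed to be a list of Video objects):

```wl
(* embed the videos in 2D using the net's penultimate features *)
FeatureSpacePlot[videos, FeatureExtractor -> extractor]
```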
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
Obtain the layer type counts:
Display the summary graphic:
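The four steps above can be sketched with the standard Information properties for nets:

```wl
Information[net, "ArraysElementCounts"]      (* parameter count per array *)
Information[net, "ArraysTotalElementCount"]  (* total number of parameters *)
Information[net, "LayerTypeCounts"]          (* count of each layer type *)
Information[net, "SummaryGraphic"]           (* summary graphic of the net *)
```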
Export to ONNX
Export the net to the ONNX format:
Get the size of the ONNX file:
The size is similar to the byte count of the resource object:
Check some metadata of the ONNX model:
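A sketch of the export and inspection steps; the file name is a placeholder, and the "OpsetVersion" import element and resource name are assumptions:

```wl
(* export the trained net to the ONNX format *)
Export["net.onnx", net]
(* compare the file size with the resource's byte count *)
FileByteCount["net.onnx"]
ResourceObject["R2Plus1D Trained on Kinetics-400 Data"]["ByteCount"]
(* check some metadata of the exported model *)
Import["net.onnx", {"ONNX", "OpsetVersion"}]
```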
Import the model back into the Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:
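For example:

```wl
(* importing yields the bare net; the video NetEncoder and class NetDecoder
   must be reattached manually *)
imported = Import["net.onnx"]
```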