Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data
Inspired by 2D separable convolutions in image classification, the authors propose 3D channel-separated convolutional networks (CSNs), in which all convolutional operations are factorized into either pointwise 1×1×1 or depthwise 3×3×3 convolutions. This factorization yields significant accuracy gains on the Sports-1M, Kinetics and Something-Something datasets while making the networks two to three times faster than comparable 3D architectures.
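As an illustration only, and not the exact blocks used in the released model, a channel-separated unit can be sketched in Wolfram Language as a pointwise convolution (which handles all channel interactions) followed by a depthwise grouped 3D convolution (which handles only local spatiotemporal interactions); the channel count and input dimensions below are hypothetical:

  block = NetChain[{
      ConvolutionLayer[64, {1, 1, 1}],  (* pointwise 1x1x1: channel interactions *)
      BatchNormalizationLayer[], Ramp,
      ConvolutionLayer[64, {3, 3, 3}, PaddingSize -> 1, "ChannelGroups" -> 64]  (* depthwise 3x3x3: spatiotemporal only *)
     },
     "Input" -> {3, 8, 112, 112}]  (* channels x frames x height x width, chosen for illustration *)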
Examples
Resource retrieval
Get the pre-trained net:
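A minimal retrieval sketch; the resource name is assumed to match this page's title:

  net = NetModel["Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"]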
Basic usage
Classify a video:
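A usage sketch, assuming the net's input encoder accepts a Video object directly (some video models instead expect a fixed number of frames sampled with VideoFrames); the clip path is hypothetical:

  video = Video["path/to/clip.mp4"];  (* hypothetical local video file *)
  pred = net[video]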
Obtain the probabilities predicted by the net:
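Assuming the net ends in a standard "Class" decoder, the probability vector over the 400 Kinetics classes can be requested as a property and the top entries extracted with TakeLargest:

  probs = net[video, "Probabilities"];
  TakeLargest[probs, 5]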
Feature extraction
Remove the last three layers of the trained net so that the net produces a vector representation of a video:
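One way to do this, assuming the model is a NetChain whose final three layers perform the pooling, linear classification and softmax (a NetGraph would instead need NetTake with explicit layer names):

  extractor = Drop[net, -3]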
Get a set of videos:
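A sketch with hypothetical file paths; any list of short clips works:

  videos = Video /@ {
     "videos/swimming1.mp4", "videos/swimming2.mp4",
     "videos/archery1.mp4", "videos/archery2.mp4"};  (* hypothetical paths *)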
Visualize the features of a set of videos:
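The truncated net can be supplied to FeatureSpacePlot as the feature extractor, so that videos of similar actions land near each other in the plot:

  FeatureSpacePlot[videos, FeatureExtractor -> extractor]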
Transfer learning
Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:
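A data-preparation sketch; the class names ("juggling", "unicycling") and the directory layout are hypothetical:

  videosFor[class_] := Video /@ FileNames["*.mp4", FileNameJoin[{"actions", class}]];
  examples[class_] := (# -> class) & /@ videosFor[class];
  all = RandomSample[Join[examples["juggling"], examples["unicycling"]]];
  {trainSet, testSet} = TakeDrop[all, Ceiling[0.8 Length[all]]];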
Remove the linear and the softmax layers from the pre-trained net:
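Assuming a NetChain whose final two layers are the LinearLayer and the SoftmaxLayer:

  tempNet = Drop[net, -2]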
Create a new net composed of the pre-trained net followed by a LinearLayer and a SoftmaxLayer:
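A sketch using the hypothetical class names from above; naming the new layer "Linear" lets it be addressed during training, and the output size of LinearLayer[] is inferred from the class decoder:

  newNet = NetChain[
    <|"base" -> tempNet, "Linear" -> LinearLayer[], "SoftMax" -> SoftmaxLayer[]|>,
    "Output" -> NetDecoder[{"Class", {"juggling", "unicycling"}}]]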
Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):
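A training sketch: setting the learning-rate multiplier to 0 for every other layer freezes all weights except those in the "Linear" layer:

  trainedNet = NetTrain[newNet, trainSet,
    LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}]  (* add TargetDevice -> "GPU" to train on a GPU *)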
Perfect accuracy is obtained on the test set:
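One way to check this with NetMeasurements (the reported score naturally depends on the data actually used):

  NetMeasurements[trainedNet, testSet, "Accuracy"]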
Net information
Inspect the number of parameters of all arrays in the net:
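Using the corresponding net Information property:

  Information[net, "ArraysElementCounts"]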
Obtain the total number of parameters:
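This is a single Information property as well:

  Information[net, "ArraysTotalElementCount"]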
Obtain the layer type counts:
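A count of layers grouped by type:

  Information[net, "LayerTypeCounts"]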
Display the summary graphic:
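The summary graphic is likewise available as an Information property:

  Information[net, "SummaryGraphic"]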
Resource History
Reference
D. Tran, H. Wang, L. Torresani, M. Feiszli, "Video Classification with Channel-Separated Convolutional Networks," ICCV 2019, arXiv:1904.02811