# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Identify the main action in a video

Released in 2019, this family of nets consists of three-dimensional (3D) versions of the original ShuffleNet V1 architecture, adapted for video classification. The ShuffleNet V1 architecture uses pointwise group convolutions and channel shuffling, two new operations that greatly reduce computational cost while maintaining accuracy. Trained on large-scale video datasets such as Jester and Kinetics-600, these models achieve much better accuracy than their two-dimensional counterparts on video classification tasks.

Number of models: 8

- Kinetics-600 dataset, containing 600 human action classes with at least 600 video clips for each action, plus 50 validation and 100 test videos per class.
- Jester dataset, containing 148,092 gesture videos across 27 classes.

The models achieve the following accuracies on the validation sets of the datasets they were trained on.

Get the pre-trained net:

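In the Wolfram Language this step goes through NetModel; the exact repository name string below is an assumption:

```wolfram
(* fetch the default net of the family; the model name is assumed *)
net = NetModel["3D ShuffleNet V1 Trained on Video Data"]
```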

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

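The parameter combinations can be listed with the "ParametersInformation" property of NetModel (model name assumed):

```wolfram
NetModel["3D ShuffleNet V1 Trained on Video Data", "ParametersInformation"]
```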

Pick a non-default net by specifying the parameters:

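Selecting a specific family member would look like this; the parameter name and value are illustrative, not necessarily the model's actual ones:

```wolfram
(* hypothetical parameter combination; check "ParametersInformation" for the real ones *)
NetModel[{"3D ShuffleNet V1 Trained on Video Data", "Dataset" -> "Jester"}]
```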

Pick a non-default uninitialized net:

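An uninitialized version is requested with the "UninitializedEvaluationNet" property (model name and parameter assumed):

```wolfram
NetModel[{"3D ShuffleNet V1 Trained on Video Data", "Dataset" -> "Jester"},
 "UninitializedEvaluationNet"]
```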

Identify the main action in a video:

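Evaluation on a clip would look roughly like this; the file path is hypothetical and the model name is assumed:

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
video = Video["ExampleClip.mp4"];                          (* hypothetical file *)
(* the net's NetEncoder samples frames from the Video object *)
net[video]
```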

Obtain the probabilities of the 10 most likely entities predicted by the net:

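Top-k probabilities are available through the standard class-decoder syntax (model name and file hypothetical):

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
video = Video["ExampleClip.mp4"];                          (* hypothetical file *)
net[video, {"TopProbabilities", 10}]
```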

Obtain the list of names of all available classes:

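The class labels live in the net's output NetDecoder; one way to read them out (model name assumed):

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
NetExtract[net, "Output"][["Labels"]]
```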

ShuffleNet-3D V1 features an efficient and elegant implementation of the channel shuffle operation: a feature map with *g*×*n* channels is reshaped so that the channel dimension is expanded into two dimensions of sizes (*g*, *n*), then transposed and flattened back to form the input of the next layer:

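The shuffle itself can be reproduced on a plain list of channel indices, which makes the reshape–transpose–flatten trick easy to see:

```wolfram
(* shuffle g*n = 12 channels across g = 3 groups *)
g = 3; n = 4;
channels = Range[g n];
Flatten[Transpose[ArrayReshape[channels, {g, n}]]]
(* => {1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12} *)
```

Inside a net, the same operation is a chain of ReshapeLayer, TransposeLayer and FlattenLayer acting on the channel dimension.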

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

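A sketch of the truncation, assuming the final two layers are the classification head (model name assumed):

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
extractor = NetDrop[net, -2]  (* drop the last two layers *)
```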

Get a set of videos:


Visualize the features of a set of videos:

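FeatureSpacePlot accepts the truncated net as a feature extractor; the model name is assumed and the video files are hypothetical:

```wolfram
extractor = NetDrop[NetModel["3D ShuffleNet V1 Trained on Video Data"], -2];
videos = Video /@ {"clip1.mp4", "clip2.mp4", "clip3.mp4"};  (* hypothetical files *)
FeatureSpacePlot[videos, FeatureExtractor -> extractor]
```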

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

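The split might be built from labeled Video objects along these lines; all file names and class labels here are hypothetical:

```wolfram
(* hypothetical clips of two actions absent from the training classes *)
allExamples = Join[
   Table[Video["juggling" <> ToString[i] <> ".mp4"] -> "juggling", {i, 20}],
   Table[Video["whittling" <> ToString[i] <> ".mp4"] -> "whittling", {i, 20}]];
{trainSet, testSet} = TakeDrop[RandomSample[allExamples], 30];
```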

Remove the linear layer from the pre-trained net:

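NetDrop handles the truncation; the model name and the number of layers to drop are assumptions about the net's structure:

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
tmpNet = NetDrop[net, -2]  (* drop the linear layer and softmax; depth assumed *)
```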

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

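One way to assemble the new classifier, with a fresh two-class linear head (model name, depth and class labels are illustrative):

```wolfram
tmpNet = NetDrop[NetModel["3D ShuffleNet V1 Trained on Video Data"], -2];
newNet = NetChain[<|
   "Base" -> tmpNet,
   "Linear" -> LinearLayer[2],
   "SoftMax" -> SoftmaxLayer[]|>,
  "Output" -> NetDecoder[{"Class", {"juggling", "whittling"}}]]  (* hypothetical classes *)
```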

Train on the dataset, freezing all the weights except for those in the new "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

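The freeze is expressed with LearningRateMultipliers, zeroing the learning rate everywhere except the new head (newNet and trainSet are the net and training data built in the previous steps):

```wolfram
trainedNet = NetTrain[newNet, trainSet,
  LearningRateMultipliers -> {"Linear" -> 1, _ -> 0},
  TargetDevice -> "CPU"]  (* switch to "GPU" if one is available *)
```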

Perfect accuracy is obtained on the test set:

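Accuracy on the held-out set can be checked with NetMeasurements (trainedNet and testSet from the previous steps):

```wolfram
NetMeasurements[trainedNet, testSet, "Accuracy"]
```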

Inspect the number of parameters of all arrays in the net:


Obtain the total number of parameters:

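Both counts are exposed as Information properties of the net (model name assumed):

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
Information[net, "ArraysElementCounts"]      (* per-array parameter counts *)
Information[net, "ArraysTotalElementCount"]  (* total number of parameters *)
```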

Obtain the layer type counts:

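Layer type counts are another Information property (model name assumed):

```wolfram
Information[NetModel["3D ShuffleNet V1 Trained on Video Data"], "LayerTypeCounts"]
```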

Display the summary graphic:

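So is the summary graphic (model name assumed):

```wolfram
Information[NetModel["3D ShuffleNet V1 Trained on Video Data"], "SummaryGraphic"]
```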

Export the net to the ONNX format:

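Export goes through the standard Export function; the output file name is hypothetical:

```wolfram
net = NetModel["3D ShuffleNet V1 Trained on Video Data"];  (* name assumed *)
onnxFile = Export["ShuffleNet3D.onnx", net]                (* hypothetical path *)
```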

Get the size of the ONNX file:

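The on-disk size is available with FileSize:

```wolfram
FileSize["ShuffleNet3D.onnx"]  (* hypothetical path from the export step *)
```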

Check some metadata of the ONNX model:

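Metadata elements can be read without importing the whole net; the element names below are assumptions about the ONNX importer:

```wolfram
Import["ShuffleNet3D.onnx", {"ONNX", "OpsetVersion"}]  (* element name assumed *)
Import["ShuffleNet3D.onnx", {"ONNX", "IRVersion"}]     (* element name assumed *)
```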

Import the model back into the Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

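Re-importing is a plain Import call on the hypothetical file from the export step:

```wolfram
Import["ShuffleNet3D.onnx"]  (* yields a net without the original NetEncoder/NetDecoder *)
```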

- O. Köpüklü, N. Kose, A. Gunduz, G. Rigoll, "Resource Efficient 3D Convolutional Neural Networks," arXiv:1904.02422 (2019)
- Available from:
- Rights: MIT License