ShuffleNet-3D V2 Trained on Video Datasets

Identify the main action in a video

Released in 2019, this family of nets consists of three-dimensional (3D) versions of the original ShuffleNet V2 architecture, adapted for video classification. The ShuffleNet V2 architecture utilizes a simple channel split operation that divides the input feature channels into two branches, reducing the cost of group convolutions. With the availability of large-scale video datasets such as Jester and Kinetics-600, these models achieve much better accuracies than their two-dimensional counterparts on video classification tasks.

Number of models: 8

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["ShuffleNet-3D V2 Trained on Video Datasets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["ShuffleNet-3D V2 Trained on Video Datasets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"ShuffleNet-3D V2 Trained on Video Datasets", "Dataset" -> "Kinetics", "Width" -> 0.25}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"ShuffleNet-3D V2 Trained on Video Datasets", "Dataset" -> "Jester", "Width" -> 2.0}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Identify the main action in a video:

In[5]:=
bbq = ResourceData["Sample Video: Barbecuing"];
In[6]:=
NetModel["ShuffleNet-3D V2 Trained on Video Datasets"][bbq]
Out[6]=

Obtain the probabilities of the 10 most likely entities predicted by the net:

In[7]:=
NetModel["ShuffleNet-3D V2 Trained on Video Datasets"][bbq, {"TopProbabilities", 10}]
Out[7]=

Obtain the list of names of all available classes:

In[8]:=
NetExtract[NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "Output"][["Labels"]]
Out[8]=
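
The number of classes depends on the training dataset: Kinetics-600 contains 600 action classes and Jester contains 27 gesture classes. As a quick check, count the labels of the default net:

Length[NetExtract[NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "Output"][["Labels"]]]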

Network architecture

In addition to the channel shuffle operations introduced in ShuffleNet-3D V1, the ShuffleNet-3D V2 architecture reduces costs by using a simple operator called "channel split," which splits the input of c feature channels into two branches with c/2 channels each. This reduces the cost of group convolutions and creates more "balanced" convolutions (with equal channel widths). In addition, elementwise operations like ReLU and depthwise convolutions exist in only one branch. After convolution, the two branches are concatenated so that the number of channels remains the same:

In[9]:=
NetExtract[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "block1c"]
Out[9]=
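
The channel-split pattern can be reproduced in a toy two-dimensional NetGraph. This is a purely illustrative sketch with arbitrary sizes (32 input channels on an 8×8 grid), omitting the batch normalization and 3D convolutions of the actual blocks; the final reshape-transpose-reshape sequence implements the channel shuffle:

NetGraph[<|
  "left" -> PartLayer[1 ;; 16], (* identity branch: first half of the channels *)
  "right" -> PartLayer[17 ;; 32], (* convolutional branch: second half *)
  "conv1" -> ConvolutionLayer[16, 1],
  "relu1" -> ElementwiseLayer[Ramp],
  "dwConv" -> ConvolutionLayer[16, 3, "ChannelGroups" -> 16, PaddingSize -> 1], (* depthwise 3x3 *)
  "conv2" -> ConvolutionLayer[16, 1],
  "relu2" -> ElementwiseLayer[Ramp],
  "catenate" -> CatenateLayer[], (* concatenation keeps the channel count at 32 *)
  "reshape1" -> ReshapeLayer[{2, 16, 8, 8}],
  "transpose" -> TransposeLayer[1 <-> 2], (* channel shuffle *)
  "reshape2" -> ReshapeLayer[{32, 8, 8}]|>,
 {NetPort["Input"] -> "left", NetPort["Input"] -> "right",
  "right" -> "conv1" -> "relu1" -> "dwConv" -> "conv2" -> "relu2",
  {"left", "relu2"} -> "catenate" -> "reshape1" -> "transpose" -> "reshape2"},
 "Input" -> {32, 8, 8}]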

For spatial downsampling, the channel split operator is removed, so the number of output channels is doubled:

In[10]:=
NetExtract[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "block1a"]
Out[10]=
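
In the same toy 2D style (arbitrary sizes, batch normalization omitted), the downsampling unit feeds the full input to both branches, each containing a stride-2 depthwise convolution, so that the concatenation doubles the channels from 16 to 32 while halving the spatial size:

NetGraph[<|
  "dwConvL" -> ConvolutionLayer[16, 3, "ChannelGroups" -> 16, "Stride" -> 2, PaddingSize -> 1],
  "convL" -> ConvolutionLayer[16, 1],
  "reluL" -> ElementwiseLayer[Ramp],
  "convR1" -> ConvolutionLayer[16, 1],
  "reluR1" -> ElementwiseLayer[Ramp],
  "dwConvR" -> ConvolutionLayer[16, 3, "ChannelGroups" -> 16, "Stride" -> 2, PaddingSize -> 1],
  "convR2" -> ConvolutionLayer[16, 1],
  "reluR2" -> ElementwiseLayer[Ramp],
  "catenate" -> CatenateLayer[] (* 16 + 16 channels -> 32 channels *)|>,
 {NetPort["Input"] -> "dwConvL", NetPort["Input"] -> "convR1", (* no channel split: both branches see the full input *)
  "dwConvL" -> "convL" -> "reluL",
  "convR1" -> "reluR1" -> "dwConvR" -> "convR2" -> "reluR2",
  {"reluL", "reluR2"} -> "catenate"},
 "Input" -> {16, 8, 8}]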

Feature extraction

Remove the last three layers of the trained net so that the net produces a vector representation of a video:

In[11]:=
brushing = ResourceData["Tooth Brushing Video Samples"];
In[12]:=
cheerleading = ResourceData["Cheerleading Video Samples"];
In[13]:=
extractor = NetTake[NetModel[
   "ShuffleNet-3D V2 Trained on Video Datasets"], {1, -4}]
Out[13]=
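
Apply the extractor to a single video to inspect the shape of the resulting feature vector (its length depends on the width multiplier of the chosen net):

Dimensions[extractor[First[brushing]]]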

Get a set of videos:

In[14]:=
videos = Join[brushing, cheerleading];

Visualize the features of a set of videos:

In[15]:=
FeatureSpacePlot[videos, FeatureExtractor -> extractor, LabelingFunction -> (Callout[
     Thumbnail@VideoExtractFrames[#1, Quantity[1, "Frames"]]] &), LabelingSize -> 50, ImageSize -> 600]
Out[15]=
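
The same features also support unsupervised grouping. For instance, FindClusters with the truncated net as a feature extractor should separate the two activities:

FindClusters[videos, 2, FeatureExtractor -> extractor]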

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[16]:=
videos = <|
   ResourceData["Sample Video: Reading a Book"] -> "reading book", ResourceData["Sample Video: Blowing Glitter"] -> "blowing glitter"|>;
In[17]:=
dataset = Join @@ KeyValueMap[
    Thread[
      VideoSplit[#1, Most@Table[
          Quantity[i, "Frames"], {i, 16, Information[#1, "FrameCount"][[1]], 16}]] -> #2] &,
    videos
    ];
In[18]:=
{train, test} = ResourceFunction["TrainTestSplit"][dataset, "TrainingSetSize" -> 0.7];

Remove the linear layer from the pre-trained net:

In[19]:=
tempNet = NetTake[NetModel[
   "ShuffleNet-3D V2 Trained on Video Datasets"], {1, -3}]
Out[19]=

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[20]:=
newNet = NetJoin[tempNet, NetChain[{ "Linear" -> LinearLayer[], "Softmax" -> SoftmaxLayer[]}],
   "Output" -> NetDecoder[{"Class", {"blowing glitter", "reading book"}}]]
Out[20]=

Train on the dataset, freezing all the weights except for those in the new "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[21]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1]]
Out[21]=

Perfect accuracy is obtained on the test set:

In[22]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[22]=
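
Inspect the per-class results with a confusion matrix plot:

ClassifierMeasurements[trainedNet, test, "ConfusionMatrixPlot"]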

Net information

Inspect the number of parameters of all arrays in the net:

In[23]:=
Information[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "ArraysElementCounts"]
Out[23]=

Obtain the total number of parameters:

In[24]:=
Information[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "ArraysTotalElementCount"]
Out[24]=

Obtain the layer type counts:

In[25]:=
Information[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "LayerTypeCounts"]
Out[25]=

Display the summary graphic:

In[26]:=
Information[
 NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "SummaryGraphic"]
Out[26]=

Export to ONNX

Export the net to the ONNX format:

In[27]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel["ShuffleNet-3D V2 Trained on Video Datasets"]]
Out[27]=

Get the size of the ONNX file:

In[28]:=
FileByteCount[onnxFile]
Out[28]=

Check some metadata of the ONNX model:

In[29]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[29]=

Import the model back into the Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[30]:=
Import[onnxFile]
Out[30]=
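
Since the ONNX file stores only the raw network, the original preprocessing and class decoding can be restored with NetReplacePart. This sketch assumes the imported net exposes single ports named "Input" and "Output":

rawNet = Import[onnxFile];
NetReplacePart[rawNet, {
  "Input" -> NetExtract[NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "Input"], (* video NetEncoder *)
  "Output" -> NetExtract[NetModel["ShuffleNet-3D V2 Trained on Video Datasets"], "Output"] (* class NetDecoder *)}]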

Resource History

Reference