MobileNet-3D V1 Trained on Video Datasets

Identify the action in a video

Released in 2019, this family of nets consists of three-dimensional (3D) versions of the original MobileNet V1 architecture, adapted for video classification. By combining depthwise separable convolutions with 3D convolutions, these light and efficient models achieve much higher video classification accuracy than their two-dimensional counterparts.

Number of models: 8

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["MobileNet-3D V1 Trained on Video Datasets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["MobileNet-3D V1 Trained on Video Datasets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"MobileNet-3D V1 Trained on Video Datasets", "Dataset" -> "Jester", "Width" -> 1.0}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"MobileNet-3D V1 Trained on Video Datasets", "Dataset" -> "Jester", "Width" -> 1.5}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Identify the main action in a video:

In[5]:=
yoga = ResourceData["Sample Video: Practicing Yoga"];
In[6]:=
NetModel["MobileNet-3D V1 Trained on Video Datasets"][yoga]
Out[6]=

Obtain the probabilities of the 10 most likely classes predicted by the net:

In[7]:=
NetModel["MobileNet-3D V1 Trained on Video Datasets"][yoga, {"TopProbabilities", 10}]
Out[7]=

Obtain the list of names of all available classes:

In[8]:=
NetExtract[NetModel["MobileNet-3D V1 Trained on Video Datasets"], "Output"][["Labels"]]
Out[8]=

Network architecture

MobileNet-3D V1 is characterized by depthwise separable convolutions, in which the full convolutional operator is split into two layers. The first layer, called a depthwise convolution, performs lightweight filtering by applying a single convolutional filter per input channel. This is realized by setting "ChannelGroups" equal to the number of input channels:

In[9]:=
NetExtract[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], {"block1", "conv1"}]
Out[9]=

The second layer is a 1⨯1⨯1 convolution called a pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels:

In[10]:=
NetExtract[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], {"block1", "conv2"}]
Out[10]=
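
Together, the depthwise and pointwise convolutions form a depthwise separable convolution block. The following is only an illustrative sketch of such a 3D block built in isolation, following the MobileNet V1 pattern of batch normalization and ReLU after each convolution; the channel counts, kernel size and input dimensions are arbitrary and are not taken from the model:

(* Illustrative 3D depthwise separable convolution block; all dimensions are arbitrary. *)
depthwiseSeparableBlock = NetChain[{
   ConvolutionLayer[32, {3, 3, 3}, "ChannelGroups" -> 32, "PaddingSize" -> 1], (* depthwise: one 3D filter per input channel *)
   BatchNormalizationLayer[], Ramp,
   ConvolutionLayer[64, {1, 1, 1}], (* pointwise: mix channels with 1x1x1 convolutions *)
   BatchNormalizationLayer[], Ramp},
  "Input" -> {32, 16, 112, 112} (* channels x frames x height x width *)
  ]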

Feature extraction

Remove the last layers of the trained net so that the net produces a vector representation of a video:

In[11]:=
extractor = NetTake[NetModel[
   "MobileNet-3D V1 Trained on Video Datasets"], {1, -5}]
Out[11]=
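
The extractor maps a video to a fixed-length feature vector. As a quick check, one can apply it to the yoga video from the Basic usage section (assuming that symbol is still defined) and inspect the dimensions of the result:

(* Assumes the "yoga" video from the Basic usage section is still defined. *)
Dimensions[extractor[yoga]]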

Get a set of videos:

In[12]:=
videos = Join[ResourceData["Tooth Brushing Video Samples"], ResourceData["Cheerleading Video Samples"]];

Visualize the features of a set of videos:

In[13]:=
FeatureSpacePlot[videos, FeatureExtractor -> extractor, LabelingFunction -> (Callout[
     Thumbnail@VideoExtractFrames[#1, Quantity[1, "Frames"]]] &), LabelingSize -> 50, ImageSize -> 600]
Out[13]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the training data. Split each video into short clips labeled with its class, then create a training set and a test set:

In[14]:=
videos = <|
   ResourceData["Sample Video: Reading a Book"] -> "reading book", ResourceData["Sample Video: Blowing Glitter"] -> "blowing glitter"|>;
In[15]:=
dataset = Join @@ KeyValueMap[
    Thread[
      VideoSplit[#1, Most@Table[
          Quantity[i, "Frames"], {i, 16, Information[#1, "FrameCount"][[1]], 16}]] -> #2] &,
    videos
    ];
In[16]:=
{train, test} = ResourceFunction["TrainTestSplit"][dataset, "TrainingSetSize" -> 0.7];
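
As an optional sanity check (not part of the original workflow), count how many labeled clips ended up in each split:

(* Number of labeled clips in the training and test sets. *)
Length /@ <|"train" -> train, "test" -> test|>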

Remove the last layers from the pre-trained net:

In[17]:=
tempNet = NetTake[NetModel[
   "MobileNet-3D V1 Trained on Video Datasets"], {1, -3}]
Out[17]=

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[18]:=
newNet = NetJoin[tempNet, NetChain[{"Linear" -> LinearLayer[], "SoftMax" -> SoftmaxLayer[]}], "Output" -> NetDecoder[{"Class", {"blowing glitter", "reading book"}}]]
Out[18]=

Train on the dataset, freezing all the weights except for those in the new "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[19]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1]]
Out[19]=

Perfect accuracy is obtained on the test set:

In[20]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[20]=
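
A confusion matrix over the test set gives a per-class view of the same result (a supplementary check analogous to the accuracy measurement above):

ClassifierMeasurements[trainedNet, test, "ConfusionMatrixPlot"]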

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=
Information[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], "ArraysElementCounts"]
Out[21]=

Obtain the total number of parameters:

In[22]:=
Information[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
Information[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], "LayerTypeCounts"]
Out[23]=

Display the summary graphic:

In[24]:=
Information[
 NetModel["MobileNet-3D V1 Trained on Video Datasets"], "SummaryGraphic"]
Out[24]=

Export to ONNX

Export the net to the ONNX format:

In[25]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel["MobileNet-3D V1 Trained on Video Datasets"]]
Out[25]=

Get the size of the ONNX file:

In[27]:=
FileByteCount[onnxFile]
Out[27]=

Check some metadata of the ONNX model:

In[28]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[28]=

Import the model back into the Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[29]:=
Import[onnxFile]
Out[29]=
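
If needed, the original NetEncoder and NetDecoder can be re-attached to the imported net, for example with NetReplacePart. This is only a sketch: it assumes the imported net exposes single "Input" and "Output" ports, which depends on how the exported ONNX graph is named:

(* Sketch: re-attach the original video encoder and class decoder.           *)
(* Assumes the imported net has ports named "Input" and "Output"; the actual *)
(* port names depend on the exported ONNX graph.                             *)
original = NetModel["MobileNet-3D V1 Trained on Video Datasets"];
NetReplacePart[Import[onnxFile], {
  "Input" -> NetExtract[original, "Input"],
  "Output" -> NetExtract[original, "Output"]}]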

Resource History

Reference