SqueezeNet-3D Trained on Video Datasets

Identify the main action in a video

Released in 2019, this family of nets consists of three-dimensional (3D) versions of the original SqueezeNet architecture. SqueezeNet drastically reduces the number of parameters by replacing most 3⨯3 convolutional filters with 1⨯1 filters and by decreasing the number of input channels to the remaining 3⨯3 filters. With the availability of large-scale video datasets such as Jester and Kinetics-600, the three-dimensional SqueezeNets achieve much better accuracies than their two-dimensional counterparts on video classification tasks.
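
To see how much the fire module saves, compare the weight count of a plain 3⨯3⨯3 convolution with that of a fire module producing the same number of output channels. The channel sizes below are hypothetical, chosen only for illustration, and biases are ignored:

plain = 64*64*3^3                           (* plain 3⨯3⨯3 convolution, 64 -> 64 channels *)
fire = 64*16*1^3 + 16*32*1^3 + 16*32*3^3    (* squeeze 64 -> 16, then 1⨯1⨯1 and 3⨯3⨯3 expands to 32 + 32 channels *)
N[fire/plain]                               (* about 0.14, a roughly 7x reduction *)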

Number of models: 2

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SqueezeNet-3D Trained on Video Datasets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SqueezeNet-3D Trained on Video Datasets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SqueezeNet-3D Trained on Video Datasets", "Dataset" -> "Jester"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SqueezeNet-3D Trained on Video Datasets", "Dataset" -> "Jester"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Identify the main action in a video:

In[5]:=
yoga = ResourceData["Sample Video: Practicing Yoga"];
In[6]:=
NetModel["SqueezeNet-3D Trained on Video Datasets"][yoga]
Out[6]=

Obtain the probabilities of the 10 most likely classes predicted by the net:

In[7]:=
NetModel["SqueezeNet-3D Trained on Video Datasets"][yoga, {"TopProbabilities", 10}]
Out[7]=

Obtain the list of names of all available classes:

In[8]:=
NetExtract[NetModel["SqueezeNet-3D Trained on Video Datasets"], "Output"][["Labels"]]
Out[8]=
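
The number of classes depends on the chosen dataset; count them for the default net:

labels = NetExtract[NetModel["SqueezeNet-3D Trained on Video Datasets"], "Output"][["Labels"]];
Length[labels]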

NetModel architecture

The SqueezeNet architecture uses the "fire module," which features a 1⨯1 "squeeze" convolution followed by 1⨯1 and 3⨯3 "expand" convolutions performed in parallel; in this 3D variant, the corresponding kernels are 1⨯1⨯1 and 3⨯3⨯3:

In[9]:=
NetExtract[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "block1"]
Out[9]=
In[10]:=
AssociationMap[
 NetExtract[
   NetModel["SqueezeNet-3D Trained on Video Datasets"], {"block1", #, "KernelSize"}
   ] &,
 Table["conv" <> ToString[j], {j, 3}]
 ]
Out[10]=

All modules follow this structure:

In[11]:=
Dataset@AssociationMap[
  Function[block,
   AssociationMap[
    NetExtract[
      NetModel["SqueezeNet-3D Trained on Video Datasets"], {block, #, "KernelSize"}] &,
    Table["conv" <> ToString[j], {j, 3}]
    ]
   ],
  Table["block" <> ToString[i], {i, 8}]
  ]
Out[11]=

Alternate modules also feature a residual skip connection:

In[12]:=
NetExtract[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "block2"]
Out[12]=

Feature extraction

Remove the last three layers of the trained net so that the net produces a vector representation of a video:

In[13]:=
extractor = NetTake[NetModel["SqueezeNet-3D Trained on Video Datasets"], {1, -4}]
Out[13]=

Get a set of videos:

In[14]:=
videos = Join[ResourceData["Tooth Brushing Video Samples"], ResourceData["Cheerleading Video Samples"]];

Visualize the features of a set of videos:

In[15]:=
FeatureSpacePlot[videos, FeatureExtractor -> extractor, LabelingFunction -> (Callout[
     Thumbnail[VideoExtractFrames[#1, Quantity[1, "Frames"]], 20]] &),
  LabelingSize -> 50, ImageSize -> 600]
Out[15]=
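
The extractor can also be applied directly to obtain the raw feature vectors, which can then be fed to any downstream method. For instance, this sketch clusters the videos (by index) into two groups with FindClusters:

features = extractor[videos];                                   (* one numeric vector per video *)
Dimensions[features]
clusters = FindClusters[features -> Range[Length[videos]], 2];  (* indices of the videos in each group *)
Length /@ clusters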

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a training set and a test set:

In[16]:=
videos = <|
   ResourceData["Sample Video: Reading a Book"] -> "reading book", ResourceData["Sample Video: Blowing Glitter"] -> "blowing glitter"|>;
In[17]:=
dataset = Join @@ KeyValueMap[
    Thread[
      VideoSplit[#1, Most@Table[
          Quantity[i, "Frames"], {i, 16, Information[#1, "FrameCount"][[1]], 16}]] -> #2] &,
    videos
    ];
In[18]:=
{train, test} = ResourceFunction["TrainTestSplit"][dataset, "TrainingSetSize" -> 0.7];
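
Inspect the split sizes and the class balance of the training set (the exact counts depend on the frame counts of the sample videos):

Length /@ <|"train" -> train, "test" -> test|>
Counts[Values[train]]   (* clips per class in the training set *)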

Remove the last two layers from the pre-trained net:

In[19]:=
tempNet = NetTake[NetModel["SqueezeNet-3D Trained on Video Datasets"], {1, -3}]
Out[19]=

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[20]:=
newNet = NetJoin[tempNet, NetChain[<|"Linear" -> LinearLayer[], "Softmax" -> SoftmaxLayer[]|>],
   "Output" -> NetDecoder[{"Class", {"blowing glitter", "reading book"}}]]
Out[20]=

Train on the dataset, freezing all the weights except for those in the new "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[21]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1]]
Out[21]=
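
As a quick sanity check, evaluate the trained net on the first test clip and compare with its true label:

trainedNet[test[[1, 1]]]   (* predicted class for the first test clip *)
test[[1, 2]]               (* its true label *)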

Perfect accuracy is obtained on the test set:

In[22]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[22]=

Net information

Inspect the number of parameters of all arrays in the net:

In[23]:=
Information[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "ArraysElementCounts"]
Out[23]=

Obtain the total number of parameters:

In[24]:=
Information[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "ArraysTotalElementCount"]
Out[24]=

Obtain the layer type counts:

In[25]:=
Information[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "LayerTypeCounts"]
Out[25]=

Display the summary graphic:

In[26]:=
Information[
 NetModel["SqueezeNet-3D Trained on Video Datasets"], "SummaryGraphic"]
Out[26]=

Export to ONNX

Export the net to the ONNX format:

In[27]:=
onnxFile = Export[FileNameJoin[{$TemporaryDirectory, "net.onnx"}], NetModel["SqueezeNet-3D Trained on Video Datasets"]]
Out[27]=

Get the size of the ONNX file:

In[28]:=
FileByteCount[onnxFile]
Out[28]=
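
The file size should be close to a rough estimate of four bytes per parameter (32-bit floats), plus a small overhead for the graph definition:

4*Information[NetModel["SqueezeNet-3D Trained on Video Datasets"], "ArraysTotalElementCount"]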

Check some metadata of the ONNX model:

In[29]:=
{opsetVersion, irVersion} = {Import[onnxFile, "OperatorSetVersion"], Import[onnxFile, "IRVersion"]}
Out[29]=

Import the model back into the Wolfram Language. However, the NetEncoder and NetDecoder will be absent because they are not supported by ONNX:

In[30]:=
Import[onnxFile]
Out[30]=
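
The coders can be restored by copying them from the original model with NetReplacePart; this sketch assumes the imported net keeps ports named "Input" and "Output":

NetReplacePart[Import[onnxFile],
 {"Input" -> NetExtract[NetModel["SqueezeNet-3D Trained on Video Datasets"], "Input"],
  "Output" -> NetExtract[NetModel["SqueezeNet-3D Trained on Video Datasets"], "Output"]}]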
