# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Identify the main action in a video

Released in 2019, this family of nets comprises three-dimensional (3D) versions of the original SqueezeNet architecture. SqueezeNet drastically reduces the number of parameters by replacing most 3×3 convolutional filters with 1×1 filters and by decreasing the number of input channels to the remaining 3×3 filters. With the availability of large-scale video datasets such as Jester and Kinetics-600, the 3D SqueezeNets achieve much better accuracies than their two-dimensional counterparts on video classification tasks.

Number of models: 2

- Kinetics-600 dataset, containing 600 human action classes with at least 600 video clips per action. Each class also has 50 validation and 100 test videos.
- Jester dataset, containing 148,092 gesture videos across 27 classes.

The models achieve the following accuracies on the validation sets of their respective training datasets.

Get the pre-trained net:

In[1]:= |

Out[1]= |
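The cell above is elided; as a sketch, the net would be retrieved with NetModel. The model name string is assumed from this page's title and may differ slightly in the repository:

```wl
(* the model name string is an assumption based on the page title *)
net = NetModel["3D SqueezeNet Trained on Video Datasets"]
```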

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |
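A sketch of the elided cell: NetModel's "ParametersInformation" property lists the parameters that distinguish the individual nets in the family (model name assumed as before):

```wl
NetModel["3D SqueezeNet Trained on Video Datasets", "ParametersInformation"]
```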

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[3]= |
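A sketch of the elided cell; the parameter name "Dataset" and the value "Jester" are assumptions, since the family's two nets differ by training dataset:

```wl
(* "Dataset" -> "Jester" is an assumed parameter specification *)
NetModel[{"3D SqueezeNet Trained on Video Datasets", "Dataset" -> "Jester"}]
```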

Pick a non-default uninitialized net:

In[4]:= |

Out[4]= |
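A sketch of the elided cell, using NetModel's documented "UninitializedEvaluationNet" property (parameter specification assumed as before):

```wl
NetModel[{"3D SqueezeNet Trained on Video Datasets", "Dataset" -> "Jester"},
 "UninitializedEvaluationNet"]
```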

Identify the main action in a video:

In[5]:= |

In[6]:= |

Out[6]= |
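A sketch of the elided cells, assuming `net` is the pre-trained model from above; the video file path is hypothetical:

```wl
(* hypothetical file path; any short clip of a supported video format works *)
video = Video["path/to/clip.mp4"];
net[video]
```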

Obtain the probabilities of the 10 most likely classes predicted by the net:

In[7]:= |

Out[7]= |
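A sketch of the elided cell, using the "TopProbabilities" specification of the net's class decoder on the `video` defined above:

```wl
net[video, {"TopProbabilities", 10}]
```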

Obtain the list of names of all available classes:

In[8]:= |

Out[8]= |
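A sketch of the elided cell: the class labels live in the net's output NetDecoder and can be read out with NetExtract:

```wl
NetExtract[net, "Output"][["Labels"]]
```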

The SqueezeNet architecture uses the "fire module," which features a 1×1 "squeeze" convolution followed by 1×1 and 3×3 "expand" convolutions performed in parallel:

In[9]:= |

Out[9]= |

In[10]:= |

Out[10]= |
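A sketch of the elided cells: the net's layers can be listed with NetInformation and an individual fire module extracted by name. The module name "fire2" is hypothetical; the actual names would be read off the layer listing:

```wl
NetInformation[net, "Layers"]  (* association of layer positions to layers *)
NetExtract[net, "fire2"]       (* "fire2" is a hypothetical module name *)
```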

All modules follow this structure:

In[11]:= |

Out[11]= |

Alternate modules also feature a residual skip connection:

In[12]:= |

Out[12]= |

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[13]:= |

Out[13]= |
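A sketch of the elided cell, assuming the net is a NetChain so that Drop applies directly:

```wl
(* assumes the trained net is a NetChain; drop the final two layers *)
extractor = Drop[net, -2]
```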

Get a set of videos:

In[14]:= |

Visualize the features of a set of videos:

In[15]:= |

Out[15]= |
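A sketch of the elided cell, assuming `videos` is the list of Video objects obtained above and `extractor` is the truncated net:

```wl
FeatureSpacePlot[videos, FeatureExtractor -> extractor]
```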

Use the pre-trained model to build a classifier that tells apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[16]:= |

In[17]:= |

In[18]:= |

Remove the last layers from the pre-trained net:

In[19]:= |

Out[19]= |

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[20]:= |

Out[20]= |
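A sketch of the elided cell; the class labels "classA" and "classB" are hypothetical stand-ins for the two chosen action classes, and `extractor` is the truncated pre-trained net:

```wl
newNet = NetChain[<|
   "feature" -> extractor,
   "Linear" -> LinearLayer[],
   "softmax" -> SoftmaxLayer[]|>,
  "Output" -> NetDecoder[{"Class", {"classA", "classB"}}]]
```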

Train on the dataset, freezing all the weights except those in the new "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[21]:= |

Out[21]= |
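A sketch of the elided cell, assuming `trainSet` is the training set built above; LearningRateMultipliers freezes every layer except "Linear":

```wl
trained = NetTrain[newNet, trainSet,
  LearningRateMultipliers -> {"Linear" -> 1, _ -> 0},
  TargetDevice -> "CPU"]  (* switch to "GPU" if available *)
```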

Perfect accuracy is obtained on the test set:

In[22]:= |

Out[22]= |
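A sketch of the elided cell, assuming `testSet` is the test set built above:

```wl
NetMeasurements[trained, testSet, "Accuracy"]
```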

Inspect the number of parameters of all arrays in the net:

In[23]:= |

Out[23]= |

Obtain the total number of parameters:

In[24]:= |

Out[24]= |

Obtain the layer type counts:

In[25]:= |

Out[25]= |

Display the summary graphic:

In[26]:= |

Out[26]= |
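The four inspection steps above (In[23]–In[26]) correspond to documented NetInformation properties; a sketch of the elided cells:

```wl
NetInformation[net, "ArraysElementCounts"]      (* parameters per array *)
NetInformation[net, "ArraysTotalElementCount"]  (* total parameter count *)
NetInformation[net, "LayerTypeCounts"]          (* counts by layer type *)
NetInformation[net, "SummaryGraphic"]           (* summary graphic *)
```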

Export the net to the ONNX format:

In[27]:= |

Out[27]= |

Get the size of the ONNX file:

In[28]:= |

Out[28]= |

Check some metadata of the ONNX model:

In[29]:= |

Out[29]= |

Import the model back into the Wolfram Language. Note that the NetEncoder and NetDecoder will be absent, because ONNX does not support them:

In[30]:= |

Out[30]= |
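The ONNX round trip above (In[27]–In[30]) can be sketched as follows; the file path is hypothetical, and the metadata element name "OpsetVersion" is an assumption about the ONNX import format:

```wl
Export["net.onnx", net]                        (* hypothetical path *)
FileByteCount["net.onnx"]                      (* size in bytes *)
Import["net.onnx", {"ONNX", "OpsetVersion"}]   (* assumed metadata element *)
Import["net.onnx"]  (* net without the original NetEncoder/NetDecoder *)
```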

- O. Köpüklü, N. Kose, A. Gunduz, G. Rigoll, "Resource Efficient 3D Convolutional Neural Networks," arXiv:1904.02422 (2019)
- Available from:
- Rights: MIT License