Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data
Inspired by 2D separable convolutions in image classification, the authors propose 3D channel-separated convolutional networks (CSNs), in which all convolutional operations are factorized into either pointwise 1×1×1 or depthwise 3×3×3 convolutions. This factorization yields significant accuracy gains on the Sports-1M, Kinetics and Something-Something datasets while making the networks two to three times faster than comparable 3D architectures.
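As an illustration only, and not the exact blocks used in the released model, a channel-separated unit can be sketched in Wolfram Language as a pointwise convolution (which handles all channel interactions) followed by a depthwise grouped 3D convolution (which handles only local spatiotemporal interactions); the channel count and input dimensions below are hypothetical:

  block = NetChain[{
      ConvolutionLayer[64, {1, 1, 1}],  (* pointwise 1x1x1: channel interactions *)
      BatchNormalizationLayer[], Ramp,
      ConvolutionLayer[64, {3, 3, 3}, PaddingSize -> 1, "ChannelGroups" -> 64]  (* depthwise 3x3x3: spatiotemporal only *)
     },
     "Input" -> {3, 8, 112, 112}]  (* channels x frames x height x width, chosen for illustration *)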
Examples
Resource retrieval
Get the pre-trained net:
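A minimal retrieval sketch; the resource name is assumed to match this page's title:

  net = NetModel["Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"]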
Basic usage
Classify a video:
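A usage sketch, assuming the net's input encoder accepts a Video object directly (some video models instead expect a fixed number of frames sampled with VideoFrames); the clip path is hypothetical:

  video = Video["path/to/clip.mp4"];  (* hypothetical local video file *)
  pred = net[video]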
Obtain the probabilities predicted by the net:
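Assuming the net ends in a standard "Class" decoder, the probability vector over the 400 Kinetics classes can be requested as a property and the top entries extracted with TakeLargest:

  probs = net[video, "Probabilities"];
  TakeLargest[probs, 5]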
Feature extraction
Remove the last three layers of the trained net so that the net produces a vector representation of a video:
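One way to do this, assuming the model is a NetChain whose final three layers perform the pooling, linear classification and softmax (a NetGraph would instead need NetTake with explicit layer names):

  extractor = Drop[net, -3]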
Get a set of videos:
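A sketch with hypothetical file paths; any list of short clips works:

  videos = Video /@ {
     "videos/swimming1.mp4", "videos/swimming2.mp4",
     "videos/archery1.mp4", "videos/archery2.mp4"};  (* hypothetical paths *)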
Visualize the features of a set of videos:
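The truncated net can be supplied to FeatureSpacePlot as the feature extractor, so that videos of similar actions land near each other in the plot:

  FeatureSpacePlot[videos, FeatureExtractor -> extractor]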
Transfer learning
Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:
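A data-preparation sketch; the class names ("juggling", "unicycling") and the directory layout are hypothetical:

  videosFor[class_] := Video /@ FileNames["*.mp4", FileNameJoin[{"actions", class}]];
  examples[class_] := (# -> class) & /@ videosFor[class];
  all = RandomSample[Join[examples["juggling"], examples["unicycling"]]];
  {trainSet, testSet} = TakeDrop[all, Ceiling[0.8 Length[all]]];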
Remove the linear and the softmax layers from the pre-trained net:
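Assuming a NetChain whose final two layers are the LinearLayer and the SoftmaxLayer:

  tempNet = Drop[net, -2]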
Create a new net composed of the pre-trained net followed by a LinearLayer and a SoftmaxLayer:
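A sketch using the hypothetical class names from above; naming the new layer "Linear" lets it be addressed during training, and the output size of LinearLayer[] is inferred from the class decoder:

  newNet = NetChain[
    <|"base" -> tempNet, "Linear" -> LinearLayer[], "SoftMax" -> SoftmaxLayer[]|>,
    "Output" -> NetDecoder[{"Class", {"juggling", "unicycling"}}]]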
Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):
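A training sketch: setting the learning-rate multiplier to 0 for every other layer freezes all weights except those in the "Linear" layer:

  trainedNet = NetTrain[newNet, trainSet,
    LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}]  (* add TargetDevice -> "GPU" to train on a GPU *)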
Perfect accuracy is obtained on the test set:
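One way to check this with NetMeasurements (the reported score naturally depends on the data actually used):

  NetMeasurements[trainedNet, testSet, "Accuracy"]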
Net information
Inspect the number of parameters of all arrays in the net:
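Using the corresponding net Information property:

  Information[net, "ArraysElementCounts"]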
Obtain the total number of parameters:
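This is a single Information property as well:

  Information[net, "ArraysTotalElementCount"]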
Obtain the layer type counts:
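A count of layers grouped by type:

  Information[net, "LayerTypeCounts"]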
Display the summary graphic:
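The summary graphic is likewise available as an Information property:

  Information[net, "SummaryGraphic"]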
Resource History
Reference
D. Tran, H. Wang, L. Torresani, M. Feiszli, "Video Classification with Channel-Separated Convolutional Networks," ICCV 2019, arXiv:1904.02811