Wolfram Research

Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data

Identify the main action in a video

Inspired by 2D separable convolutions in image classification, the authors propose 3D Channel-Separated Networks (CSNs), in which every convolution is either a pointwise 1×1×1 convolution or a depthwise 3×3×3 convolution. They report a significant accuracy improvement on the Sports-1M, Kinetics and Something-Something datasets while being two to three times faster.
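The effect of channel separation on model size can be sketched with stand-alone layers. The following is a minimal illustration, not the block structure of the actual net: the channel count, the input dimensions and the variable names are assumptions chosen for the example, and it assumes a Wolfram Language version in which ConvolutionLayer supports the "ChannelGroups" option. Ignoring biases, a conventional 3×3×3 convolution with 64 input and 64 output channels has 27·64² = 110,592 weights, while a pointwise plus a depthwise convolution together have 64² + 27·64 = 5,824.

```wolfram
(* Illustrative sizes only; not the dimensions used inside the actual net *)
channels = 64;
inputDims = {channels, 8, 56, 56}; (* assumed layout: {channels, frames, height, width} *)

(* Conventional 3x3x3 convolution: mixes channels and space-time at once *)
full = NetInitialize@ConvolutionLayer[channels, {3, 3, 3},
    PaddingSize -> 1, "Input" -> inputDims];

(* Pointwise 1x1x1 convolution: mixes channels only *)
pointwise = NetInitialize@ConvolutionLayer[channels, {1, 1, 1},
    "Input" -> inputDims];

(* Depthwise 3x3x3 convolution: one kernel per channel, no channel mixing *)
depthwise = NetInitialize@ConvolutionLayer[channels, {3, 3, 3},
    "ChannelGroups" -> channels, PaddingSize -> 1, "Input" -> inputDims];

(* Parameter counts, weights plus biases, for each layer *)
counts = Information[#, "ArraysTotalElementCount"] & /@
   <|"Full" -> full, "Pointwise" -> pointwise, "Depthwise" -> depthwise|>
```

Because PaddingSize -> 1 is used with the 3×3×3 kernels, all three layers keep the input dimensions, so the pointwise and depthwise layers could be chained in place of the conventional one.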

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"]
Out[1]=

Basic usage

Classify a video:

In[2]:=
video = ResourceData["Sample Video: Reading a Book"]
Out[2]=
In[3]:=
pred = NetModel[
   "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"][video]
Out[3]=

Obtain the probabilities predicted by the net:

In[4]:=
NetModel[
  "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"][video, {"TopProbabilities", 5}]
Out[4]=

Feature extraction

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[5]:=
extractor = NetDrop[NetModel[
   "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], -2]
Out[5]=

Get a set of videos:

In[6]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[7]:=
FeatureSpacePlot[videos, FeatureExtractor -> (extractor[#] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 50, ImageSize -> 500, Method -> "TSNE"]
Out[7]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set by cutting each video into overlapping 64-frame clips that start every 16 frames:

In[8]:=
videos = <|
   VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park", VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"|>;
In[9]:=
frameRate = 32;
sampleRate = 2;
maxFrameNumber = frameRate*sampleRate;
In[10]:=
dataset = Join @@ KeyValueMap[
    Table[VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2, {i, 1, Information[#1, "FrameCount"][[1]] - Mod[Information[#1, "FrameCount"][[1]], maxFrameNumber], Round[maxFrameNumber/4]}] &, videos];
In[11]:=
{train, test} = ResourceFunction["TrainTestSplit"][RandomSample[dataset], "TrainingSetSize" -> 0.7];
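
To see which frame ranges the Table iterator above generates, the clip boundaries can be computed without touching any video. This is a minimal sketch: frameCount = 300 is a hypothetical value chosen for illustration, and stride, starts and windows are names introduced here rather than part of the original example.

```wolfram
(* Hypothetical frame count, chosen only for illustration *)
frameCount = 300;
maxFrameNumber = 64; (* same value as frameRate*sampleRate above *)
stride = Round[maxFrameNumber/4];

(* Same iterator bounds as in the Table above: first frame of each clip *)
starts = Range[1, frameCount - Mod[frameCount, maxFrameNumber], stride];

(* First and last frame of each clip *)
windows = {#, # + maxFrameNumber - 1} & /@ starts
```

With these numbers there are 16 clips, the first spanning frames 1 to 64 and the last spanning frames 241 to 304. The last range extends past the hypothetical 300 frames, so it may be worth checking how VideoTrim handles such ranges on your own videos.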

Remove the linear and the softmax layers from the pre-trained net:

In[12]:=
tempNet = NetDrop[NetModel[
   "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], -2]
Out[12]=

Create a new net composed of the pre-trained net followed by a LinearLayer and a SoftmaxLayer:

In[13]:=
newNet = NetAppend[
  tempNet, {"Linear" -> LinearLayer[2, "Input" -> {2048}], "SoftMax" -> SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"Wild Ducks in the Park", "Freezing Bubble"}}]]
Out[13]=

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[15]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2]
Out[15]=
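
The freezing behavior of LearningRateMultipliers can be checked on a toy net that trains in seconds. This is a minimal sketch under stated assumptions: toyNet is a hypothetical stand-in for the pre-trained net, the layer sizes and the synthetic data are invented for illustration, and only the multiplier specification is taken from the call above.

```wolfram
(* Toy stand-in for the pre-trained net; layer names and sizes are hypothetical *)
toyNet = NetInitialize@NetChain[
    <|"Features" -> LinearLayer[4], "Linear" -> LinearLayer[2],
     "SoftMax" -> SoftmaxLayer[]|>,
    "Input" -> 3,
    "Output" -> NetDecoder[{"Class", {"a", "b"}}]];

(* Synthetic two-class data: the label depends on the sum of the inputs *)
SeedRandom[1];
toyData = Table[
   With[{x = RandomReal[1, 3]}, x -> If[Total[x] > 1.5, "a", "b"]], 64];

(* Same multiplier specification as above: only "Linear" is trained *)
toyTrained = NetTrain[toyNet, toyData,
   LearningRateMultipliers -> {"Linear" -> 1, _ -> 0},
   MaxTrainingRounds -> 5];

(* Largest absolute change in the weights of a given layer *)
weightChange[name_] := Max@Abs[
    Normal@NetExtract[toyTrained, {name, "Weights"}] -
     Normal@NetExtract[toyNet, {name, "Weights"}]];

{weightChange["Features"], weightChange["Linear"]}
```

The first value should be zero, since the "Features" layer is frozen, while the second should be positive because the "Linear" layer is the only one being updated.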

Perfect accuracy is obtained on the test set:

In[16]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[16]=

Net information

Inspect the number of parameters of all arrays in the net:

In[17]:=
Information[
 NetModel[
  "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[17]=

Obtain the total number of parameters:

In[18]:=
Information[
 NetModel[
  "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[18]=

Obtain the layer type counts:

In[19]:=
Information[
 NetModel[
  "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[19]=

Display the summary graphic:

In[20]:=
Information[
 NetModel[
  "Channel-Separated Video Action Classification Net Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[20]=

Resource History

Reference