SlowFast Video Action Classification Trained on Kinetics-400 Data

Identify the main action in a video

Inspired by human biology, this family of video recognition models was released in 2019. It features a slow pathway, operating at a low frame rate, to capture spatial semantics, and a fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution.
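
The two-pathway idea can be illustrated with plain frame indices. The following is a minimal sketch only: the clip length, the fast-pathway stride and the ratio alpha are hypothetical values chosen for illustration, not settings confirmed for the nets in this resource.

```wolfram
(* Hypothetical sampling parameters, chosen for illustration only *)
clipLength = 64;  (* raw frames in one clip *)
fastStride = 2;   (* the fast pathway keeps every 2nd raw frame *)
alpha = 8;        (* assumed frame-rate ratio between fast and slow pathways *)

(* Frame indices seen by each pathway *)
fastFrames = Range[1, clipLength, fastStride];
slowFrames = fastFrames[[;; ;; alpha]];

(* Number of frames processed by the fast and slow pathways *)
{Length[fastFrames], Length[slowFrames]}
```

Under these assumptions the fast pathway sees 32 frames and the slow pathway sees 4, so the slow pathway can spend its capacity on appearance while the fast pathway tracks motion.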

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"SlowFast Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "SlowFast-101", "FrameLength" -> 16}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"SlowFast Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "Slow-50"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Classify a video:

In[5]:=
video = ResourceData["Sample Video: Practicing Yoga"]
Out[5]=
In[6]:=
pred = NetModel[
   "SlowFast Video Action Classification Trained on Kinetics-400 Data"][video]
Out[6]=

Obtain the probabilities predicted by the net:

In[7]:=
NetModel[
  "SlowFast Video Action Classification Trained on Kinetics-400 Data"][video, {"TopProbabilities", 5}]
Out[7]=
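
Other properties of the "Class" decoder can be requested in the same way. The sketch below assumes the net's output decoder is a standard "Class" NetDecoder, which the "TopProbabilities" call above suggests; the variable name net is introduced here for illustration.

```wolfram
(* Load the net and the sample video used above *)
net = NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data"];
video = ResourceData["Sample Video: Practicing Yoga"];

(* The three most likely action labels, without their probabilities *)
net[video, {"TopDecisions", 3}]

(* Entropy of the predicted distribution; lower values indicate a more confident prediction *)
net[video, "Entropy"]
```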

Feature extraction

Remove the last three layers of the trained net so that the net produces a feature representation of a video:

In[8]:=
extractor = NetTake[NetModel[
   "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "Transpose_1"]
Out[8]=

Get a set of videos:

In[9]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[10]:=
FeatureSpacePlot[videos, FeatureExtractor -> (extractor[#] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 70, ImageSize -> 500, Method -> "TSNE"]
Out[10]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[11]:=
videos = <|
   VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park", VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"|>;
In[12]:=
frameRate = 32;
sampleRate = 2;
maxFrameNumber = frameRate*sampleRate;
In[13]:=
dataset = Join @@ KeyValueMap[
    Function@With[{frameCount = Information[#1, "FrameCount"][[1]]},
      Table[
       VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2,
       {i, 1, frameCount - Mod[frameCount, maxFrameNumber], Round[maxFrameNumber/4]}
       ]
      ],
    videos
    ];
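
The windowing logic above can be checked on plain integers before any video is trimmed. This is a sketch under an assumed frame count of 300, which is a hypothetical value and not the length of the sample videos; the names step, starts and windows are introduced here for illustration.

```wolfram
(* Hypothetical frame count, used only to illustrate the window layout *)
frameCount = 300;
maxFrameNumber = 64;                (* frames per clip, as defined above *)
step = Round[maxFrameNumber/4];     (* consecutive clips overlap by 3/4 *)

(* Start frames, following the same iterator as the Table above *)
starts = Range[1, frameCount - Mod[frameCount, maxFrameNumber], step];

(* {first frame, last frame} of every clip *)
windows = {#, # + maxFrameNumber - 1} & /@ starts;

{Length[windows], First[windows], Last[windows]}
```

With these numbers the last clips request frames beyond frame 300, so it is worth checking how VideoTrim handles such ranges for the actual videos. Because neighboring clips overlap heavily, a random train/test split will also place near-duplicate clips on both sides.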
In[14]:=
{train, test} = TakeDrop[RandomSample[dataset], Round[Length[dataset]*0.7]];

Remove the last three layers from the pre-trained net:

In[15]:=
tempNet = NetTake[NetModel[
   "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "Transpose_1"]
Out[15]=

Create a new net composed of the pre-trained net followed by a linear layer, an aggregation layer and a softmax layer:

In[16]:=
fc = NetMapThreadOperator[LinearLayer[{2}, "Input" -> {2304}], 3, "Input" -> {1, 2, 2, 2304}];
In[17]:=
newNet = NetJoin[tempNet, NetChain[{"Linear" -> fc, "GlobalAveragePool" -> AggregationLayer[Mean, ;; -2], "SoftMax" -> SoftmaxLayer[]}], "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]];
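
The new classification head can be sanity-checked on its own before it is attached to the pre-trained net. This sketch assumes the truncated net emits an array of dimensions {1, 2, 2, 2304}, as implied by the "Input" specification above; the random input stands in for real features, and the names head, features and probs are introduced here for illustration.

```wolfram
(* Rebuild the head in isolation and initialize it with random weights *)
fc = NetMapThreadOperator[LinearLayer[{2}, "Input" -> {2304}], 3, "Input" -> {1, 2, 2, 2304}];
head = NetInitialize[
   NetChain[{"Linear" -> fc,
     "GlobalAveragePool" -> AggregationLayer[Mean, ;; -2],
     "SoftMax" -> SoftmaxLayer[]}]];

(* Random stand-in for the features produced by the truncated net *)
features = RandomReal[1, {1, 2, 2, 2304}];
probs = head[features];

(* Dimensions of the output and the sum of its entries *)
{Dimensions[probs], Total[probs]}
```

If the shapes are as assumed, the result should be a two-element vector whose entries sum to 1 up to floating-point error, one entry per class.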

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[18]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2];

Perfect accuracy is obtained on the test set:

In[19]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[19]=

Net information

Inspect the number of parameters of all arrays in the net:

In[20]:=
Information[
 NetModel[
  "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[20]=

Obtain the total number of parameters:

In[21]:=
Information[
 NetModel[
  "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[21]=

Obtain the layer type counts:

In[22]:=
Information[
 NetModel[
  "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[22]=

Display the summary graphic:

In[23]:=
Information[
 NetModel[
  "SlowFast Video Action Classification Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[23]=

Requirements

Wolfram Language 13.1 (June 2022) or above
