SlowFast Video Action Classification Trained on Kinetics-400 Data

Identify the main action in a video

Inspired by human biology, this family of video recognition models was released in 2019. Each model features a slow pathway, which operates at a low frame rate to capture spatial semantics, and a fast pathway, which operates at a high frame rate to capture motion at fine temporal resolution.

Training Set Information

Kinetics-400 human action video data, consisting of four hundred human action classes, with at least four hundred video clips for each action. Each clip lasts around 10 seconds and is taken from a different YouTube video.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=
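A minimal sketch of the retrieval step, assuming the resource is registered under the name given in the page title (the variable name `net` is chosen here for illustration):

```wolfram
(* Assumption: the resource name matches the page title exactly *)
net = NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data"]
```

The first evaluation downloads the net, so it needs network access.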

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=

Out[3]=

Pick a non-default uninitialized net:

In[4]:=

Out[4]=
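The parameter workflow above can be sketched as follows. The resource name is assumed to match the page title, and `"param"` and `value` are placeholders rather than real parameter names of this model; the actual names and allowed values are whatever `"ParametersInformation"` reports.

```wolfram
(* Assumption: the resource name matches the page title exactly *)
modelName = "SlowFast Video Action Classification Trained on Kinetics-400 Data";

(* List the available parameters and their allowed values *)
info = NetModel[modelName, "ParametersInformation"]

(* General forms for picking a specific member of the family.
   "param" and value are placeholders, so these lines are left commented out:
   NetModel[{modelName, "param" -> value}]
   NetModel[{modelName, "param" -> value}, "UninitializedEvaluationNet"] *)
```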

Basic usage

Classify a video:

In[5]:=

Out[5]=

In[6]:=

Out[6]=

Obtain the probabilities predicted by the net:

In[7]:=

Out[7]=
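A minimal sketch of the two usage steps above, assuming the net is retrieved by the page title and using one of the sample videos that appears later on this page as input (the variable names are illustrative):

```wolfram
(* Assumption: the resource name matches the page title exactly *)
net = NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data"];

(* A sample video from the Wolfram Data Repository, also used in the transfer learning example *)
video = ResourceData["Sample Video: Wild Ducks in the Park"];

(* Most likely action class *)
label = net[video]

(* Probabilities for all classes, then the five most likely ones *)
probs = net[video, "Probabilities"];
TakeLargest[probs, 5]
```

The `"Probabilities"` property comes from the class decoder attached to the net output, so the values should sum to approximately 1.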

Feature extraction

Remove the last three layers of the trained net so that the net produces a vector representation of a video:

In[8]:=

Out[8]=
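One way to drop the last three layers is with `NetTake`. This is a sketch under two assumptions: the resource name matches the page title, and the last three top-level layers of the net form the classification head. The name `extractor` matches the one used in the plot below.

```wolfram
(* Assumption: the resource name matches the page title exactly *)
net = NetModel["SlowFast Video Action Classification Trained on Kinetics-400 Data"];

(* Keep everything from the first layer up to the fourth-from-last layer *)
extractor = NetTake[net, {1, -4}]
```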

Get a set of videos:

In[9]:=

Visualize the features of a set of videos:

In[10]:=

FeatureSpacePlot[videos, FeatureExtractor -> (extractor[#] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 70, ImageSize -> 500, Method -> "TSNE"]

Out[10]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[11]:=

videos = <|
  VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park",
  VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"
|>;

In[12]:=

frameRate = 32;
sampleRate = 2;
maxFrameNumber = frameRate*sampleRate;

In[13]:=

dataset = Join @@ KeyValueMap[
   Function@With[{frameCount = Information[#1, "FrameCount"][[1]]},
     Table[
      VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2,
      {i, 1, frameCount - Mod[frameCount, maxFrameNumber], Round[maxFrameNumber/4]}
      ]
     ],
   videos
   ];
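The window arithmetic in the cells above can be checked on its own. The sketch below reproduces the start and end frame indices for a hypothetical clip of 300 frames; the frame count and the names `stride`, `starts` and `windows` are assumptions for illustration.

```wolfram
frameRate = 32;
sampleRate = 2;
maxFrameNumber = frameRate*sampleRate;   (* 64 frames per window *)

frameCount = 300;                        (* hypothetical clip length *)
stride = Round[maxFrameNumber/4];        (* 16 frames, so consecutive windows overlap by 48 *)

(* Same iterator bounds as the Table in the dataset cell *)
starts = Range[1, frameCount - Mod[frameCount, maxFrameNumber], stride];
windows = {#, # + maxFrameNumber - 1} & /@ starts;

Length[windows]   (* 16 *)
First[windows]    (* {1, 64} *)
Last[windows]     (* {241, 304} *)
```

With these bounds the last few windows can extend past the final frame (here 304 > 300), so the clips near the end of a video may come out shorter than 64 frames, depending on how `VideoTrim` handles the out-of-range end point.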

In[14]:=
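The training cell further down refers to a `train` set, and accuracy is reported on a test set. One way to produce such a split is a shuffled 80/20 partition. This is only a sketch: the helper `splitDataset`, the 0.8 fraction and the toy data are assumptions, not necessarily what the original notebook uses.

```wolfram
(* Hypothetical helper: shuffle, then split into {training part, test part} *)
splitDataset[data_List, fraction_ : 0.8] :=
  TakeDrop[RandomSample[data], Round[fraction*Length[data]]]

(* Stand-in for the list of clip -> label rules *)
toyDataset = Table["clip" <> ToString[i] -> If[EvenQ[i], "A", "B"], {i, 20}];

SeedRandom[1234];
{toyTrain, toyTest} = splitDataset[toyDataset];
Length /@ {toyTrain, toyTest}   (* {16, 4} *)
```

Applied to the real data this would read `{train, test} = splitDataset[dataset]`. Because neighbouring windows overlap heavily, a random split places near-duplicate clips in both parts, which is worth keeping in mind when reading the reported accuracy.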

Remove the last three layers from the pre-trained net:

In[15]:=

Out[15]=

Create a new net composed of the pre-trained net followed by a linear layer, an aggregation layer and a softmax layer:

In[16]:=

In[17]:=

newNet = NetJoin[
   tempNet,
   NetChain[{
     "Linear" -> fc,
     "GlobalAveragePool" -> AggregationLayer[Mean, ;; -2],
     "SoftMax" -> SoftmaxLayer[]
     }],
   "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]
   ];

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[18]:=

trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2];

Perfect accuracy is obtained on the test set:

In[19]:=

Out[19]=
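Accuracy here means the fraction of test examples whose predicted class equals the true label. The sketch below spells that out on toy data; the helper `accuracy`, the toy classifier and the toy examples are all illustrative assumptions.

```wolfram
(* Hypothetical helper: fraction of input -> label rules that the classifier gets right *)
accuracy[classifier_, data_List] :=
  N@Mean@Boole@MapThread[SameQ, {classifier /@ Keys[data], Values[data]}]

(* Toy stand-ins for a trained net and a held-out set *)
toyClassifier = If[# > 0, "positive", "negative"] &;
toyLabeled = {1 -> "positive", -2 -> "negative", 3 -> "positive", -4 -> "positive"};

accuracy[toyClassifier, toyLabeled]   (* 0.75 *)

(* With the trained net, the built-in route would be
   NetMeasurements[trainedNet, test, "Accuracy"],
   assuming the held-out set is stored in a variable called test *)
```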

Net information

Inspect the number of parameters of all arrays in the net:

In[20]:=

Out[20]=

Obtain the total number of parameters:

In[21]:=

Out[21]=

Obtain the layer type counts:

In[22]:=

Out[22]=

Display the summary graphic:

In[23]:=

Out[23]=
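The four inspections above correspond to standard `Information` properties of a net. The sketch below tries them on a small stand-in net so that it runs without downloading the model; `toyNet` is an assumption for illustration, and the same properties can be queried on the SlowFast net.

```wolfram
(* Small stand-in net: 4 inputs -> 3 hidden units -> 2 classes *)
toyNet = NetInitialize@NetChain[
    {LinearLayer[3], Ramp, LinearLayer[2], SoftmaxLayer[]},
    "Input" -> 4];

(* Number of parameters in each array *)
Information[toyNet, "ArraysElementCounts"]

(* Total number of parameters: (4*3 + 3) + (3*2 + 2) = 23 *)
total = Information[toyNet, "ArraysTotalElementCount"]

(* How many layers of each type the net contains *)
layerCounts = Information[toyNet, "LayerTypeCounts"]

(* Summary graphic of the architecture *)
Information[toyNet, "SummaryGraphic"]
```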

Requirements

Wolfram Language 13.1 (June 2022) or above

Resource History

Date Created: 12 July 2023

Reference

C. Feichtenhofer, H. Fan, J. Malik, K. He, "SlowFast Networks for Video Recognition," arXiv:1812.03982v3 (2019)
Available from: https://github.com/facebookresearch/pytorchvideo
Rights: Apache License 2.0