Wolfram Research

X3D Video Action Classification Trained on Kinetics-400 Data

Identify the main action in a video

X3D is a family of efficient video networks focused on the low-computation regime of the computation/accuracy tradeoff for video recognition. The main idea is to progressively expand a tiny base 2D image architecture into a spatiotemporal one along multiple axes: temporal duration, frame rate, spatial resolution, network width, bottleneck width and depth. X3D achieves state-of-the-art performance while requiring 4.8x fewer multiply-adds and 5.5x fewer parameters for accuracy similar to previous work.
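The intuition behind the expansion can be illustrated with a rough cost estimate: multiply-adds scale roughly linearly with the number of frames and with depth, and quadratically with spatial resolution and channel width. The following Python sketch is purely illustrative; the expansion factors shown are hypothetical, not the actual X3D expansion schedule:

```python
def expanded_cost(base_cost, frames=1.0, resolution=1.0, width=1.0, depth=1.0):
    """Rough multiply-add cost after expanding a tiny base 2D model along
    several axes: linear in frames and depth, quadratic in spatial
    resolution and channel width. Illustrative approximation only."""
    return base_cost * frames * depth * resolution**2 * width**2

base = 1.0  # cost of the base 2D architecture, in arbitrary units
# Hypothetical expansion: 4x more frames, 1.5x resolution, 2x width
print(expanded_cost(base, frames=4, resolution=1.5, width=2))  # 36.0
```

The expansion procedure in the paper greedily picks, at each step, the single axis whose expansion gives the best accuracy for the added cost.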

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["X3D Video Action Classification Trained on Kinetics-400 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific architecture. Inspect the available parameters:

In[2]:=
NetModel["X3D Video Action Classification Trained on Kinetics-400 Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the architecture:

In[3]:=
NetModel[{"X3D Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "S"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"X3D Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "XS"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Classify a video:

In[5]:=
video = ResourceData["Sample Video: Practicing Yoga"]
Out[5]=
In[6]:=
pred = NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"][
  video]
Out[6]=

Obtain the top five probabilities predicted by the net:

In[7]:=
NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"][video, {"TopProbabilities", 5}]
Out[7]=

Feature extraction

Remove the last three layers of the trained net so that the net produces a vector representation of a video:

In[8]:=
extractor = NetDrop[NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"], -3]
Out[8]=

Get a set of videos:

In[9]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[10]:=
FeatureSpacePlot[videos, FeatureExtractor -> (extractor[#] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 70, ImageSize -> 400, Method -> "TSNE"]
Out[10]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[11]:=
videos = <|
   VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park", VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"|>;
In[12]:=
frameRate = 32;
frameNumber = 2;
maxFrameNumber = frameRate*frameNumber;
In[13]:=
dataset = Join @@ KeyValueMap[
    Table[VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2, {i, 1, Information[#1, "FrameCount"][[1]] - Mod[Information[#1, "FrameCount"][[1]], maxFrameNumber], Round[maxFrameNumber/4]}] &, videos];
In[14]:=
{train, test} = ResourceFunction["TrainTestSplit"][RandomSample[dataset], "TrainingSetSize" -> 0.7];
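The clip-extraction step above slides a 64-frame window (frameRate*frameNumber) over each video with a stride of a quarter window. The start-index logic of that Table iteration can be sketched in Python (the 200-frame count is a hypothetical example; indices are 1-based, as in the Wolfram code):

```python
def clip_windows(frame_count, window=64, stride=16):
    """Start/end frame pairs (1-based, inclusive) for overlapping clips,
    mirroring the Table bounds above: the start index i runs from 1 up to
    frame_count minus its remainder modulo the window size, in steps of
    window/4."""
    upper = frame_count - frame_count % window
    return [(i, i + window - 1) for i in range(1, upper + 1, stride)]

# e.g. a hypothetical 200-frame video: the upper bound for starts is 192
print(clip_windows(200)[:3])  # [(1, 64), (17, 80), (33, 96)]
```

The 75% overlap between consecutive clips is a simple way to augment a small video dataset with many training examples.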

Remove the LinearLayer, the SoftmaxLayer and the AggregationLayer from the pre-trained net:

In[15]:=
tempNet = NetDrop[NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"], -3]
Out[15]=

Create a new net composed of the pre-trained net followed by a LinearLayer, a SoftmaxLayer and an AggregationLayer:

In[16]:=
fc = NetMapThreadOperator[
   NetMapThreadOperator[LinearLayer[{2}, "Input" -> {2048}], 2], 1, "Input" -> {1, 2, 2, 2048}];
In[17]:=
newNet = NetAppend[
  tempNet, {"Linear" -> fc, "SoftMax" -> SoftmaxLayer[], "GlobalAveragePool" -> AggregationLayer[Mean, ;; -2]}, "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]]
Out[17]=
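The appended head maps each 2048-dimensional feature vector in the 1×2×2 spatiotemporal grid to two class scores, applies a softmax per grid cell, and then averages the resulting probability grid over all but the class dimension. A NumPy sketch of that computation (the random weights are stand-ins for the trained LinearLayer):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1, 2, 2, 2048))  # extractor output grid
W = rng.standard_normal((2048, 2)) * 0.01        # stand-in linear weights
b = np.zeros(2)

logits = features @ W + b                        # shape (1, 2, 2, 2)
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)    # softmax per grid cell
class_probs = probs.mean(axis=(0, 1, 2))         # average over the grid
print(class_probs.shape)                         # (2,)
```

Because each grid cell's probabilities sum to 1, their mean also sums to 1, so the pooled output is still a valid class distribution.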

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[18]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2]
Out[18]=

Perfect accuracy is obtained on the test set:

In[19]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[19]=

Net information

Inspect the number of parameters of all arrays in the net:

In[20]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[20]=

Obtain the total number of parameters:

In[21]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[21]=

Obtain the layer type counts:

In[22]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[22]=

Display the summary graphic:

In[23]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[23]=

Resource History

Reference