Wolfram Research

X3D Video Action Classification Trained on Kinetics-400 Data

Identify the main action in a video

X3D is a family of efficient video networks focused on the low-computation regime of the computation/accuracy tradeoff for video recognition. The main idea is to progressively expand a tiny base 2D image architecture into a spatiotemporal one along multiple axes: temporal duration, frame rate, spatial resolution, network width, bottleneck width and depth. X3D achieves state-of-the-art performance while requiring 4.8x fewer multiply-adds and 5.5x fewer parameters for accuracy similar to previous work.
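The intuition behind the expansion can be illustrated with a rough cost estimate: multiply-adds scale roughly linearly with the number of frames and with depth, and quadratically with spatial resolution and channel width. The following Python sketch is purely illustrative; the expansion factors shown are hypothetical, not the actual X3D expansion schedule:

```python
def expanded_cost(base_cost, frames=1.0, resolution=1.0, width=1.0, depth=1.0):
    """Rough multiply-add cost after expanding a tiny base 2D model along
    several axes: linear in frames and depth, quadratic in spatial
    resolution and channel width. Illustrative approximation only."""
    return base_cost * frames * depth * resolution**2 * width**2

base = 1.0  # cost of the base 2D architecture, in arbitrary units
# Hypothetical expansion: 4x more frames, 1.5x resolution, 2x width
print(expanded_cost(base, frames=4, resolution=1.5, width=2))  # 36.0
```

The expansion procedure in the paper greedily picks, at each step, the single axis whose expansion gives the best accuracy for the added cost.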

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["X3D Video Action Classification Trained on Kinetics-400 Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific architecture. Inspect the available parameters:

In[2]:=
NetModel["X3D Video Action Classification Trained on Kinetics-400 Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the architecture:

In[3]:=
NetModel[{"X3D Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "S"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"X3D Video Action Classification Trained on Kinetics-400 Data", "Architecture" -> "XS"}, "UninitializedEvaluationNet"]
Out[4]=

Basic usage

Classify a video:

In[5]:=
video = ResourceData["Sample Video: Practicing Yoga"]
Out[5]=
In[6]:=
pred = NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"][
  video]
Out[6]=

Obtain the top five probabilities predicted by the net:

In[7]:=
NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"][video, {"TopProbabilities", 5}]
Out[7]=

Feature extraction

Remove the last three layers of the trained net so that the net produces a vector representation of a video:

In[8]:=
extractor = NetDrop[NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"], -3]
Out[8]=

Get a set of videos:

In[9]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[10]:=
FeatureSpacePlot[videos, FeatureExtractor -> (extractor[#] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 70, ImageSize -> 400, Method -> "TSNE"]
Out[10]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart videos from two action classes not present in the dataset. Create a test set and a training set:

In[11]:=
videos = <|
   VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park", VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"|>;
In[12]:=
frameRate = 32;
frameNumber = 2;
maxFrameNumber = frameRate*frameNumber;
In[13]:=
dataset = Join @@ KeyValueMap[
    Table[VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2, {i, 1, Information[#1, "FrameCount"][[1]] - Mod[Information[#1, "FrameCount"][[1]], maxFrameNumber], Round[maxFrameNumber/4]}] &, videos];
In[14]:=
{train, test} = ResourceFunction["TrainTestSplit"][RandomSample[dataset], "TrainingSetSize" -> 0.7];
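The clip-extraction step above slides a 64-frame window (frameRate*frameNumber) over each video with a stride of a quarter window. The start-index logic of that Table iteration can be sketched in Python (the 200-frame count is a hypothetical example; indices are 1-based, as in the Wolfram code):

```python
def clip_windows(frame_count, window=64, stride=16):
    """Start/end frame pairs (1-based, inclusive) for overlapping clips,
    mirroring the Table bounds above: the start index i runs from 1 up to
    frame_count minus its remainder modulo the window size, in steps of
    window/4."""
    upper = frame_count - frame_count % window
    return [(i, i + window - 1) for i in range(1, upper + 1, stride)]

# e.g. a hypothetical 200-frame video: the upper bound for starts is 192
print(clip_windows(200)[:3])  # [(1, 64), (17, 80), (33, 96)]
```

The 75% overlap between consecutive clips is a simple way to augment a small video dataset with many training examples.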

Remove the LinearLayer, the SoftmaxLayer and the AggregationLayer from the pre-trained net:

In[15]:=
tempNet = NetDrop[NetModel[
   "X3D Video Action Classification Trained on Kinetics-400 Data"], -3]
Out[15]=

Create a new net composed of the pre-trained net followed by a LinearLayer, a SoftmaxLayer and an AggregationLayer:

In[16]:=
fc = NetMapThreadOperator[
   NetMapThreadOperator[LinearLayer[{2}, "Input" -> {2048}], 2], 1, "Input" -> {1, 2, 2, 2048}];
In[17]:=
newNet = NetAppend[
  tempNet, {"Linear" -> fc, "SoftMax" -> SoftmaxLayer[], "GlobalAveragePool" -> AggregationLayer[Mean, ;; -2]}, "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]]
Out[17]=
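The appended head maps each 2048-dimensional feature vector in the 1×2×2 spatiotemporal grid to two class scores, applies a softmax per grid cell, and then averages the resulting probability grid over all but the class dimension. A NumPy sketch of that computation (the random weights are stand-ins for the trained LinearLayer):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1, 2, 2, 2048))  # extractor output grid
W = rng.standard_normal((2048, 2)) * 0.01        # stand-in linear weights
b = np.zeros(2)

logits = features @ W + b                        # shape (1, 2, 2, 2)
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)    # softmax per grid cell
class_probs = probs.mean(axis=(0, 1, 2))         # average over the grid
print(class_probs.shape)                         # (2,)
```

Because each grid cell's probabilities sum to 1, their mean also sums to 1, so the pooled output is still a valid class distribution.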

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[18]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2]
Out[18]=

Perfect accuracy is obtained on the test set:

In[19]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[19]=

Net information

Inspect the number of parameters of all arrays in the net:

In[20]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "ArraysElementCounts"]
Out[20]=

Obtain the total number of parameters:

In[21]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "ArraysTotalElementCount"]
Out[21]=

Obtain the layer type counts:

In[22]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "LayerTypeCounts"]
Out[22]=

Display the summary graphic:

In[23]:=
Information[
 NetModel[
  "X3D Video Action Classification Trained on Kinetics-400 Data"], "SummaryGraphic"]
Out[23]=

Resource History

Reference