Wolfram Research

3D-Inflated ResNet-50 Trained on Kinetics 400 Data

Identify the main action in a video

This model uses 3D inflation to bootstrap the kernels of a 3D convolutional network from a 2D ResNet-50 architecture, so that video applications can build directly on years of progress in image-domain architectures. The weights of the 3D convolutional filters were initialized by replicating the 2D filters of ResNet-50 along the time dimension, which can be seen as implicit pre-training on a video dataset of static ImageNet images replicated across time.
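
The inflation step described above can be written compactly. As a sketch, let $W^{2D}$ be one 2D filter, $T$ the temporal extent of the inflated kernel and $x$ a static frame repeated $T$ times (all three symbols are introduced here for illustration only). The version below includes the $1/T$ rescaling that is commonly paired with this kind of inflation; whether this particular model applies that rescaling is an assumption, not something stated in the description.

```latex
% Inflation: copy the 2D filter along the time axis (1/T rescaling assumed)
\[
  W^{3D}_{t,i,j} = \frac{1}{T}\, W^{2D}_{i,j}, \qquad t = 1, \dots, T
\]
% Response of the inflated filter on a static frame x repeated T times
\[
  \sum_{t=1}^{T} \sum_{i,j} W^{3D}_{t,i,j}\, x_{i,j}
  = \sum_{i,j} W^{2D}_{i,j}\, x_{i,j}
\]
```

Under that assumption, the inflated filter responds to a temporally constant video exactly as the 2D filter responds to a single frame, which is what makes the ImageNet weights a sensible starting point.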

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"]
Out[1]=

Basic usage

Classify a video:

In[2]:=
video = ResourceData["Sample Video: Reading a Book"]
Out[2]=
In[3]:=
pred = NetModel["3D-Inflated ResNet-50 Trained on Kinetics 400 Data"][
  video]
Out[3]=

Obtain the five most likely classes and their probabilities:

In[4]:=
NetModel[
  "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"][video, {"TopProbabilities", 5}]
Out[4]=

Feature extraction

Remove the last two layers of the trained net so that the net produces a vector representation of a video:

In[5]:=
extractor = NetDrop[NetModel[
   "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], -2]
Out[5]=
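
As a quick check of what the extractor returns, the following sketch applies it to a single video and inspects the shape of the result. It assumes that video from the basic usage example is still defined. A vector of length 2048 would be consistent with the LinearLayer input size used in the transfer learning section, but the output is not reproduced here.

```wolfram
(* Apply the truncated net to one video and inspect the shape of the result *)
features = extractor[video];
Dimensions[features]
```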

Get a set of videos:

In[6]:=
videos = Join[ResourceData["Cheerleading Video Samples"], ResourceData["Tooth Brushing Video Samples"]];

Visualize the features of a set of videos:

In[7]:=
FeatureSpacePlot[videos, FeatureExtractor -> (extractor[{#, #}] &), LabelingFunction -> (Placed[Thumbnail@VideoFrameList[#1, 1][[1]], Center] &), LabelingSize -> 50, ImageSize -> 500, Method -> "TSNE"]
Out[7]=

Transfer learning

Use the pre-trained model to build a classifier that tells apart videos from two action classes not present in the Kinetics 400 dataset. Create a test set and a training set:

In[8]:=
videos = <|
   VideoTrim[ResourceData["Sample Video: Wild Ducks in the Park"], 10] -> "Wild Ducks in the Park",
   VideoTrim[ResourceData["Sample Video: Freezing Bubble"], 10] -> "Freezing Bubble"|>;
In[9]:=
frameRate = 8;
sampleRate = 8;
maxFrameNumber = frameRate*sampleRate;
In[10]:=
dataset = Join @@ KeyValueMap[
    Table[
      VideoTrim[#1, {Quantity[i, "Frames"], Quantity[i + maxFrameNumber - 1, "Frames"]}] -> #2,
      {i, 1, Information[#1, "FrameCount"][[1]] - Mod[Information[#1, "FrameCount"][[1]], maxFrameNumber], Round[maxFrameNumber/4]}
    ] &, videos];
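
The clip extraction above is a sliding window over frame indices. The sketch below reproduces only the index arithmetic for a hypothetical video of 300 frames, so that the window length, stride and overlap are explicit. The names exampleFrameCount, windowLength, stride, lastStart and clips are introduced here for illustration and do not appear in the original example.

```wolfram
(* Hypothetical frame count, chosen only for illustration *)
exampleFrameCount = 300;

(* Window length and stride, matching frameRate*sampleRate and Round[maxFrameNumber/4] *)
windowLength = 8*8;              (* 64 frames per clip *)
stride = Round[windowLength/4];  (* 16 frames between clip starts *)

(* Largest start index allowed by the iterator: 300 - Mod[300, 64] = 256 *)
lastStart = exampleFrameCount - Mod[exampleFrameCount, windowLength];

(* Start and end frame of every clip *)
clips = Map[{#, # + windowLength - 1} &, Range[1, lastStart, stride]];
Length[clips]  (* 16 *)
First[clips]   (* {1, 64} *)
Last[clips]    (* {241, 304} *)
```

Consecutive clips share 64 - 16 = 48 frames. In this hypothetical case the last clips request frames past the end of the video (304 > 300); how VideoTrim treats such ranges is not verified here.
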
In[11]:=
{train, test} = ResourceFunction["TrainTestSplit"][RandomSample[dataset], "TrainingSetSize" -> 0.7];

Remove the last two layers from the pre-trained net:

In[12]:=
tempNet = NetDrop[NetModel[
   "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], -2]
Out[12]=

Create a new net composed of the pre-trained net followed by a linear layer and a softmax layer:

In[13]:=
newNet = NetJoin[
   tempNet,
   NetChain[{"Linear" -> LinearLayer[2, "Input" -> {2048}], "SoftMax" -> SoftmaxLayer[]}],
   "Output" -> NetDecoder[{"Class", {"Freezing Bubble", "Wild Ducks in the Park"}}]];

Train on the dataset, freezing all the weights except for those in the "Linear" layer (use TargetDevice -> "GPU" for training on a GPU):

In[14]:=
trainedNet = NetTrain[newNet, train, LearningRateMultipliers -> {"Linear" -> 1, _ -> 0}, ValidationSet -> Scaled[0.1], MaxTrainingRounds -> 2]
Out[14]=

Perfect accuracy is obtained on the test set:

In[15]:=
ClassifierMeasurements[trainedNet, test, "Accuracy"]
Out[15]=
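
A per-class view can complement the single accuracy figure. As a sketch, assuming trainedNet and test from the cells above are still defined, the same measurement function can plot a confusion matrix. The result is not reproduced here.

```wolfram
(* Per-class breakdown of the predictions on the test set *)
ClassifierMeasurements[trainedNet, test, "ConfusionMatrixPlot"]
```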

Net information

Inspect the number of parameters of all arrays in the net:

In[16]:=
Information[
 NetModel[
  "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], "ArraysElementCounts"]
Out[16]=

Obtain the total number of parameters:

In[17]:=
Information[
 NetModel[
  "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], "ArraysTotalElementCount"]
Out[17]=

Obtain the layer type counts:

In[18]:=
Information[
 NetModel[
  "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], "LayerTypeCounts"]
Out[18]=

Display the summary graphic:

In[19]:=
Information[
 NetModel[
  "3D-Inflated ResNet-50 Trained on Kinetics 400 Data"], "SummaryGraphic"]
Out[19]=

Resource History

Reference