Start by deciding how to transform a recording into something that a neural network can use. This example uses the "AudioMFCC" net encoder, which splits the signal into overlapping partitions and processes each partition to reduce its dimension while preserving the information that is important for understanding the signal:
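To make the idea of overlapping partitions concrete, here is a minimal sketch on a toy list rather than real audio. The signal, the partition length of 4, and the offset of 2 are hypothetical values chosen for illustration and are not the encoder's settings. The per-partition mean is only a stand-in for the actual MFCC computation, which keeps several coefficients for each partition.

```wolfram
(* Hypothetical toy signal: 10 samples standing in for audio data *)
signal = Range[10];

(* Partitions of 4 samples, each starting 2 samples after the previous one,
   so neighboring partitions share 2 samples *)
frames = Partition[signal, 4, 2]
(* {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}} *)

(* Stand-in for the per-partition processing: reduce each partition
   to a single number (the encoder computes MFCCs instead) *)
Mean /@ frames
(* {5/2, 9/2, 13/2, 17/2} *)
```

Each partition is summarized independently, so the result is a shorter sequence with one entry per partition. The encoder follows the same pattern, with a vector of coefficients in place of the single number used here.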