Use the pre-trained speech recognition network from the Wolfram Neural Net Repository to compute the probability that a recording contains a specific word. Details of the network are given on its repository page. First, use the net to compute the probability of each character at every time step.
You can partition the probabilities computed by the net to inspect subsets of the signal. The CTC loss can then be computed for every candidate transcription of each partition. Because the CTC loss is the negative log-likelihood of a transcription, negating it gives the log-likelihood that a given candidate is the transcription of that partition.
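The scoring step can be illustrated with a minimal Python sketch that does not depend on the Wolfram network. It assumes a small made-up probability matrix (one row per time step, one column per class, with class 0 as the CTC blank) standing in for the net's output. The names `ctc_log_likelihood`, `partition`, `score_partitions`, `frames` and `classes` are hypothetical and chosen for illustration only. The sketch uses the standard CTC forward recursion in log space, which may differ in detail from what the built-in loss layer does internally.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_log_likelihood(probs, labels, blank=0):
    """Log-probability that `labels` is the CTC transcription of `probs`.

    probs  : list of frames, each a list of class probabilities
    labels : list of class indices (no blanks)
    The CTC loss for the same input is the negative of this value.
    """
    # Interleave blanks: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    def logp(t, s):
        p = probs[t][ext[s]]
        return math.log(p) if p > 0 else NEG_INF

    # Forward variables for the first frame
    alpha = [NEG_INF] * size
    alpha[0] = logp(0, 0)
    if size > 1:
        alpha[1] = logp(0, 1)

    for t in range(1, len(probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(terms) + logp(t, s)
        alpha = new

    # A valid path ends on the last label or on the trailing blank
    return logsumexp(alpha[-2:]) if size > 1 else alpha[-1]

def partition(frames, size):
    """Split frames into consecutive chunks (the last one may be shorter)."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def score_partitions(probs, size, candidates, classes):
    """For each chunk, map every candidate string to its log-likelihood."""
    index = {c: i for i, c in enumerate(classes)}
    return [
        {w: ctc_log_likelihood(chunk, [index[ch] for ch in w]) for w in candidates}
        for chunk in partition(probs, size)
    ]

# Made-up data: 4 time steps, classes "_" (blank), "a", "b"
classes = ["_", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.1, 0.2],
]

scores = score_partitions(frames, 2, ["a", "b"], classes)
best = [max(s, key=s.get) for s in scores]
print(best)
```

Each entry of `scores` corresponds to one partition, and the candidate with the largest log-likelihood in an entry is the most probable transcription of that partition among the candidates supplied.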