Details and Options
The recognized formats for data are listed in the left-hand column below. The right-hand column shows examples of symbols that accept the same data format; an illustration of each format follows the table. Note that Dataset is currently not supported.
List: {…} | EstimatedDistribution, LinearModelFit, GeneralizedLinearModelFit, NonlinearModelFit, Predict, Classify, NetTrain, SpatialEstimate |
Rule of lists: {…} → {…} | Predict, Classify, NetTrain, SpatialEstimate |
Association of lists: <|key1 → {…}, …|> | NetTrain |
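For illustration, a small regression dataset could be presented in each of these formats roughly as follows (a sketch; the values and the NetTrain port names are hypothetical):

    dataList  = {{1, 1.2}, {2, 1.9}, {3, 3.1}};                          (* list *)
    dataRule  = {1, 2, 3} -> {1.2, 1.9, 3.1};                            (* rule of lists: inputs -> outputs *)
    dataAssoc = <|"Input" -> {1, 2, 3}, "Output" -> {1.2, 1.9, 3.1}|>;   (* association of lists *)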
ResourceFunction["CrossValidateModel"] divides the data repeatedly into randomly chosen training and validation sets. Each time, fitfun is applied to the training set. The returned model is then validated against the validation set. The returned result is of the form {<|"FittedModel"→ model1, "ValidationResult" → result1|>, …}.
For any distribution that satisfies DistributionParameterQ, the syntax for fitting distributions is the same as for EstimatedDistribution, which is the function used internally to fit the data. The values returned by EstimatedDistribution are found under the "FittedModel" keys of the output, and each "ValidationResult" gives the average negative LogLikelihood on the validation set, which is the default loss function for distribution fits.
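For example, a normal distribution with symbolic parameters can be cross-validated just as it would be fitted with EstimatedDistribution (a sketch with synthetic data; the parameter symbols m and s are arbitrary):

    data = RandomVariate[NormalDistribution[2, 1], 500];
    ResourceFunction["CrossValidateModel"][data, NormalDistribution[m, s]]
    (* each "FittedModel" is a fitted distribution, e.g. NormalDistribution[1.98, 1.03] *)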
If the user does not specify a validation function explicitly with the "ValidationFunction" option, ResourceFunction["CrossValidateModel"] tries to apply a sensible validation loss metric automatically to the results returned by fitfun (see the next table). If no usable validation method is found, the validation set itself is returned so that the user can perform their own validation afterward.
The following table lists the types of models that are recognized, the functions that produce such models, and the default loss function applied to each type of model:
An explicit validation function can be provided with the "ValidationFunction" option. This function takes the fit result as its first argument and a validation set as its second. If multiple models are specified as an Association in the second argument of ResourceFunction["CrossValidateModel"], a different validation function for each model can be specified by passing an Association to the "ValidationFunction" option.
The Method option can be used to configure how the training and validation sets are generated. The following types of sampling are supported:
"KFold" (default) | splits the dataset into k subsets (default: k=5) and trains the model k times, using each partition as validation set once |
"LeaveOneOut" | fit the data as many times as there are elements in the dataset, using each element for validation once |
"RandomSubSampling" | split the dataset randomly into training and validation sets (default: 80% / 20%) repeatedly (default: five times) or define a custom sampling function |
"BootStrap" | use bootstrap samples (generated with RandomChoice) to fit the model repeatedly without validation |
The default Method setting uses k-fold validation with five folds: the dataset is randomly split into five partitions, each of which is used as the validation set once. The model is thus trained five times on 4/5 of the dataset and tested on the remaining 1/5. The "KFold" method has two sub-options; an example follows the table:
"Folds" | 5 | number of partitions in which to split the dataset |
"Runs" | 1 | number of times to perform k-fold validation (each time with a new random partitioning of the data) |
The "LeaveOneOut" method, also known as the jackknife, is essentially k-fold validation in which the number of folds equals the number of data points. Since it can be quite computationally expensive, it is usually a good idea to use parallelization with this method. Like the "KFold" method, it has a "Runs" sub-option, but for deterministic fitting procedures such as EstimatedDistribution and LinearModelFit there is no value in performing more than one run, since each run yields exactly the same results (up to a random permutation).
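A sketch of leave-one-out validation, parallelized with the "ParallelQ" option described below:

    ResourceFunction["CrossValidateModel"][data, NormalDistribution[m, s],
      Method -> "LeaveOneOut", "ParallelQ" -> True]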
The method "RandomSubSampling" splits the dataset into training/validation sets randomly and has the following sub-options:
"Runs" | 1 | number of times to resample the data into training/validation sets |
ValidationSet | Scaled[1/5] | number of samples to use for validation. When specified as Scaled[f], a fraction f of the dataset will be used for validation |
"SamplingFunction" | Automatic | function that specifies how to sub-sample the data |
For the "SamplingFunction" sub-option, the function fun[nData, nVal] should return an Association with the keys "TrainingSet" and "ValidationSet". Each key should hold a list of integers indicating indices into the dataset.
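As an illustration, the following hypothetical sampling function (the name sampler is arbitrary, and fitfun stands for any fitting function) reproduces plain sub-sampling without replacement:

    sampler[nData_, nVal_] := With[{perm = RandomSample[Range[nData]]},
      <|"TrainingSet" -> Drop[perm, nVal], "ValidationSet" -> Take[perm, nVal]|>];
    ResourceFunction["CrossValidateModel"][data, fitfun,
      Method -> {"RandomSubSampling", "SamplingFunction" -> sampler, "Runs" -> 10}]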
Bootstrap sampling is useful for getting a sense of the range of models that can be fitted to the data. In a bootstrap sample, the original dataset is sampled with replacement (using RandomChoice), so a bootstrap sample can even be larger than the original dataset. No validation sets are generated when using bootstrap sampling. The following sub-options are supported; an example follows the table:
"Runs" | 5 | number of bootstrap samples generated |
"BootStrapSize" | Scaled[1] | number of elements to generate in each bootstrap sample; when specified as Scaled[f], a fraction f of the dataset will be used |
The "ValidationFunction" option can be used to specify a custom function that gets applied to the fit result and the validation data.
The "ParallelQ" option can be used to parallelize the computation using ParallelTable. Sub-options for ParallelTable can be specified as "ParallelQ" → {True, opts…}.