Details and Options
A data pipeline is a sequence of operations that transforms data. The result of each operation is tested for validity, and if any intermediate result is invalid,
ResourceFunction["DataPipeline"] bails out and throws an error. In its simplest form,
ResourceFunction["DataPipeline"] is a sequence of operations like
NetChain, but it can also be used to implement a computational network of operations like in
NetGraph. Unlike
NetChain and
NetGraph,
ResourceFunction["DataPipeline"] works with general WL expressions.
All keys in a computational network must be strings.
ResourceFunction["DataPipeline"] takes the following options:
"FailureDetection" | Automatic | how to determine if a result represents a failure |
"CatchMessages" | True | whether to abort if a message is generated |
Any expression for which
FailureQ is
True will always be considered a failure for the whole pipeline and will be returned to the top level, regardless of the setting of the
"FailureDetection" option.
The default value
"FailureDetection"→Automatic also aborts the pipeline if an operation returns
Missing[…],
Indeterminate,
Undefined,
$Canceled,
$Aborted or a type of infinity. With the setting
"FailureDetection"→None, these values are not treated as failures. You can also specify your own test function
"FailureDetection"→fun, which uses
fun to determine if a result represents a failure or not. If the specified function
fun returns
True for any input, intermediate result or final result,
ResourceFunction["DataPipeline"] throws an error.
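As a rough illustration of the default test, the following predicate approximates the values listed above using only built-in functions. The name automaticFailureQ is hypothetical, and this is a sketch of the described behavior, not the code used inside ResourceFunction["DataPipeline"].

```wl
(* Hypothetical helper: approximates the "FailureDetection" -> Automatic test *)
automaticFailureQ[expr_] := Or[
  FailureQ[expr],   (* Failure[...] objects and $Failed *)
  MissingQ[expr],   (* Missing[...] *)
  MatchQ[expr, Indeterminate | Undefined | $Canceled | $Aborted | _DirectedInfinity]
]

automaticFailureQ /@ {Missing["NotAvailable"], Infinity, 3, $Failed}
(* {True, True, False, True} *)
```

Infinity, -Infinity and ComplexInfinity all have the head DirectedInfinity, which is why a single pattern covers every type of infinity here.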
With the default setting
"CatchMessages"→True, the pipeline will be aborted if a message is thrown. Messages are caught by using
ConfirmQuiet.
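The effect of ConfirmQuiet can be sketched with a small stand-in for a linear pipeline. The name runSteps is hypothetical and the code only assumes the built-in functions Enclose, ConfirmQuiet and Fold; it is not the actual implementation.

```wl
(* Hypothetical stand-in for a linear pipeline: apply each operator in turn
   and stop at the first step that generates a message *)
runSteps[ops_List][input_] := Enclose[
  Fold[ConfirmQuiet[#2[#1]] &, input, ops]
]

(* No messages: both steps run and the final value is returned *)
runSteps[{# + 1 &, 1/(# - 2) &}][2]
(* 1 *)

(* The second step evaluates 1/0, which generates a message,
   so Enclose returns a Failure object instead of ComplexInfinity *)
FailureQ[runSteps[{# + 1 &, 1/(# - 2) &}][1]]
(* True *)
```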
The outputs of a computational network are those vertices that do not have any edges directed towards any other vertices (i.e., the vertices with
VertexOutDegree equal to 0). If a computational network has a single output vertex, it will return the value computed at that vertex. If it has multiple output vertices, it will return an
Association with key-value pairs of the computed values at those vertices.
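To see which vertices count as outputs under this rule, the edge list can be inspected with built-in graph functions. The edge list below is a made-up example, not one taken from the function's documentation.

```wl
(* Made-up edge specification with two vertices that have no outgoing edges *)
edges = {"Input" -> "a", "Input" -> "b", "a" -> "c"};
g = Graph[edges];

(* Output vertices are those with VertexOutDegree equal to 0 *)
outputs = Pick[VertexList[g], VertexOutDegree[g], 0]
(* {"b", "c"} *)
```

A network with these edges has two output vertices, so by the rule above it would return an Association with the keys "b" and "c".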
If a computational network has a vertex called "Input", the input to the network will always be supplied to that vertex.
If no "Input" vertex is specified, the input data must be an
Association. The keys in that
Association will specify the input vertices in the network. Keys that are absent from the pipeline, or that have a nonzero in-degree, will be ignored.
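The filtering of input keys can be mimicked with built-in functions. This is a sketch of the rule as stated, using a made-up network and input; it makes no claim about how ResourceFunction["DataPipeline"] performs the filtering internally.

```wl
(* Made-up network: "x" and "y" feed into "sum" *)
g = Graph[{"x" -> "sum", "y" -> "sum"}];

(* Candidate input vertices are those with in-degree zero *)
sources = Pick[VertexList[g], VertexInDegree[g], 0];

(* "sum" has a nonzero in-degree and "z" is not in the network, so both are dropped *)
KeyTake[<|"x" -> 1, "y" -> 2, "sum" -> 10, "z" -> 5|>, sources]
(* <|"x" -> 1, "y" -> 2|> *)
```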
If multiple edges are directed at a single vertex in a computational network, the input to that vertex will be given as a list of the values computed at the vertices pointing into the target vertex. This is similar to the way
NetGraph works.
Multiple values can be directed as a
List to a vertex in the network either by specifying multiple rules
{…,key1→target,key2→target, …} or by specifying a single rule of the form
{…,{key1,key2,…}→target,…}.
If a key
target does not have an operator associated with it, the operator will be taken to be
Identity.
Chained rules like
key1→key2→key3→… are allowed for the edge specification of a network. Rules of the form
key1→{target1,target2,…} will be flattened out with
Thread.
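The edge forms described above can be explored with built-in functions alone. In this sketch the helper unchain is hypothetical; it only shows one way a chained rule could be split into individual edges and is not taken from the implementation.

```wl
(* A rule with a list of targets is flattened with Thread *)
Thread["key1" -> {"target1", "target2"}]
(* {"key1" -> "target1", "key1" -> "target2"} *)

(* Thread also expands a list of sources into one edge per source *)
Thread[{"key1", "key2"} -> "target"]
(* {"key1" -> "target", "key2" -> "target"} *)

(* Rule is right-associative, so a chained rule is a nested expression *)
FullForm["key1" -> "key2" -> "key3"]
(* Rule["key1", Rule["key2", "key3"]] *)

(* Hypothetical helper: split a chained rule into consecutive edges *)
unchain[a_ -> (b_ -> c_)] := Prepend[unchain[b -> c], a -> b]
unchain[a_ -> b_] := {a -> b}

unchain["key1" -> "key2" -> "key3"]
(* {"key1" -> "key2", "key2" -> "key3"} *)
```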
It is also possible to specify an input for a vertex as
KeyTake[{key1,key2,…}] → target. In this case, data supplied to the
target operator will be in the form of an
Association with the keys
keyi. This is useful when nesting computational networks inside each other.
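The shape of the data that reaches the target operator can be previewed by applying the operator form of KeyTake to an Association. The Association below is a made-up stand-in for values computed at the vertices of a network.

```wl
(* Made-up values computed at three vertices of a network *)
computed = <|"key1" -> 1, "key2" -> 2, "other" -> 3|>;

(* The target receives an Association restricted to the requested keys *)
KeyTake[{"key1", "key2"}][computed]
(* <|"key1" -> 1, "key2" -> 2|> *)
```

Because the result is keyed rather than positional, it has the same form as the Association input expected by a network without an "Input" vertex.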
In computational networks, the computations are evaluated in the order in which the edges are listed.
You can specify default inputs for
ResourceFunction["DataPipeline"] using generator functions. This is useful, for example, if you want the pipeline to automatically pull in data from an external source like a
Databin whenever it is called. In linear pipelines, if the pipeline is called with no arguments, the first function will be evaluated with no arguments to generate the starting value of the pipeline. For computational networks, it is possible to specify multiple default generator functions in the first argument. If the input to
ResourceFunction["DataPipeline"] has keys corresponding to these generator functions, the specified input values will be used. Otherwise, the generator functions will be evaluated to generate the inputs on the fly. See the section Properties and Relations for examples.