Despite being more cumbersome to implement, RDD-based patterns provide more freedom, such as the freedom to implement the actual business logic with Spark operations: joining, filtering, mapping, and so on. These operations can process not only the metadata, but the actual data as well.
Functional patterns use Spark to parallelize the function applications, but these functions cannot see the RDDs. In contrast, RDD-based patterns have RDDs as parameters and return values for the interfaces that should be implemented.
However, there is one important drawback: compiler implementations must actively support incremental compilation. This means that this key Data Processing Library feature is not fully transparent: compilers must cooperate and return additional RDDs containing the information that each pattern requests, so that the library can complete the job and support incremental processing properly. Failing to do so, or returning an RDD that does not match what the pattern specification expects, results in invalid maps being produced when the compiler runs incrementally.
The expected content of the RDD that your functions should return is described in detail in the sections below.
This pattern provides maximum freedom to developers: Spark is directly accessible from the input data RDD. Developers can implement any sort of compilation to return an RDD of the payloads that have to be published. However, this pattern does not support incremental compilation.
Graphical representation of compilation with NonIncrementalCompiler
This is the most general compilation pattern, and has no specific applicability restrictions.
Typical Use Cases
All cases where the developer wants to freely implement a compiler in terms of Spark transformations.
High-level Interface
There is no front-end and back-end distinction. The only function to be implemented is:
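A sketch of how this interface might look in Scala. The trait and method names below are illustrative assumptions, not verbatim from the library; `InKey`, `InMeta`, and `OutKey` stand for the library's partition-key and metadata types referenced elsewhere in this section.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch of the NonIncrementalCompiler interface: a single
// function receives the full input RDD and may apply any Spark
// transformations (join, filter, map, ...) to produce the payloads to
// publish. T is the developer-defined payload type.
trait NonIncrementalCompilerSketch[T] {
  def compile(inData: RDD[(InKey, InMeta)]): RDD[(OutKey, T)]
}
```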
The compiler is stateless. No state is needed, since this pattern does not support incremental compilation by design.
References
TaskBuilder: withNonIncrementalCompiler method
NonIncrementalCompiler: main interface to implement
DepCompiler and IncrementalDepCompiler
These compilation patterns implement a map-reduce compilation, where the reduce function does not actually reduce anything, but simply groups the elements with the same key. This pattern is equivalent to the MapGroup functional compiler.
These two patterns are functionally equivalent. IncrementalDepCompiler is a specialization of DepCompiler that adds logic for more efficient incremental compilation.
This compilation pattern can be used for all cases of MapGroupCompiler.
DepCompiler Logic
Graphical representation of compilation with DepCompiler
Additional Logic Introduced with IncrementalDepCompiler
Additional logic introduced with IncrementalDepCompiler
Typical Use Cases
Use this pattern in cases where the transformation takes into account the input content, such as:
Decoding input partitions and distributing the content among different output partitions, possibly as a function of the content.
Creating output layers with a subset of input data.
Distributing objects from input layers to output layers, shifting levels up or down, possibly as a function of object properties.
Indexing of content, for example by country or by tile.
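As an illustration of the indexing use case, a front-end that groups decoded content by country could be sketched as follows. All helper names (`decode`, `outKeyForCountry`, `encode`) and the `Payload` type are hypothetical assumptions, not part of the library:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: decode each input partition and assign each decoded
// object to an output partition keyed by country. The returned dependency
// graph records which input partitions each output partition depends on,
// as required for incremental compilation.
def compileInSketch(
    inData: RDD[(InKey, InMeta)]
): (RDD[(InKey, OutKey)], RDD[(OutKey, Payload)]) = {
  val decoded = inData.flatMap { case (inKey, meta) =>
    decode(meta).map(obj => (inKey, obj)) // decode: assumed helper
  }
  val byCountry = decoded.map { case (inKey, obj) =>
    (outKeyForCountry(obj.country), (inKey, obj)) // assumed helper
  }
  // Dependency graph: one (input, output) edge per decoded object.
  val depGraph = byCountry.map { case (outKey, (inKey, _)) => (inKey, outKey) }
  val outData  = byCountry.mapValues { case (_, obj) => encode(obj) } // assumed
  (depGraph, outData)
}
```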
High-level Interface
T and C are developer-defined types. The dependency graph or its subsets have the form of RDD[(InKey, OutKey)].
CompileIn (non-incremental compiler front-end):
compileIn(inData: RDD[(InKey, InMeta)]) ⇒ dependency graph and RDD[(OutKey, T)]
CompileIn (additions to run the front-end incrementally):
updateDepGraph(inData: RDD[(InKey, InMeta)], inChanges: RDD[(InKey, InChange)], previous dependency graph) ⇒ updated dependency graph and C for next call
compileIn(inData: RDD[(InKey, InMeta)], subset of the dependency graph, C from previous call*) ⇒ RDD[(OutKey, T)]
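Put together, the front-end interfaces could be sketched in Scala as follows. The trait names are assumptions derived from the signatures above; `InKey`, `InMeta`, `InChange`, and `OutKey` stand for the library's key, metadata, and change types.

```scala
import org.apache.spark.rdd.RDD

// Non-incremental front-end (DepCompiler): a single pass over all input
// that returns both the dependency graph (RDD[(InKey, OutKey)]) and the
// intermediate data of developer-defined type T.
trait DepCompilerFrontEndSketch[T] {
  def compileIn(
      inData: RDD[(InKey, InMeta)]
  ): (RDD[(InKey, OutKey)], RDD[(OutKey, T)])
}

// Incremental front-end additions (IncrementalDepCompiler): first update
// the dependency graph from the input changes, then compile only the
// affected subset, carrying developer-defined state C between the calls.
trait IncrementalDepCompilerFrontEndSketch[T, C] {
  def updateDepGraph(
      inData: RDD[(InKey, InMeta)],
      inChanges: RDD[(InKey, InChange)],
      previousDepGraph: RDD[(InKey, OutKey)]
  ): (RDD[(InKey, OutKey)], C)

  def compileIn(
      inData: RDD[(InKey, InMeta)],
      depGraphSubset: RDD[(InKey, OutKey)],
      state: C
  ): RDD[(OutKey, T)]
}
```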
In DepCompiler, compileIn always runs on the whole set of input catalogs, even in incremental mode. compileOut runs incrementally, on the subset of output partitions that the library identifies as candidates for recompilation.
In IncrementalDepCompiler, compileIn (1st version) runs on the whole set of input catalogs for the non-incremental case. In the incremental case, updateDepGraph and compileIn (2nd version) run instead. Access is provided to both the overall data and the changes. compileOut, as in the previous case, runs only on a subset of output partitions.
References
TaskBuilder: withDepCompiler and withIncrementalDepCompiler methods
DepCompiler: main interface to implement (non-incremental front-end, incremental back-end)
IncrementalDepCompiler: main interface to implement (incremental front-end and back-end)