Implementation Guidelines for Compilers

This topic lists important caveats to consider when implementing a compiler. Following these guidelines reduces the chances of poor performance or incorrect compiler behavior.

Set up and Run the Driver

The Driver controls the distributed processing on Spark. It defines the tasks that Spark executes and is the main entry point to the processing library for developers.

To set up a Driver, developers must implement one of the children of the DriverSetup interface. This is where the code to instantiate the compilers, prepare any broadcast variables, and wire everything together belongs.

It is recommended to use a DriverBuilder for this purpose, by implementing the DriverSetupWithBuilder interface. Alternatively, developers can configure the driver tasks manually by implementing DriverSetupManual.

To help run the pipeline, the library provides the PipelineRunner trait, which implements the Scala main method that parses the command line and supports seamless integration with the Pipeline API.

Scala developers create one Scala object that mixes in PipelineRunner and the appropriate child of DriverSetup. After implementing the abstract methods of the chosen interface, the object can be run from the command line, either through the Pipeline API or manually.

Java developers use the PipelineRunner from the Java bindings. The current implementation does not directly expose the Driver. PipelineRunner is an abstract class with the DriverSetupWithBuilder interface already mixed in, which developers extend.


Ensure Deterministic Processing

Spark relies on the determinism of functions passed to the various RDD transformations, such as filter, map, groupBy, and reduceByKey. These functions may be applied to the same arguments multiple times, for example:

  • when a task fails and is retried
  • when the same RDD partition is calculated more than once, because the RDD is not persisted or because a previously calculated partition was evicted from the cache

To operate properly, Spark requires these functions to behave deterministically, meaning that when functions are applied to the same input parameters, they always return the same result.

Similarly, the Data Processing Library and incremental compilation require data processing to be deterministic: a task should produce exactly the same commit when run multiple times on the same input catalogs at the same input versions. This means that partitions produced and their payloads must be identical.
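The difference can be illustrated with plain Scala, independent of Spark. This is a minimal sketch; the function names are illustrative and not part of the library API. The second function embeds wall-clock state, so a retried task would produce a different payload and break incremental compilation:

```scala
object DeterminismSketch {
  // Deterministic: the same input always yields the same output, so the
  // function can safely be re-applied on task retry or cache eviction.
  def normalizeId(raw: String): String = raw.trim.toLowerCase

  // NON-deterministic: embeds the current time, so re-running the task
  // on unchanged input produces a different result.
  def stampId(raw: String): String =
    s"${raw.trim.toLowerCase}-${System.nanoTime()}"

  def main(args: Array[String]): Unit = {
    // Re-applying the deterministic function yields an identical result.
    println(normalizeId("  Tile-42 ") == normalizeId("  Tile-42 ")) // true
    // stampId, applied twice to the same input, will generally differ.
    println(stampId("  Tile-42 "))
  }
}
```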

Catalogs contain checksums of the payloads. To upload only the payloads that have changed, the processing logic must therefore be deterministic and produce the same output when the input did not change.
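The checksum mechanism is why byte-identical output matters. The sketch below uses SHA-256 purely as a stand-in; the actual checksum algorithm used by catalogs is not specified here. Byte-identical payloads hash identically, so an unchanged payload can be detected and skipped at upload time:

```scala
import java.security.MessageDigest

object PayloadChecksum {
  // Compute a hex-encoded SHA-256 digest of a payload (illustrative
  // stand-in for whatever checksum the catalog actually uses).
  def checksum(payload: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256")
      .digest(payload)
      .map("%02x".format(_))
      .mkString

  def main(args: Array[String]): Unit = {
    val run1 = "partition-payload".getBytes("UTF-8")
    val run2 = "partition-payload".getBytes("UTF-8")
    // Deterministic processing => identical bytes => identical checksum
    // => the payload is recognized as unchanged and not re-uploaded.
    println(checksum(run1) == checksum(run2)) // true
  }
}
```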

However, many Scala containers do not promise deterministic ordering of their elements. For example, Seq does, but Iterable, Map, and Set do not. Code processing these containers should not rely on the order of elements and should produce the same result regardless of that order.

The solution to this challenge is implementation specific, but usually involves stable sorting of the container elements or applying a commutative and associative operation, such as a sum.
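Both approaches can be sketched in plain Scala; the object and method names are illustrative only. Sorting fixes an order before producing order-sensitive output, while a commutative and associative fold is order-insensitive by construction:

```scala
object OrderIndependence {
  // Set iteration order is an implementation detail; joining the raw
  // iteration order into a string would be non-deterministic.
  // A stable sort fixes the order first.
  def deterministicJoin(values: Set[String]): String =
    values.toSeq.sorted.mkString(",")

  // An integer sum is commutative and associative, so the iteration
  // order of the Map's values cannot affect the result.
  def total(counts: Map[String, Int]): Int =
    counts.values.sum

  def main(args: Array[String]): Unit = {
    println(deterministicJoin(Set("b", "c", "a"))) // a,b,c
    println(total(Map("x" -> 1, "y" -> 2)))        // 3
  }
}
```

Note that the sum trick relies on the operation being exactly associative; for floating-point values, rounding makes addition order-sensitive, so sorting is the safer choice there.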

RDD Persistence Policy

This applies only to RDD-based Patterns.

Executors and some compilers work at the RDD level: RDDs are passed to and returned from the functions that each executor or compiler implements. It is important to define a common policy for persisting these RDDs. Otherwise, Spark may throw an exception because an RDD is persisted twice with different storage levels.

The established policy is as follows:

  • RDDs that are passed to each execute function are guaranteed to be efficiently reusable multiple times, because the library either persists them or ensures equivalent reusability. Implementations shall not persist these RDDs, nor shall they require or assert that the RDDs are persisted.
  • RDDs that are returned by each execute function do not have to be persisted. Implementations may persist them if useful. The processing library may persist returned RDDs itself, if they are not already persisted.
