Batch Processing Best Practices
For Spark Developers
This section examines the best practices for developers creating batch processing pipelines for the HERE platform using the Apache Spark framework.
The Pipeline Service runs batch pipeline jobs on the Apache Spark framework. When a pipeline is created, its Maven archetype is selected for either the Spark or the Flink framework. Consequently, the same pipeline cannot run as both a batch pipeline and a stream pipeline.
The workflow of algorithmic transformations and manipulations occurs as a sequence of steps that Spark considers driver tasks. This workflow is what makes each pipeline unique. You can see a summary of the many possible pipeline design patterns in Pipeline Patterns. All HERE platform batch pipelines use the Data Processing Library. To learn more about the possible design permutations, see the article Architecture for Batch Pipelines in the Data Processing Library Guide.
Pipelines and the Data Processing Library
Another useful article in the Data Processing Library Guide examines the relationship between Pipelines and the Data Processing Library.
You can set the runtime resource levels for a batch pipeline in its Pipeline Template. The same runtime parameters are used as with a stream pipeline, but their meanings are somewhat different.
supervisor_units - This refers to resource allocations for a Spark Master. This value equates to the number of CPU cores allocated by default to the cluster. The range is 1-15.
worker_units - This refers to the number of CPU core resources available to each driver task. The range is 1-15.
workers - A batch pipeline runs as independent sets of processes on a cluster, and the cluster manager allocates the available resources to the executors that run its tasks. One executor is allocated per worker node. Thus, the number of workers specified represents both the number of worker nodes and the number of executors available. The range is 1-15.
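The three parameters and their ranges can be checked before a template is created. The following is a minimal sketch, not part of any HERE library: the class name BatchClusterConfig and the total_units helper are hypothetical, and the total assumes the cluster consists of one supervisor plus the specified number of workers, each sized by worker_units.

```python
from dataclasses import dataclass

# Allowed range for each runtime parameter, as stated in the list above.
MIN_UNITS, MAX_UNITS = 1, 15


@dataclass(frozen=True)
class BatchClusterConfig:
    """Hypothetical holder for the runtime parameters of a batch pipeline template."""
    supervisor_units: int
    worker_units: int
    workers: int

    def __post_init__(self):
        # Reject any parameter outside the documented 1-15 range.
        for name in ("supervisor_units", "worker_units", "workers"):
            value = getattr(self, name)
            if not isinstance(value, int) or not MIN_UNITS <= value <= MAX_UNITS:
                raise ValueError(
                    f"{name} must be an integer in {MIN_UNITS}-{MAX_UNITS}, got {value!r}"
                )

    def total_units(self) -> int:
        # Assumption: one supervisor plus `workers` nodes of `worker_units` each.
        return self.supervisor_units + self.workers * self.worker_units


config = BatchClusterConfig(supervisor_units=1, worker_units=2, workers=4)
print(config.total_units())  # 1 + 4 * 2 = 9
```

Validating the values locally only catches out-of-range input early; the Pipeline Service remains the authority on what a template accepts.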
You can create templates through the platform portal GUI, the CLI, or at the API level. A good place to start is in the OLP CLI guide under pipeline templates. There, the CLI command pipeline template create lists these runtime parameters under optional. In the API reference, you will find the same information contained in the default cluster configuration of the PipelineTemplate parameter. For more information, see the API Reference entry for the
Scaling issues are handled by Spark. You can adjust the pipeline cluster configuration only to a limited degree.
For more information, see the Spark documentation on Tuning and Hardware Provisioning.
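As one illustration of the kind of guidance found there, the Spark Tuning guide recommends 2-3 tasks per CPU core in the cluster. The sketch below turns that rule of thumb into a partition count. The function name suggested_partitions is hypothetical, and the calculation assumes that one worker unit corresponds to one CPU core and that one executor runs per worker.

```python
def suggested_partitions(workers: int, worker_units: int, tasks_per_core: int = 2) -> int:
    """Estimate a partition count from the cluster size.

    Assumption: each worker contributes `worker_units` CPU cores.
    The Spark Tuning guide suggests 2-3 tasks per core.
    """
    if workers < 1 or worker_units < 1 or tasks_per_core < 1:
        raise ValueError("all arguments must be positive integers")
    total_cores = workers * worker_units
    return total_cores * tasks_per_core


# Example: 4 workers with 2 units each, 3 tasks per core.
print(suggested_partitions(workers=4, worker_units=2, tasks_per_core=3))  # 4 * 2 * 3 = 24
```

A value like this could serve as a starting point for a setting such as spark.default.parallelism. Treat it as a first estimate to refine by measurement, not as a fixed rule.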
Spark runtime libraries are automatically included in the HERE platform pipeline service. A Maven archetype is available to provide a preconfigured project structure for creating a new batch pipeline. For further information, see the Archetypes documentation.