Partitioners and Spark Additions

These packages contain implicit classes that provide additional features on top of Spark's standard SparkContext and RDD classes. These features are heavily used in the processing library and can help you implement RDD-based compiler patterns.

The package includes partitioners to use with the processing library. Partitioners define how data is split into Spark partitions. Since the Spark Partitioner interface is not type-safe, the processing library provides a specialized Partitioner[K] type.
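To illustrate the idea, here is a minimal sketch of what a type-safe partitioner abstraction might look like. The names `TypedPartitioner` and `TypedHashPartitioner` are illustrative and not the library's actual API; the non-negative modulo matches the scheme Spark's own `HashPartitioner` uses internally.

```scala
// Hypothetical sketch: Spark's Partitioner maps Any => Int, losing key types.
// A type-safe variant fixes the key type K at compile time.
abstract class TypedPartitioner[K] extends Serializable {
  def numPartitions: Int
  def getPartition(key: K): Int
}

// Example: a typed hash partitioner over String keys.
final class TypedHashPartitioner(val numPartitions: Int)
    extends TypedPartitioner[String] {
  def getPartition(key: String): Int = {
    // Non-negative modulo, as in Spark's HashPartitioner.
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}
```

With such a type, passing a key of the wrong type to `getPartition` becomes a compile-time error rather than a runtime surprise.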

Some notable partitioners are:

  • HashPartitioner: partitions keys uniformly by a hashcode that is calculated using the catalog ID, layer ID, and partition name. You can use it as a default partitioner.
  • NameHashPartitioner: calculates partitions based on a hashcode of the partition name only. Partitions with the same name are processed by the same Spark worker node, regardless of their catalog ID and layer ID.
  • LocalityAwarePartitioner: assigns RDD entries identified by HERE tile IDs to Spark partitions so as to increase data locality. Data locality is achieved when tiles that are geographically close to each other are processed by the same Spark worker node. For example, suppose that worker nodes maintain a cache of additional geospatial content needed to process each input tile. Such a cache is useful under the assumption that this content is reused across the processed tiles. Under this assumption, the cache hit ratio tends to be higher when input tiles close to each other are processed by the same Spark worker, because nearby input tiles tend to require the same additional content. When this assumption does not hold, using the LocalityAwarePartitioner has no advantage and can even be counterproductive.
  • AdapterPartitioner: wraps any Partitioner provided by Spark to support the library's type-safe Partitioner[K] interface.
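The following sketch contrasts the first two strategies: hashing the full key versus hashing the partition name only. The `Key` case class and function names are illustrative assumptions, not the library's API; only the hashing semantics follow the descriptions above.

```scala
// Hypothetical key shape: catalog ID, layer ID, and partition name.
final case class Key(catalogId: String, layerId: String, partitionName: String)

def nonNegativeMod(x: Int, n: Int): Int = {
  val m = x % n
  if (m < 0) m + n else m
}

// HashPartitioner-style: hash catalog ID, layer ID, and partition name together.
def fullKeyPartition(k: Key, numPartitions: Int): Int =
  nonNegativeMod((k.catalogId, k.layerId, k.partitionName).hashCode, numPartitions)

// NameHashPartitioner-style: hash the partition name only, so the same name
// lands in the same Spark partition regardless of catalog or layer.
def nameOnlyPartition(k: Key, numPartitions: Int): Int =
  nonNegativeMod(k.partitionName.hashCode, numPartitions)

val a = Key("catalog-1", "layer-x", "377894440")
val b = Key("catalog-2", "layer-y", "377894440") // same name, different catalog/layer

// Under name-only hashing, a and b are guaranteed to share a partition.
assert(nameOnlyPartition(a, 16) == nameOnlyPartition(b, 16))
```

Under full-key hashing the two keys may land in different partitions, which is the behavior you want for uniform load spreading; name-only hashing trades that uniformity for co-locating same-named partitions.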
