Pipeline Components

Many elements go into making a pipeline. The following list defines the most important components.

  • Pipeline – This is the top-level entity in the HERE platform that groups the work of a user. Users develop their own special purpose Pipelines. For each Pipeline, multiple versions of it can be stored and managed by the HERE platform system (see Pipeline Version below). Developing a pipeline is an iterative process where it is typical to upload new versions of the pipeline code for testing each pipeline with different input and output catalogs or settings. After testing, the pipeline can be put into production with live data.
  • Package – This is an immutable entity that represents a JAR file that has been uploaded onto the HERE platform pipeline. It contains compiled pipeline code and libraries. Because libraries are directly embedded in the JAR file, it is actually a Fat JAR file, but cannot exceed 500MB in file size. It includes metadata like the name of the file next to the actual binary artifact. The filename of the JAR file has a 200-character limit. Best practice for selecting a filename for the JAR file is to use a semantically meaningful name and include some kind of unique versioning system to prevent using the wrong JAR file. The JAR file is identified within the Pipeline Template and is uploaded to the pipeline, where it is assigned a unique Package ID (UUID).
  • Pipeline Template – This is the immutable definition of an executable Pipeline Version and its run-time properties on the pipeline. It holds all the configuration information necessary to access, process, and store data. And, it is a requirement for creating an executable Pipeline Version (see below). The PipelineTemplate defines the actual run-time implementation of the pipeline and the input and output catalogs it will use. One PipelineTemplate can be used by multiple Pipeline Versions at the same time, though the input and output catalogs used can be overridden in some jobs. Each template is assigned a unique Template ID (UUID) by the pipeline when it is created.
  • Pipeline Version – This is an immutable entity that is the executable form of a pipeline within the HERE platform pipeline. Each Pipeline Version is created from a specific Pipeline JAR file and Pipeline Template. Each Pipeline Version is assigned its own Pipeline Version ID (UUID) by the pipeline when created. Multiple PipelineVersions can be defined based on a single Pipeline JAR file. However, two instances of the same Pipeline Version (and Pipeline Version ID) cannot run at the same time.
  • Operation – These are special operational commands that can be submitted to a Pipeline Version. The Pipeline Version can accept them or not, typically because they may be invalid or another Operation is already pending. Operations have a state that can be checked to see if an Operation is still in progress or has been completed with a result.
  • Job – A Job is an immutable entity that represents the one-time configuration parameters and input catalogs that are submitted to the cluster by a running Pipeline Version for that job only. When it specifies the exact versions of the input catalogs to be processed, that information overrides similar information specified in the Pipeline Template. It's possible to obtain the list of running or historical Jobs. Jobs have a state that may change over time as the Job runs and terminates.
  • Stream Job – For Stream processing, a Job is not intended to terminate by itself, but if this happens the platform may automatically restart the processing with a new Job. These jobs typically have a continuing stream of input data for processing.
  • Batch Job – For Batch processing, a Job terminates by either succeeding or failing to process the available input data. These jobs typically are run on one or more finite collections of data.
  • Runtime Configuration – A set of parameters that can be specified at run-time to configure the pipeline's default run-time environment. The PipelineTemplate specifies the default configuration parameters. However, some of these parameters may be respecified in specific job's configuration to override the default parameters of the template. In a Pipeline Version, the runtime configuration specifies the actual configuration passed to Jobs submitted for processing by that Pipeline Version. Custom configuration parameters are placed on the classpath of the pipeline, as application.properties, which you can reference from within the pipeline code. For more information, see the Configuration File Reference.
  • Scheduler ConfigurationSchedulerConfig is a property used in each Pipeline Version. The scheduler controls when a Job is created and submitted to the Flink or Spark cluster for processing. The scheduler may start a new job when the previous one completes as expected, or not as expected. The Scheduler polls or waits for changes from upstream catalogs. It can also operate due to timers or other external triggers. It includes properties like when to start a Job, whether to restart terminated jobs, and polling intervals for upstream catalogs. For more information, see the Configuration File Reference.
  • JAR file – see Package.
  • Cluster Configuration (or ClusterConfig) – This is a property used with the Pipeline Template and Pipeline Version. In a Pipeline Template, it represents the suggested minimum size of the cluster needed to run a Pipeline Version based on that Pipeline Template. It represents the actual size and configuration of the cluster dedicated to the execution of that specific Pipeline Version. The configuration specifies properties like the size of the cluster in processing units of a fixed number of CPUs and of memory. The following table lists the cluster configuration parameters. For more details, see the Quotas & Limits article and the Configuration File Reference.
Config Parameter Meaning
supervisorUnits Number of resource units per Supervisor (Flink JobManager or Spark Driver)
workerUnits Number of resource units per Worker (Flink TaskManager or Spark Executor)
workers Number of Workers (number of Flink TaskManagers or Spark Executors)
  • Input and Output Catalogs – The Input Catalog is the data source for the pipeline. The Output Catalog is the data destination from the pipeline. A pipeline can have more than one input catalog, but can only have one output catalog.

    For a Stream pipeline, you may need to specify catalog versions, depending on the type of catalog layer used (that is, versioned, volatile, or streaming).

    For a Batch pipeline, you can choose to run the pipeline immediately or schedule it to run when the input catalog data is updated. To run the pipeline immediately, you must specify the catalog versions. To schedule the pipeline mode, you do not need to specify the catalog versions. Instead, the pipeline scheduler will check the input and output catalogs every five minutes to capture changes and check for consistent versions for all the catalogs to be processed.

    For example, assume that the pipeline has two input catalogs and one has changes from an upstream catalog version 5 and the other input catalog also includes the same upstream catalog, but has not yet processed version 5. Then, you cannot run the pipeline because the two input catalogs versions are not consistent with the upstream catalog version.

  • pipeline-config.conf – This file lists the parameters describing input catalogs, output catalog, and billing tag. The pipeline uses this information to determine whether the catalogs have changed and the scheduled batch pipeline should be run to process the changes. For more information, see the Configuration File Reference.

  • pipeline-job.conf – This configuration file lists the parameters describing the input catalog versions and processing type. For Batch pipelines that are run using the on-demand mode, the users can choose the default values or provide specific information. For more information, see the Configuration File Reference.

See Also

results matching ""

    No results matching ""