In HERE Workspace, a pipeline is an application that channels input data through a defined sequence of processing steps to a single end point. A pipeline can run on a stream processing framework (Apache Flink) or a batch processing framework (Apache Spark).
For an introductory overview of pipelines, see the following video.
The pipeline application is compiled into a fat JAR (Java ARchive) file for distribution and use in the HERE platform environment. The pipeline application has two basic components:
The framework interface is determined by the data being processed and the selected processing framework. Data ingestion and data output are likewise framework-dependent; together they form the basic components required for the pipeline to run within the processing framework. The basics of a pipeline development project are predefined by Maven archetypes supplied in the HERE Data SDK for Java and Scala.
The data processing workflow consists of the hard-coded data transformation algorithms that make each pipeline unique, especially when using HERE libraries and resources. The specialized algorithms in the pipeline are required to transform the data input into a useful form in the data output.
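As a conceptual sketch only (this is not the platform API; the record type, field names, and business rule here are hypothetical), a workflow is a hard-coded chain of transformations that turns input records into the form the data sink will store:

```java
import java.util.List;
import java.util.stream.Collectors;

public class WorkflowSketch {
    // Hypothetical record type standing in for decoded input data.
    record RoadSegment(String id, double speedKph) {}

    // The "workflow": a fixed chain of transformations unique to this pipeline.
    static List<String> transform(List<RoadSegment> input) {
        return input.stream()
                .filter(s -> s.speedKph() > 0)                      // drop invalid records
                .map(s -> s.id() + ":" + Math.round(s.speedKph()))  // hypothetical business rule
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<RoadSegment> input = List.of(
                new RoadSegment("a", 50.4),
                new RoadSegment("b", -1.0));
        System.out.println(transform(input)); // prints [a:50]
    }
}
```

In a real pipeline the same transformation logic is expressed through the Flink or Spark APIs rather than `java.util.stream`, but the shape is the same: input records in, algorithmically transformed records out.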
The workflow results are supplied as output to the data sink for temporary storage. The workflow executes a unique set of business rules and algorithmic transformations on the data. Run-time considerations are typically addressed as a set of specific configuration parameters applied to the pipeline when it is deployed and a job is initiated.
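For illustration only (these keys and values are hypothetical, not platform-defined settings), such run-time parameters are typically externalized in a configuration file rather than hard-coded into the workflow:

```properties
# Hypothetical deployment-time parameters; actual names depend on your pipeline.
input.catalog.hrn=hrn:here:data::example:input-catalog
output.catalog.hrn=hrn:here:data::example:output-catalog
processing.window.minutes=5
```

Keeping such values out of the compiled JAR lets the same pipeline artifact be deployed against different catalogs or tuning parameters without rebuilding.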
The application must be configured for either stream or batch processing, never both. Every pipeline must have at least one data source (catalog) and exactly one data sink (catalog), both external to the pipeline itself, as shown in the following image.
Pipelines are implemented as a software pipeline application, in which a series of data processing elements are encapsulated into a reusable pipeline component. Each pipeline processes input data in streams or batches, and outputs the results to a destination catalog. Pipelines can be stream-based (Flink) or batch-based (Spark).
To develop a pipeline in HERE Workspace, you must first create a pipeline JAR file.
For an overview of the pipeline development workflow, see the following video.
Each pipeline must be designed, built, and compiled into a JAR file before being deployed to the platform for execution:
Flink and Spark have unique requirements for their JAR file designs.
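As one common approach (a sketch, not HERE's prescribed build configuration; the version and main class are placeholders), a fat JAR is typically produced with the Maven Shade Plugin in the project's `pom.xml`:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Placeholder main class; use your pipeline's entry point. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.MyPipelineMain</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Which dependencies belong inside the fat JAR differs by framework; for example, framework classes already provided by the runtime are usually excluded (scope `provided`) so they are not bundled twice.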
A new pipeline is often created using the following process:
An operational requirement describes the individual pipelines, the catalogs they use, and their execution sequencing. It defines the unique topology to be deployed, which can include as many individual pipeline stages as the computing environment can support. This flexibility also allows pipelines to be designed for reuse where needed.
Every pipeline has a data source and an output catalog that holds the data processed by the pipeline. The output catalog must be compatible with the data transformations the pipeline performs. For the range of possible variations in input and output catalogs, refer to Pipeline Patterns.
Pipelines are deployed and monitored as they run. During deployment, the data sources and data destinations are defined, which is required to implement more complex topologies. Activity is monitored and logged for later analysis as the pipelines execute. Additional tools can be used to generate alerts based on events logged during data processing.
For an overview of pipeline deployment, see the following video.
A deployed pipeline begins processing data when you issue an activate command for its specific Pipeline Version ID. Only a configured Pipeline Version can process data on HERE Workspace. The following table describes operational commands that can be directed to any running pipeline.
| Command | Description |
|---------|-------------|
| Pause | Freezes pipeline operation until a resume command is issued. |
| Resume | Restarts a paused pipeline from the point where execution was paused. |
| Cancel | Stops an executing pipeline so that it processes no further data. |
| Upgrade | Replaces one pipeline version with another. |
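The command set above implies a simple lifecycle. As an illustrative model only (not the platform's implementation; state names here are invented), it can be sketched as a state machine:

```java
import java.util.Map;

public class PipelineLifecycle {
    enum State { READY, RUNNING, PAUSED, CANCELED }

    // Allowed transitions per command, mirroring the command table.
    static final Map<String, Map<State, State>> TRANSITIONS = Map.of(
            "activate", Map.of(State.READY, State.RUNNING),
            "pause",    Map.of(State.RUNNING, State.PAUSED),
            "resume",   Map.of(State.PAUSED, State.RUNNING),
            "cancel",   Map.of(State.RUNNING, State.CANCELED,
                               State.PAUSED,  State.CANCELED));

    static State apply(State current, String command) {
        State next = TRANSITIONS.getOrDefault(command, Map.of()).get(current);
        if (next == null) {
            throw new IllegalStateException(command + " not valid in " + current);
        }
        return next;
    }

    public static void main(String[] args) {
        State s = apply(State.READY, "activate"); // RUNNING
        s = apply(s, "pause");                    // PAUSED
        s = apply(s, "resume");                   // RUNNING
        System.out.println(apply(s, "cancel"));   // prints CANCELED
    }
}
```

The upgrade command, which swaps one pipeline version for another, is omitted from this sketch because it operates on versions rather than on a single running state.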
If you plan to start using HERE Workspace, note the following:
HERE Workspace is designed to build distributed processing pipelines for location-related data. Your data is stored in catalogs. Processing is done in pipelines, which are applications written in Java or Scala, and run on an Apache Spark or Apache Flink framework.
HERE Workspace abstracts away the management, provisioning, scaling, configuration, operation, and maintenance of server-side components, letting you focus on the application logic and the data required to develop a use case. HERE Workspace is intended mainly for developers performing data processing, compilation, and visualization, while allowing data analysts to perform ad-hoc development.
For additional pipeline recommendations, see the following video.