Pipelines

In HERE Workspace, a pipeline is an application that channels input data through a defined sequence of processing steps to a single end point. A pipeline can run on a stream processing framework (Apache Flink) or a batch processing framework (Apache Spark).

For an introductory overview of pipelines, see the following video.

Pipeline Components

The pipeline application is compiled into a fat JAR (Java ARchive) file for distribution and use in the HERE platform environment. The pipeline application has two basic components:

  • The framework interface is determined by the data being processed and the selected processing framework. The data ingestion and data output interfaces are likewise framework-dependent. Together, these can be considered the basic components required for the pipeline to run within the processing framework. The basic structure of a pipeline development project is predefined by the Maven archetypes supplied in the HERE Data SDK for Java and Scala.

  • The data processing workflow consists of the hard-coded data transformation algorithms that make each pipeline unique, especially when they use HERE libraries and resources. These specialized algorithms transform the input data into a useful form in the data output.

    The workflow results are supplied as output to the data sink for temporary storage. The workflow executes a unique set of business rules and algorithmic transformations on the data according to its design. Run-time considerations are typically addressed as a set of specific configuration parameters applied to the pipeline when it is deployed and a job is initiated. A minimal code sketch of how these components might fit together in a stream pipeline follows this list.
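The following is a minimal sketch of how these parts might be separated in a stream pipeline, assuming the standard Apache Flink DataStream API. The class name, socket source, print sink, and uppercase transformation are illustrative placeholders only; in a real pipeline, the HERE Data SDK supplies the ingestion and output interfaces.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExamplePipeline {

    public static void main(String[] args) throws Exception {
        // Framework interface: obtain the execution environment from Flink.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Data ingestion (placeholder source; the Data SDK provides the real one).
        DataStream<String> input = env.socketTextStream("localhost", 9999);

        // Data processing workflow: the transformations that make this pipeline unique.
        DataStream<String> output = transform(input);

        // Data output (placeholder sink; the Data SDK provides the real one).
        output.print();

        env.execute("example-stream-pipeline");
    }

    // Hypothetical transformation; a real pipeline applies its business rules here.
    private static DataStream<String> transform(DataStream<String> input) {
        return input.map(value -> value.toUpperCase());
    }
}
```

Factoring the transformation into its own method keeps the business logic separate from the framework interface, which makes the workflow easier to test and reuse.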

Note

The application must be configured exclusively for use in either a stream or a batch processing environment, never both. Every pipeline must have at least one data source (catalog) and exactly one data sink (catalog) that are external to the pipeline itself, as shown in the following image.

Typical pipeline architecture including a framework interface, a data ingestion interface, a data output interface, and a data processing workflow.
Figure 1. Typical Pipeline

Pipeline Features

Pipelines are implemented as software applications in which a series of data processing elements is encapsulated into a reusable component. Each pipeline processes input data in streams or batches and outputs the results to a destination catalog. Pipelines can be:

  • Any combination of data processing algorithms in a reusable JAR file.
  • Built using Scala or Java, based on a standard pipeline application template.
  • Compiled and distributed as fat JAR files for ease of management, deployment, and use.
  • Highly specialized or very flexible, based on how the data processing workflow is designed.
  • Deployed with a set of run-time parameters that allow as much pipeline flexibility as needed.
  • Chained by using the output catalog of one pipeline as the input catalog of another pipeline.
  • Deployed and managed from the command-line interface (OLP CLI), the portal, or the pipelines API.

Pipeline Development Workflow

To develop a pipeline in HERE Workspace, you must first create a pipeline JAR file.

For an overview of the pipeline development workflow, see the following video.

Each pipeline must be designed, built, and compiled into a JAR file before being deployed to the platform for execution:

  • If a stream environment is selected, the JAR file must be executable on the Apache Flink framework embedded in the pipeline.
  • If a batch environment is selected, the JAR file must be executable on the Apache Spark framework embedded in the pipeline.

Note

Flink and Spark have unique requirements for their JAR file designs.
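As an illustration of the batch case, here is a hedged sketch of a driver that could run on the Apache Spark framework, using the standard Spark SQL API. The class name, file paths, and filter expression are illustrative assumptions; a real batch pipeline reads its source catalog and writes its output catalog through the HERE Data SDK.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExampleBatchPipeline {

    public static void main(String[] args) {
        // Framework interface: obtain the Spark session from the cluster.
        SparkSession spark = SparkSession.builder()
                .appName("example-batch-pipeline")
                .getOrCreate();

        // Data ingestion (placeholder path; a real pipeline reads a source catalog).
        Dataset<Row> input = spark.read().parquet("/path/to/input");

        // Data processing workflow: business rules applied to the batch.
        Dataset<Row> output = input.filter("value IS NOT NULL");

        // Data output (placeholder path; a real pipeline writes to the output catalog).
        output.write().parquet("/path/to/output");

        spark.stop();
    }
}
```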

Pipeline Creation

A new pipeline is often created using the following process:

  1. Define a functional objective or business goal for the pipeline, such as a basic data workflow.
  2. Develop a set of algorithms, in either Java or Scala, that manipulates the data to achieve that objective and is compatible with the pipeline templates.
  3. Integrate the implemented algorithms into a pipeline application targeting a streaming or batch processing model. Maven archetypes are then used to build the pipeline.
  4. Define any runtime parameters required by the implemented algorithms during the integration process.
  5. Create and test a fat JAR file. This fat JAR file, along with any associated libraries or other assets, is the deliverable that is deployed onto the pipeline.
Diagram showing the new pipeline development process.
Figure 2. Pipeline development process
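
As a sketch of step 4, the following shows one way deployment-time parameters could reach the implemented algorithms in a stream pipeline, assuming Flink's ParameterTool utility. The parameter names and default values are illustrative, not part of the HERE platform contract.

```java
import org.apache.flink.api.java.utils.ParameterTool;

public class PipelineParameters {

    public static void main(String[] args) {
        // Parse the command-line arguments passed to the pipeline at deployment.
        ParameterTool params = ParameterTool.fromArgs(args);

        // Hypothetical runtime parameters defined during integration (step 4).
        String inputLayer = params.get("input-layer", "raw-events");
        int windowMinutes = params.getInt("window-minutes", 5);

        System.out.printf("Processing layer %s with a %d-minute window%n",
                inputLayer, windowMinutes);
    }
}
```

Keeping such values as runtime parameters, rather than hard-coding them, is what allows the same fat JAR to be redeployed with different configurations.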

Operational Requirement

An operational requirement describes the individual pipelines and catalogs used and their execution sequencing. It defines the unique topology to be deployed, which can include as many individual pipeline stages as the computing environment can support. This flexibility also allows pipelines to be designed for reuse where needed.

Catalog Compatibility

Every pipeline has a data source and an output catalog that contains the data the pipeline processes. That output catalog must be compatible with the data transformations performed in the pipeline. For information on the range of possible variations in the input and output catalogs, refer to Pipeline Patterns.

Pipeline Deployment

Pipelines are deployed and then monitored as they run. During deployment, the data sources and data destinations are defined, which makes it possible to implement more complex topologies. As pipelines execute, their activity is monitored and logged for later analysis. Additional tools can be used to generate alerts based on events logged during data processing.

For an overview of pipeline deployment, see the following video.

A deployed pipeline begins processing data when you issue an activate command for its specific Pipeline Version ID; only a configured Pipeline Version can process data on HERE Workspace. The following table describes the operational commands that can be directed to any running pipeline.

Command   Description
Pause     Freezes pipeline operation until a resume command is issued.
Resume    Restarts a paused pipeline from the point where execution was paused.
Cancel    Terminates an executing pipeline so that it stops processing data.
Upgrade   Replaces a running pipeline version with a different pipeline version.

For Developers

If you plan to start using HERE Workspace, note the following:

  • HERE Workspace is designed to build distributed processing pipelines for location-related data. Your data is stored in catalogs. Processing is done in pipelines, which are applications written in Java or Scala, and run on an Apache Spark or Apache Flink framework.

  • HERE Workspace abstracts away the management, provisioning, scaling, configuration, operation, and maintenance of server-side components, and lets you focus on the logic of the application and the data required to develop a use case. HERE Workspace is intended mainly for developers performing data processing, compilation, and visualization, while also allowing data analysts to perform ad hoc development.

For additional pipeline recommendations, see the following video.
