A pipeline is simply a Java or Scala application that reads from one or more input sources, applies some processing, and writes the results to a single endpoint. In HERE Workspace, you can also enrich your data with HERE map and traffic data sets.
Two of our most common use cases are:
Here are some other scenarios where pipelines might be a good fit for you:
For an introductory overview of pipelines, see the following video.
The pipeline application is compiled into a JAR file for deployment in HERE Workspace. A pipeline can run on a stream processing framework (Apache Flink) or batch processing framework (Apache Spark). The pipeline application has two basic components:
**The framework interface.** This is determined by the data being processed and the selected processing framework, as are the data ingestion and data output components; all of these are basic components required for pipeline execution. The skeleton of a pipeline development project is predefined by the Maven archetypes supplied in the HERE Data SDK for Java and Scala.
**The data processing workflow.** This consists of the hard-coded data transformation algorithms that make each pipeline unique, especially when using HERE libraries and resources. These specialized algorithms transform the input data into a useful form in the output.
The workflow results are written to the data sink for temporary storage. The workflow executes a unique set of business rules and algorithmic transformations on the data. Run-time considerations are typically addressed through configuration parameters applied to the pipeline when it is deployed and a job is initiated.
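To make the separation between the two components concrete, here is a minimal, framework-agnostic sketch in Java. The class, record, and method names are illustrative only, not part of the HERE Data SDK: the point is that the data processing workflow is kept as pure business logic, so the same code could be wired into either a Flink stream or a Spark batch job by the framework interface.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustrative only: a workflow kept separate from the framework interface. */
public class SpeedLimitWorkflow {

    /** A simplified record standing in for a decoded catalog partition. */
    public record RoadSegment(String id, int speedLimitKph) {}

    /**
     * The "data processing workflow": pure business logic with no
     * framework dependencies, so it can be unit-tested in isolation.
     */
    public static List<String> flagOverLimit(List<RoadSegment> segments, int maxKph) {
        return segments.stream()
                .filter(s -> s.speedLimitKph() > maxKph)
                .map(s -> String.format(Locale.ROOT, "%s exceeds %d kph", s.id(), maxKph))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The "framework interface" would normally be a Flink or Spark job
        // reading input partitions; here it is faked with an in-memory list.
        List<RoadSegment> input = List.of(
                new RoadSegment("seg-1", 50),
                new RoadSegment("seg-2", 130));
        System.out.println(flagOverLimit(input, 100)); // [seg-2 exceeds 100 kph]
    }
}
```

Keeping the transformation a pure function also makes the run-time configuration parameters mentioned above (here, `maxKph`) easy to inject at deployment time.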
The application must be configured for use in either a stream or a batch processing environment, never both. Every pipeline also depends on at least the following components, which are external to the pipeline itself:
Pipelines go through a design and implementation process before they can be used. After a pipeline is designed, it is implemented as an executable JAR file that HERE Workspace can run as needed. Each pipeline JAR file must meet the design requirements and restrictions of either a stream or a batch execution environment.
Flink and Spark each have specific requirements for their JAR file designs.
A new pipeline is typically created using the following process:
An operational requirement describes the individual pipelines and catalogs to be used, and their execution sequencing. Together these form a unique topology to be deployed, which can include as many individual pipeline stages as the computing environment can support. Pipelines can be designed for either single or multiple deployments.
Every pipeline has a data source and an output catalog that holds the data processed by the pipeline. The output catalog must be compatible with the data transformations performed in the pipeline. The following shows a range of possible variations in input and output catalogs:
During deployment, the data sources and data destinations are defined; this is what makes more complex topologies possible. As pipelines execute, their activity is monitored and logged for later analysis, and additional tools can generate alerts based on events during data processing.
Only a configured pipeline version can process data in HERE Workspace. The following table describes the operational commands that can be directed to a pipeline version:
| Command | Description |
|---------|-------------|
| Activate | Starts data processing on a deployed pipeline version. |
| Delete | Removes a deployed pipeline version. |
| Pause | Suspends pipeline version operation until a resume command is issued. |
| Resume | Restarts a paused pipeline version from the point where execution was paused. |
| Cancel | Stops an executing pipeline version from processing any further data. |
| Upgrade | Replaces a pipeline version with another pipeline version. |
If you are a developer and you want to start using the HERE Workspace, note the following:
HERE Workspace is designed to build distributed processing pipelines for location-related data. Your data is stored in catalogs. Processing is done in pipelines, which are applications written in Java or Scala and run on an Apache Spark or Apache Flink framework.
HERE Workspace abstracts away the management, provisioning, scaling, configuration, operation, and maintenance of server-side components, and lets you focus on the logic of the application and the data required to develop a use case. HERE Workspace is aimed mainly at developers doing data processing, compilation, and visualization, but also allows data analysts to do ad-hoc development.