In the HERE Workspace, a pipeline is an executable data processing application designed to run on a stream or batch processing framework that is part of the Pipeline API. The application is called a pipeline because it channels the input data through a defined sequence of manipulations and transformations that end at a single point. The pipeline application is compiled into a fat JAR file for distribution and use in the operational environment provided by the Pipeline API.
A pipeline has two basic components: the framework interface and the data processing workflow. The data processing workflow consists of the hard-coded data transformation algorithms that make each pipeline unique, especially when they use Workspace libraries and resources. These specialized algorithms transform the input data into a useful (productized) form in the output data.
Execution can occur in a streaming environment based on Apache Flink or in a batch environment based on Apache Spark. However, the application must be configured in advance for exactly one of these environments, never both. Every pipeline must have at least one data source (catalog) and exactly one data sink (catalog), both external to the pipeline itself, as shown below.
The Framework Interface, Data Ingestion, and Data Output are all artifacts of the data being processed and the selected processing framework. They can be considered basic components that the pipeline requires in order to run within the processing framework. The Maven archetypes supplied in the HERE Data SDK for Java & Scala predefine the basics of a pipeline development project. The data processing workflow is what distinguishes one pipeline from another: it defines the specific data manipulations and transformations applied to the input from the Data Source. The workflow results are output to the Data Sink for temporary storage. Each workflow executes its own set of business rules and algorithmic transformations on the data. Run-time considerations are typically addressed through a set of configuration parameters that are applied to the pipeline when it is deployed and a job is initiated.
HERE pipelines are implemented as classic software pipeline applications, in which a series of data processing elements is encapsulated into a reusable pipeline component. Each pipeline is intended to process input data in streams or batches and output the results to a specified destination catalog.
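The idea of a series of processing elements wrapped in one reusable component can be illustrated with a minimal, framework-free sketch. It is not HERE, Flink, or Spark code: the class `PipelineSketch` and its methods `addStep` and `run` are hypothetical names chosen for illustration, and plain in-memory lists stand in for the source and destination catalogs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of a pipeline: one source, ordered steps, one sink. */
final class PipelineSketch {
    // The data processing workflow: transformations applied in insertion order.
    private final List<Function<String, String>> steps = new ArrayList<>();

    /** Appends one transformation to the workflow and returns this for chaining. */
    PipelineSketch addStep(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    /** Passes every source record through all steps and collects the results in a single sink. */
    List<String> run(List<String> source) {
        List<String> sink = new ArrayList<>();
        for (String record : source) {
            String value = record;
            for (Function<String, String> step : steps) {
                value = step.apply(value);
            }
            sink.add(value);
        }
        return sink;
    }

    public static void main(String[] args) {
        // An in-memory list stands in for the input catalog (assumption).
        List<String> output = new PipelineSketch()
                .addStep(String::trim)
                .addStep(String::toUpperCase)
                .run(Arrays.asList(" berlin ", "chicago"));
        System.out.println(output); // prints [BERLIN, CHICAGO]
    }
}
```

In a real pipeline, the loop in `run` would be replaced by the operators of the selected processing framework, and the lists by catalog reads and writes.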
Developing a pipeline in the HERE Workspace is the process of creating a pipeline JAR file.
Pipelines go through a design and implementation process before they can be used operationally. After the pipeline is designed and implemented, it exists as an executable JAR file. These JAR files can be used operationally by the Workspace as needed. Because each pipeline JAR file can only be used in either a streamed or a batch execution environment, its design must accommodate the design requirements and restrictions of that selected environment. If a streamed environment is selected, the JAR file must be executable on the Apache Flink framework embedded in the pipeline. If a batch environment is selected, the JAR file must be executable on the Apache Spark framework embedded in the pipeline. Flink and Spark each have specific requirements for their JAR file designs.
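The either-or choice between environments can be made explicit in a small model. The sketch below is only an illustration of that constraint, not part of any HERE API: `PipelineTarget`, `Environment`, and `describe` are hypothetical names, and the JAR name is a placeholder.

```java
/** Hypothetical descriptor stating that a pipeline JAR targets exactly one environment. */
final class PipelineTarget {
    /** The two execution environments named in the text and their frameworks. */
    enum Environment {
        STREAM("Apache Flink"),
        BATCH("Apache Spark");

        final String framework;

        Environment(String framework) {
            this.framework = framework;
        }
    }

    final String jarName;
    final Environment environment;

    /** A single Environment field makes "stream and batch at once" unrepresentable. */
    PipelineTarget(String jarName, Environment environment) {
        if (jarName == null || environment == null) {
            throw new IllegalArgumentException("a JAR name and exactly one environment are required");
        }
        this.jarName = jarName;
        this.environment = environment;
    }

    /** Returns the JAR name followed by the framework it must run on. */
    String describe() {
        return jarName + " -> " + environment.framework;
    }

    public static void main(String[] args) {
        PipelineTarget target = new PipelineTarget("my-pipeline.jar", Environment.STREAM);
        System.out.println(target.describe()); // prints my-pipeline.jar -> Apache Flink
    }
}
```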
A new pipeline is typically created using the following process:
An operational requirement describes the individual pipelines and catalogs to be used and the sequence in which they execute. This is the unique topology that must be deployed. It can include as few as one pipeline stage or as many as the computing environment can support. Because of this flexibility, pipelines can be designed for reuse, or not, as needed.
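One way to reason about such a topology is to treat each stage as a node with one or more input catalogs and exactly one output catalog, and then derive an order in which every stage runs after the stages that produce its inputs. The sketch below assumes this simplified model. `TopologySketch`, `Stage`, `executionOrder`, and the catalog names are hypothetical, and the ordering logic is a plain dependency sort, not how the Workspace schedules pipelines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a multi-stage topology linked through catalogs. */
final class TopologySketch {
    /** One pipeline stage: at least one input catalog, exactly one output catalog. */
    static final class Stage {
        final String name;
        final List<String> inputCatalogs;
        final String outputCatalog;

        Stage(String name, List<String> inputCatalogs, String outputCatalog) {
            this.name = name;
            this.inputCatalogs = inputCatalogs;
            this.outputCatalog = outputCatalog;
        }
    }

    /** Orders stage names so each stage follows the stages that write its input catalogs. */
    static List<String> executionOrder(List<Stage> stages) {
        // Catalogs written by some stage; any other input is treated as an external source.
        Set<String> produced = new HashSet<>();
        for (Stage stage : stages) {
            produced.add(stage.outputCatalog);
        }
        Set<String> available = new HashSet<>(); // output catalogs already written
        List<Stage> pending = new ArrayList<>(stages);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            Stage next = null;
            for (Stage stage : pending) {
                boolean ready = true;
                for (String input : stage.inputCatalogs) {
                    if (produced.contains(input) && !available.contains(input)) {
                        ready = false;
                        break;
                    }
                }
                if (ready) {
                    next = stage;
                    break;
                }
            }
            if (next == null) {
                throw new IllegalStateException("cyclic catalog dependency");
            }
            pending.remove(next);
            available.add(next.outputCatalog);
            order.add(next.name);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(
                new Stage("enrich", Arrays.asList("cleaned", "reference"), "enriched"),
                new Stage("clean", Arrays.asList("raw"), "cleaned"));
        System.out.println(executionOrder(stages)); // prints [clean, enrich]
    }
}
```

Here `raw` and `reference` act as external sources because no stage writes them, so `clean` can run first and `enrich` follows once `cleaned` exists.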
Keep in mind that pipelines do not live in a vacuum. For every pipeline there is a data source and an output catalog to contain the data processed by the pipeline. That output catalog must be compatible with the data transformations done in the pipeline. The range of possible variations in input and output catalogs can be seen in:
Operationally, pipelines are deployed and monitored as they run. The deployment process includes definitions of data sources and data destinations, thus implementing the more complex topologies described in a specific operational requirement. As the pipelines execute, their activity is monitored and logged for later analysis, if required. Additional tools can be used to generate alerts based on events during data processing.
Deployed pipelines begin processing data when an activate command is issued for a specific Pipeline Version ID. Only a configured Pipeline Version can process data on the HERE Workspace. The following operational commands can be directed to any running pipeline:
If you are a developer and you want to start using the HERE Workspace, note the following:
The Workspace is designed to build distributed processing pipelines for location-related data. Your data is stored in catalogs. Processing is done in pipelines, which are applications written in Java or Scala and run on an Apache Spark or Apache Flink framework.
The Workspace abstracts away the management, provisioning, scaling, configuration, operation, and maintenance of server-side components and lets you focus purely on the logic of the application and the data required to develop a use case. The Workspace is therefore aimed mainly at developers doing data processing, compilation, visualization, and similar work, but it also allows data analysts to do ad-hoc development.
At this point, go to: