You must design, develop, and test a pipeline before it can be executed on the pipeline service. The end product of this process is a pipeline JAR file that can be deployed to the pipeline service and used to process data. This topic was introduced in Pipeline Lifecycle.
HERE platform pipelines can be simple or complex, depending on the data processing workflow being implemented. The basic structure of the pipeline code is well established, and a new build project can be initiated using a Maven Pipeline Template (that is, a Maven archetype) for a stream or batch processing workflow. Different templates are used so that the pipeline service instantiates the correct type of pipeline: batch or stream. More advanced developers may decide not to use the Maven archetypes for a new pipeline project, but developers new to the HERE platform should use them for their initial projects.
A Maven Pipeline Template is a reusable definition of a Pipeline that includes the implementation and all the information needed to make it executable, including:
the entry point, which is the name of the main class in the pipeline JAR file
the definition and schema of the input and output catalogs to which the implementation connects
the type of runtime framework required
the default runtime configuration and parameters
any special runtime requirements needed by the pipeline's data processing code
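The entry point named in the template is just an ordinary main class packaged into the JAR file. As a minimal sketch (the class name, parameter handling, and catalog arguments below are illustrative assumptions, not the SDK's actual API; a real pipeline would hand control to the Flink or Spark framework here):

```java
// Hypothetical entry point for a pipeline JAR file. This class name is what
// the JAR manifest (or the Maven build configuration) would declare as the
// main class so the pipeline service can launch the application.
public class PipelineMain {

    // Illustrative only: summarizes the input/output catalogs passed as
    // arguments. A real pipeline would initialize its processing framework.
    public static String describe(String[] args) {
        String inputCatalog = args.length > 0 ? args[0] : "unset";
        String outputCatalog = args.length > 1 ? args[1] : "unset";
        return "input=" + inputCatalog + ", output=" + outputCatalog;
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```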
The structure of a typical HERE platform pipeline is shown below.
Much of this structure is taken care of by the pipeline project templates (Maven archetypes) supplied in the SDK. However, the customer data processing block in the diagram is where pipeline customization occurs. This block contains all of the logic and data manipulation algorithms required to implement the desired processing workflow. Detailed discussions of how to implement this workflow can be found in the following guides:
The Data Client Library - Scala/Java libraries that you can use in your projects to access the HERE platform for implementing stream processing pipelines
The Data Processing Library - provides a means to easily interact with both the Pipeline API and the Data API via Spark for implementing batch processing pipelines
While creating a new pipeline is not a simple task, the HERE platform and its tools streamline the process as much as possible. Individual pipeline development generally follows this sequence of events:
Determine the processing goals of the pipeline.
Identify the pipeline parameters (name, description, version, data source, data destination).
Define the processing activities that will take place in the pipeline and their order of execution (the workflow).
Develop the algorithms for manipulating the data to be processed in the pipeline.
Integrate the workflow and algorithms into the executable pipeline JAR file in a local development environment.
Define required and optional configuration parameters and files for using the pipeline.
Test the new pipeline with development datasets.
Release the pipeline for deployment in a production environment.
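The step of defining required and optional configuration parameters can be sketched with plain `java.util.Properties`. The property keys and defaults below are hypothetical, not a fixed platform convention:

```java
import java.util.Properties;

// Sketch of loading pipeline configuration with required and optional
// parameters. The property keys and the parallelism default are
// illustrative assumptions only.
public class PipelineConfig {
    final String inputCatalogHrn;   // required
    final String outputCatalogHrn;  // required
    final int parallelism;          // optional, falls back to a default

    PipelineConfig(Properties props) {
        this.inputCatalogHrn = require(props, "pipeline.input.catalog");
        this.outputCatalogHrn = require(props, "pipeline.output.catalog");
        this.parallelism =
            Integer.parseInt(props.getProperty("pipeline.parallelism", "1"));
    }

    // Fails fast when a required parameter is missing.
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter: " + key);
        }
        return value;
    }
}
```

Failing fast on missing required parameters surfaces misconfiguration at startup instead of partway through a processing run.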
We address the greater challenge of real-world application development by enabling a developer workflow that guides the developer from an empty directory on their computer to a production-ready pipeline. Combined with the other capabilities of the platform, this workflow provides a significant time-to-market advantage. The following diagram illustrates this developer workflow.
The HERE platform developer workflow is split into eight distinct phases, as shown in the figure above. These phases sustain a high velocity for a team because they let individual developers perform their work in local development environments. The SDK contains tooling that provides scaffolding and mock-up APIs locally, and lets you clone data for development purposes. For more details, see HERE Workspace for Java and Scala Developers. There are also many examples that can provide guidance for specific programming tasks. For a list of available examples, see Code Examples.
When we talk about a pipeline JAR file, we are actually talking about a pipeline application compiled and packaged with its dependencies and assets into a Fat JAR file. However, a "Fat JAR file" can also refer to a non-pipeline application. For this reason, it is better to think of the packaged pipeline application as a pipeline JAR file, to eliminate possible confusion.
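The Maven archetypes configure the packaging of this fat JAR for you. For illustration only, a build typically does this with the `maven-shade-plugin`, which bundles the application with its dependencies and records the entry point in the manifest (the main class name below is a placeholder):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Records the pipeline entry point in META-INF/MANIFEST.MF -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.PipelineMain</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```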
The HERE platform is designed so that, when developing a new pipeline, you must know whether you are setting up a stream or a batch pipeline. This determines which Maven archetype is used to set up the new pipeline's development project and which framework the Pipeline Service uses. That project includes all basic dependencies required by the pipeline service. When you build the project, it creates a pipeline JAR file (also known as a "Fat JAR" or "Uber JAR") targeted to the appropriate pipeline service type. A JAR file built for one type cannot run as the other: a batch pipeline JAR file must be deployed and run as a batch pipeline, and a stream pipeline JAR file must be deployed and run as a stream pipeline. This is a basic requirement of the HERE platform.
Caution: JAR File Limits
The maximum length of the pipeline JAR filename is 200 characters. The maximum size of the pipeline JAR file is 500 MB. If the POST transaction for the JAR file cannot be completed within 50 minutes, the remote host closes the connection and returns an error.
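These limits can be checked locally before uploading. A minimal sketch, using only the limit values stated in the caution above (the class and method names are illustrative):

```java
import java.io.File;

// Pre-flight check of a pipeline JAR file against the documented limits:
// filename at most 200 characters, file size at most 500 MB.
public class JarLimits {
    static final int MAX_NAME_LENGTH = 200;
    static final long MAX_SIZE_BYTES = 500L * 1024 * 1024;

    public static boolean withinLimits(String fileName, long sizeBytes) {
        return fileName.length() <= MAX_NAME_LENGTH
            && sizeBytes <= MAX_SIZE_BYTES;
    }

    public static void main(String[] args) {
        File jar = new File(args[0]);
        if (!withinLimits(jar.getName(), jar.length())) {
            System.err.println("JAR exceeds pipeline service limits: " + jar);
        }
    }
}
```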
As mentioned above, a pipeline JAR file must be designed for batch processing or for stream processing. This involves using the correct Maven archetype from the SDK for your development project. Even if you are developing your pipeline without using a Maven Archetype, you must still meet all of the requirements covered by those archetypes. The archetype is designed as your project template for the basic pipeline code. There is more that your project needs, described in the following sections.
To develop a stream data pipeline, four run-time environments are available: Stream-2.0.0 (deprecated; includes the Apache Flink 1.7.1 framework), Stream-3.0 (Apache Flink 1.10.1), Stream-4.0 (Apache Flink 1.10.3), and Stream-5.0 (Apache Flink 1.13.5). The basics of developing a Flink application are described in the Flink v1.13 DataStream API Programming Guide, the Flink v1.10 DataStream API Programming Guide, or the Flink v1.7 DataStream API Programming Guide, depending on which environment you are using. For fixes and improvements between Flink 1.10 and Flink 1.13, see the following Flink release posts:
To develop a batch data pipeline, the run-time environments Batch-2.0.0 (deprecated), Batch-2.1.0 (deprecated), and Batch-3.0 (includes the Apache Spark v2.4.7 framework) are available. The basics of developing a Spark pipeline application are described in both Spark Quick Start 2.4.2 and Spark Quick Start 2.4.7. Because the HERE platform does not yet support SQL, you should focus on the RDD Programming Guide.
Caution: New Batch and Stream Environments
When a new Batch or Stream run-time environment is released, pipelines using the old run-time environment are supported for 6 months after the release of the new environment, to provide sufficient time to migrate existing pipelines. For more information, see Migrating pipeline to new run-time environment.
Not all catalog layers are compatible with both batch and stream pipelines; consider this when designing the pipeline. For further information, see Pipeline Patterns and the Data User Guide. Also, most libraries have their own Developer Guides. For a list of available Developer Guides, see the Documentation listing.