Pipeline Lifecycle

Typical Pipeline Operational Lifecycle

The operational lifecycle of a typical pipeline follows a consistent pattern. This pattern has three major phases whether you are using the Portal GUI, the CLI, or your own application using the API:

  1. Create the Pipeline
  2. Deploy the Pipeline
  3. Manage the Pipeline

The first two phases are summarized in Figure 1. Phase 1 is done in a local development environment. Phase 2 can be done using the Web Portal, the OLP CLI, or your own application using the Pipeline REST API.

process diagram of pipeline creation and deployment
Figure 1. OLP Pipeline Lifecycle

Phase 1: Creating the Pipeline

The goal of Phase 1 is to create a Pipeline JAR File. This JAR file contains the code for the pipeline framework, the data ingestion, the data output, and all of the data transformation logic required to implement the intended data processing workflow.

To simplify this task, project archetypes are provided to supply as much of the boilerplate code as possible and a framework to contain everything else. Different Maven archetypes set up a project for either a Batch pipeline or a Stream pipeline. These archetypes also provide all of the interface code needed to execute the pipeline in the proper framework within the platform, so the only thing the user has to provide is the data processing code itself. The Pipeline JAR file is actually a fat JAR file containing all of the libraries and other assets needed by the pipeline.
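
For example, a new pipeline project can be generated from one of these archetypes with Maven. The following is a minimal sketch; the archetype coordinates are placeholders, so check the platform SDK documentation for the exact groupId, artifactId, and version of the Batch or Stream archetype:

mvn archetype:generate \
    -DarchetypeGroupId=<archetype-group-id> \
    -DarchetypeArtifactId=<batch-or-stream-archetype-artifact-id> \
    -DarchetypeVersion=<archetype-version>

Maven then prompts for the groupId, artifactId, and version of the new project itself.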

  1. Phase 1 begins by defining the Business Requirements for the pipeline. These include the data source, data type/schema, process flow, and desired results of data processing.
  2. From the business requirements, the workflow is determined, the data schema is formally defined, and the data transformation algorithms are developed. The algorithms and data ingestion/output are implemented in Java or Scala language and integrated into the pipeline project.
  3. The Java/Scala code is compiled and packaged (see the build sketch after this list).
  4. The result is a JAR file that contains the code for data ingestion, data processing, and outputting the processed data. All the required libraries and other assets are added to make a fat JAR file. The resulting pipeline JAR file is unique, transportable, and reusable.
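
Assuming the archetype configures a fat-JAR plugin such as Maven Shade or Assembly (an assumption to verify in the generated pom.xml), producing the pipeline JAR is a standard Maven build:

mvn clean package

The resulting fat JAR is written to the project's target/ directory.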

    Note: Credentials Required

    Every pipeline application must be registered with the HERE platform before it can be used. This process is described in the Teams and Permissions User Guide. For specific procedural information, see the article Manage Apps.

This completes Phase 1, the Pipeline Creation process.

Comment

The Phase 1 process shown here is actually more complex than Phase 2, since it is not a simple task to design the transformation algorithms and translate them into compilable code. Nor does this process address the ancillary steps of testing, reviewing, or validating the pipeline code.
A good description of the detailed process of creating a batch pipeline, covering both Java and Scala, can be found in the Data Processing Library Developer Guide.

For more information, see Creating Pipelines.

Phase 2: Deploying the Pipeline

Pipeline deployment begins with the pipeline JAR file. Pipeline JAR Files are designed for either Batch or Stream processing. They are also designed to implement a specific data processing workflow for a specific data schema. There are also runtime considerations that are specified during deployment.

Select the pipeline JAR file to be deployed and do the following to prepare it for deployment:

  1. Create a Pipeline Object - This step involves setting up an instance of a pipeline and obtaining a Pipeline ID.

    Create a pipeline object
    Figure 2.
  2. Create a Template - This step involves uploading the pipeline JAR file and obtaining a Template ID. This step also requires the input and output catalog identifiers to be specified.

    Create a template
    Figure 3.
  3. Create a Pipeline Version - This step creates an executable instance of the pipeline and involves registering the runtime requirements for the deployed pipeline. A Pipeline Version ID is assigned upon successful completion of this step (see the CLI sketch after this list).

    Create a pipeline version
    Figure 4.
  4. The pipeline is now deployed and ready to be Activated.
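
As a sketch, the same three steps can be performed with the OLP CLI. The command forms below are illustrative; the exact arguments and options vary by CLI version, so check the Command Line Interface Developer Guide for the authoritative syntax:

olp pipeline create <pipeline-name>
olp pipeline template create <template-name> batch <path-to-fat-jar> <main-class> --input-catalog-ids <ids> --output-catalog-ids <ids>
olp pipeline version create <version-name> <pipeline-id> <template-id> <config-file>

Each command returns the corresponding identifier: the Pipeline ID, the Template ID, and the Pipeline Version ID.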

Activate the Pipeline

To execute a Pipeline, one of its Pipeline Versions must be activated.

To activate a Pipeline Version, apply the Activate operation to its Pipeline Version ID. A Batch pipeline can be activated to run on demand (Run Now) or it can be scheduled. With the Scheduled option, the Batch Pipeline Version can be executed when the input catalogs are updated with new data or based on a time schedule. See the following section for details on the various modes of execution.
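
For example, with the CLI, the Activate operation takes the Pipeline ID and the Pipeline Version ID (a sketch; the scheduling options are described in the Command Line Interface Developer Guide):

olp pipeline version activate <pipeline-id> <pipeline-version-id>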

Execution Modes for activating a Pipeline

There are several execution modes available for activating a pipeline version. The following summary describes these execution modes and their differences:

Batch pipelines:

  • On-demand: The pipeline enters the Scheduled state and immediately changes to the Running state to attempt to process the specified input data catalogs. When the job is done, the pipeline returns to the Ready state. No further processing is done, even if the input catalogs receive new data. Additional processing must be initiated manually.
  • Scheduled: The pipeline version enters the Scheduled state for a brief period of time and then changes to the Running state to begin processing the existing data in the input catalogs. After the job is completed, it returns to the Scheduled state where it waits for new data to be available in the input catalogs. Only new data is processed for each subsequent run.
  • Time Schedule: The pipeline enters the Scheduled state and waits for the next attempt time of the Time Schedule. Once the next attempt time has arrived, it changes to the Running state to begin processing the existing data in the input catalogs. After the job is completed, it returns to the Scheduled state where it waits for the next attempt time.

Stream pipelines:

  • On-demand: Not supported. At the moment, there is no option to specify an end time for a Stream pipeline, so it cannot be run once.
  • Scheduled: The pipeline begins in the Scheduled state for a brief period of time and then changes to the Running state to begin processing the data stream from the specified input catalog. The pipeline continues to run (and stays in the Running state) until it is paused, canceled, or deactivated.
  • Time Schedule: Not supported, because Stream pipelines process data continuously.

When you activate a Pipeline Version, a request is made to the pipeline service to initiate the execution of that pipeline version. A Job is created to start the execution and a Job ID is generated. When the job starts, the pipeline service returns a URL for all of the job's logs.


Note: Logging URL

The logging URL is returned automatically when activation is done from the Web Portal or the CLI. When using the Pipeline API, another request must be made to get the URL.

Note: Multiple Pipeline Versions

Additional Pipeline Versions can be created using the same Template or another Template. Each Pipeline Version is distinguished by its own unique Pipeline Version ID.

Caution: Limits per Pipeline

A Pipeline can have only one (1) Pipeline Version running/active at any time.

This lifecycle applies, with minor variations, to both Batch and Stream pipelines.

Info: Pipeline ID

It is important to remember that the deployment of any pipeline begins with creating an instance of that pipeline in the pipeline service. That instance is assigned a UUID for identification: the pipeline ID. Everything else is managed by the pipeline service under that pipeline ID, so it cannot change; it is immutable. The metadata associated with the pipeline ID is simply used as a convenient way to talk about the pipeline instance. So, names and descriptions may be changed, but as far as the pipeline service is concerned it is the same pipeline instance.

Note: Deployment Details

For more detailed information on how this all works, see the Deployment section.

Phase 3: Manage the Running Pipeline

Once the pipeline is activated and running, it will respond to the following operations:

  • Cancel
  • Deactivate
  • Delete
  • Pause
  • Resume
  • Show
  • Upgrade

To check the current state of a Pipeline Version, use the Web Portal, the CLI commands, or the Pipeline API.
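
For example, the Show operation is available from the CLI (a sketch; the exact output format varies by CLI version):

olp pipeline version show <pipeline-id> <pipeline-version-id>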

The basic pipeline runtime environment looks like this:

typical pipeline lifecycle from predeployment through deployment in a runtime environment
Figure 5. Runtime Environment

Terminate a Pipeline Version

A running Pipeline Version can be terminated via the following operations (see the CLI sketch after this list):

  • Pause
    1. For a Batch pipeline version, the current Job is completed and future Jobs are paused. Thus, the pause may not happen quickly.
    2. For a Batch pipeline version that is run on-demand, the Pause operation is not available. Such a pipeline can only be Canceled.
    3. For a Stream pipeline version, the current state is saved and the Job is gracefully terminated.
  • Cancel
    1. For a Batch or Stream pipeline version, the running Job is immediately terminated without saving state and the pipeline version moves to the Ready state.
  • Terminate (internal) - This is an internal operation only. The current Job terminates with a success or failure. If the pipeline version is configured to run again, it is set to the Scheduled state; otherwise, it is set to the Ready state.
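
As a sketch, the Pause and Cancel operations (and the Resume operation described in the following note) take the same identifiers on the CLI; the command forms are illustrative, so confirm them in the Command Line Interface Developer Guide:

olp pipeline version pause <pipeline-id> <pipeline-version-id>
olp pipeline version cancel <pipeline-id> <pipeline-version-id>
olp pipeline version resume <pipeline-id> <pipeline-version-id>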

Note: Resume a Paused Pipeline Version

A Paused Pipeline Version can be restarted using the Resume operation. For a Stream Pipeline Version, the job resumes from the saved state of the paused job. For a Batch Pipeline Version, the Pipeline Version state is changed to Scheduled and the next job is created based on the execution mode.
A Canceled Pipeline Version cannot be Resumed. Instead, it must be Activated to return to a Running or Scheduled state.

Delete a Pipeline

To delete a Pipeline, along with its set of Pipeline Versions and associated content, apply the delete operation to the Pipeline ID. Running or paused Pipeline Versions cannot be deleted, which means that all Pipeline Versions to be deleted must be in the Ready state. An error is returned if one or more of the Pipeline's Versions are either running or paused.

To delete the pipeline using the CLI, use the command:

olp pipeline delete <pipeline-id> [command options]

For more information, see the Command Line Interface Developer Guide.

To delete the pipeline using the API, use the delete operation. For detailed information, see the Pipeline API Reference.
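
For illustration only, a delete request against the REST API might look like the following. The base URL is a placeholder and the request path and authorization scheme are assumptions; the authoritative definitions are in the Pipeline API Reference:

curl -X DELETE "https://<pipeline-api-base-url>/pipelines/<pipeline-id>" \
    -H "Authorization: Bearer <access-token>"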

Upgrading a Pipeline

A running Pipeline Version can be replaced by another Pipeline Version. While this is most useful for Batch pipelines, it can also be done for a Stream pipeline. The purpose of an upgrade is to replace the existing Pipeline Version with a new Pipeline Version that is based on a different pipeline JAR file and/or configuration than the original.

Upgrade Sequence

  1. Create a new Pipeline Version using an existing or new Template.
  2. Execute the Upgrade operation from the Portal or the CLI.

As part of the Upgrade process, the old pipeline version is paused and the new pipeline version is activated. After a couple of minutes, the old pipeline version moves to the Ready state and the new pipeline version moves to the Scheduled state.
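
As a sketch, the CLI form of the Upgrade operation takes the Pipeline ID and the new Pipeline Version ID. The command name and arguments shown here are assumptions; confirm them in the Command Line Interface Developer Guide:

olp pipeline version upgrade <pipeline-id> <new-pipeline-version-id>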

See the image below to understand the process.

Sequence diagram of pipeline upgrade process execution.
Figure 6. Upgrade Sequence.

Updating a Pipeline

You can change the name, description, and contact email properties associated with your pipeline instance; all other properties cannot be updated. These changes are made using the CLI command pipeline update, where each updatable metadata item is available as an optional parameter. For more information, see the Command Line Interface Developer Guide.
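
For example (a sketch; the option names are illustrative, so see the Command Line Interface Developer Guide for the exact parameters):

olp pipeline update <pipeline-id> --name <new-name> --description <new-description> --email <contact-email>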

Update Sequence

  1. Cancel the running Pipeline Version. The job stops processing and the Pipeline Version transitions into the Ready state.
  2. Use the CLI to issue a pipeline update command and include the optional parameters that you wish to change.
  3. The pipeline instance's metadata is updated, and Pipeline Versions associated with that pipeline ID now run with the new metadata.

Group ID/Project ID

Whenever you create a Pipeline Version, you must assign it either to a group by specifying the Group ID or to a project by specifying the Project ID. Only users and applications that are part of that group or project can access the pipeline. To keep your pipelines private, restrict access to yourself or to a group of registered users, identified by the Group ID or Project ID. You must have a valid Group ID or Project ID to work with a pipeline. For more details, see the Teams and Permissions User Guide.

Caution: Stream Pipelines must use a unique application ID

Using the same group ID for a given combination of application ID, layer ID, and catalog ID in more than one Stream pipeline can lead to partial data consumption. To avoid this situation, create a different Group (HERE Account Group) for every Stream pipeline, thus ensuring that each pipeline uses a unique application ID. For more information, see the Stream Processing Best Practices article in the Pipeline Developer Guide.
