Best Practices for Highly-Available Pipelines
Critical applications and location services that collect and process large amounts of data need the underlying infrastructure to be reliable and responsive. Within the HERE platform, we provide various frameworks for processing location data, using Batch or Stream Pipelines. Failures and exceptions can occur in any production pipeline. To ensure that pipelines keep working reliably and fulfill the business requirements, we provide several options to create and deploy highly-available pipelines. This guide explains these options and how best to use them to manage specific failure and exception situations.
Several situations can prevent a pipeline from being highly-available. The sections below describe each situation and the solutions to manage it.
Stream pipeline failed and, upon restart, it reprocessed the data or some data wasn't processed
Enable Checkpointing for Stream Pipelines
For a failed Stream pipeline to continue processing, it's important to restore it from a previous running state. Flink Checkpointing is required to create such states. Checkpointing is disabled by default. When enabled, Flink takes consistent snapshots (checkpoints) of the pipeline (or, in Flink terms, the Job graph) at specified intervals. These checkpoints are used to restore the pipeline from the last running state when restarting the pipeline. For more information, see the Flink Checkpointing section of the Stream Pipeline Best Practices.
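As a minimal sketch of enabling checkpointing through the standard Flink API (assuming a Java pipeline; the interval and timeout values are illustrative, not platform defaults):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot (checkpoint) of the job graph every 60 seconds
        // (an illustrative interval, not a platform default).
        env.enableCheckpointing(60_000);

        // Exactly-once is Flink's default checkpointing mode; set explicitly for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Bound how long a checkpoint may take, and keep a pause between checkpoints.
        env.getCheckpointConfig().setCheckpointTimeout(120_000);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // ... define sources, transformations, and sinks here ...

        env.execute("checkpointed-stream-pipeline");
    }
}
```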
Stream pipeline failed and did not restart
Enable Restart Strategy for Stream Pipelines
By default, a failed Stream pipeline doesn't restart. To enable the pipeline to restart upon a failure, Flink supports the following restart strategies to control how jobs are restarted after a failure:
- Fixed Delay Restart Strategy: attempts to restart the job a given number of times. If the maximum number of attempts is exceeded, the job eventually fails.
- Failure Rate Restart Strategy: restarts the job after a failure, but when the failure rate (failures per time interval) is exceeded, the job eventually fails.
- No Restart Strategy (default): the job fails directly and no restart is attempted.
If Checkpointing is not enabled, the No Restart strategy is used. For more information, see Flink Restart Strategies.
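For illustration, a restart strategy can also be set programmatically with the standard Flink API. This sketch assumes a Java pipeline; the attempt counts and delays are illustrative values:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fixed delay: retry up to 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Alternatively, failure rate: tolerate at most 3 failures per 5-minute window,
        // waiting 10 seconds between restart attempts.
        // env.setRestartStrategy(RestartStrategies.failureRateRestart(
        //         3, Time.of(5, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)));

        // ... define and execute the job ...
    }
}
```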
Stream pipeline failed because a Task Manager failed
When a Flink Task Manager fails or becomes unreachable, the pipeline tries to get a replacement Task Manager. This can take several minutes, and during that time data processing is on hold. The pipeline can also fail if the retries exceed the time or frequency thresholds before a new Task Manager is spun up and assigned to it. To prevent these situations, it is recommended to provision extra Task Managers on standby when configuring the pipeline's resource cluster. The extra Task Managers don't process data during normal pipeline operation and are used only when an active Task Manager fails.
For example, if a pipeline has a parallelism of 8 and is assigned 9 Task Managers, only 8 Task Managers are used for processing the data. The 9th one remains on standby until one of the 8 active Task Managers fails. To take advantage of this setup, it is recommended to set a shorter time interval within the Flink restart strategy. When an active Task Manager fails, the pipeline quickly picks up the already available extra Task Manager and continues processing, instead of repeatedly retrying and waiting for a new Task Manager. For more information on Flink parallelism, see the Setting Parallelism and Cluster Configuration and Parallelism sections of the Stream Pipeline Best Practices. For more information on restart strategies, see the Stream pipeline failed and did not restart section above.
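A minimal sketch of the job side of this setup, assuming a Java pipeline and one task slot per Task Manager (the Task Manager count itself is chosen in the pipeline's cluster configuration, not in code):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StandbyTaskManagerExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Run the job with a parallelism of 8; if the cluster configuration assigns
        // 9 Task Managers (assuming one task slot each), the 9th stays on standby
        // and takes over when an active Task Manager fails.
        env.setParallelism(8);

        // ... define and execute the job ...
    }
}
```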
Note: Additional cost
Provisioning extra Task Managers incurs additional cost. For a critical pipeline, several extra Task Managers can be provisioned.
Stream pipeline failed because the Job Manager failed
Enable High-Availability Option for Stream Pipelines
Sometimes the Flink Job Manager of a Stream pipeline fails, and with it the running job and the pipeline fail. To prevent this situation, enable the High Availability option of the Flink Job Manager. High-Availability mode reduces the downtime during processing to near-zero and is beneficial for time-sensitive Stream data processing. To learn how to configure this option, see the Flink Job Manager High Availability section of the Stream Pipeline Best Practices.
Pipeline failed and is not reachable
Enable Multi-Region Setup for Pipelines
If a pipeline fails and is completely unreachable, it's possible that something serious has happened to the region where it was hosted. For both Batch and Stream pipelines that require minimum downtime in case of a failure of the primary region, the pipeline can be configured with the multi-region option, so that when the primary region fails, the pipeline is automatically transferred to the secondary region. Running pipelines are restarted in the secondary region; non-running pipelines are transferred as-is. When the primary region is available again, the pipelines are transferred back to it in the same manner.
For a Batch pipeline, the Spark History details are also transferred during a region failure.
If the primary region fails, an on-demand Batch pipeline in the Running state is automatically re-activated in the secondary region. However, a failed on-demand Batch pipeline needs to be manually re-activated.
Note: Additional cost for multi-region setup
The transfer of the pipeline state between two regions is billed as Pipeline I/O.
To enable multi-region setup, make sure that the following settings are configured:
- For a pipeline to work successfully in the secondary region, the input and output catalogs used by the pipeline must be available in the secondary region.
- For a Stream pipeline to successfully utilize the multi-region option, it's important to enable Checkpointing within the code to allow Flink to take a periodic Savepoint while running the pipeline in the primary region. When the primary region fails, the last available Savepoint is used to restart the pipeline in the secondary region. For more information on Checkpointing, see the Stream pipeline failed and, upon restart, it reprocessed the data or some data wasn't processed section above.
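As a sketch of that checkpointing prerequisite (assuming a Java pipeline; the interval is illustrative, and the actual region failover and Savepoint handling are managed by the platform), Flink's externalized checkpoints are one way to keep a restore point available after a job stops:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing must be enabled for any recovery snapshot to exist at all.
        env.enableCheckpointing(60_000); // illustrative interval

        // Retain the last completed checkpoint even if the job is cancelled,
        // so a restore point survives for a restart in another location.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... define and execute the job ...
    }
}
```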
Note: Multi-region setup is not available for China
An attempt to enable multi-region setup in the China environment fails with the following error message: "No secondary region configured for your org, multiRegionEnabled flag must not be set to 'true'".
Pipeline failed or restarted without a notification
If a pipeline is restarted by HERE Platform without any notification, the restart can happen at a time that is not convenient for you.
For a Stream pipeline, it is recommended to provide an email address in the email field. If the pipeline is restarted by HERE Platform for a reason such as a planned outage, an email notification is sent to that address, allowing you to take appropriate action at a time that is convenient for you.
To learn more, see Enable Notification and Recovery of Stream Pipelines.
Use Pipeline Monitoring and Alerts
Pipelines generate certain standard metrics that can be used to track their status over time. To monitor the status of a pipeline, a Pipeline Status dashboard is available in Grafana. Based on that dashboard, you can set up failure alerts to be sent via email or other channels. Furthermore, the Spark UI for Batch pipelines and the Flink Dashboard for Stream pipelines are available in the platform and are helpful in assessing a pipeline's performance for improvement. Finally, it's also possible to create custom metrics for a pipeline to monitor specific outputs and get alerted about them. For more information, see the Pipeline Monitoring section and the Logs, Monitoring, and Alerts User Guide.
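As an example of a custom metric (the function and the metric name recordsProcessed are illustrative assumptions, not platform conventions), Flink's metric API can expose a counter that can then be charted and alerted on in Grafana:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// A map function that exposes a custom counter for the number of records it
// has processed; attach it to a stream with .map(new CountingMapper()).
public class CountingMapper extends RichMapFunction<String, String> {
    private transient Counter recordsProcessed;

    @Override
    public void open(Configuration parameters) {
        // Register the counter with Flink's metric system when the task starts.
        recordsProcessed = getRuntimeContext()
                .getMetricGroup()
                .counter("recordsProcessed");
    }

    @Override
    public String map(String value) {
        recordsProcessed.inc(); // increment on every processed record
        return value;
    }
}
```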