Data Pooling

Overview

Data Pooling is designed to enrich a user's data by combining messages from multiple stream layers into a single output layer. The supported data formats are SDII Message, SDII Message List, Sensoris, and GeoJSON. Typically, a message's output format matches its input format, with one exception: a user can optionally transform an input SDII Message into a Sensoris message. This is configurable with the Wizard during setup.

Structure

Data Pooling operates as a stream pipeline that reads input data from two or more stream layers, combines (unions) them, and publishes the result to another stream layer.

Figure 1. Application Flow Diagram (with legend)

Input/Output Formats

SDII Message, SDII Message List, Sensoris, and GeoJSON are the supported input/output formats.

Prerequisites

  • This pipeline template expects the user to provide two or more stream catalog layers as input to the pipeline.
  • Make sure that your input catalogs are shared with the group that you are going to use for deployment of this pipeline template.
  • Confirm that your local credentials (~/.here/credentials.properties) are added to the same group.

Create Output Catalog and Layer

You may create the output catalog and layer using the Wizard Deployer (see Running on the HERE platform below). Alternatively, you can create the catalog and layer manually before running this pipeline template. Be mindful of the stream layer output settings; please refer to Stream Layer Settings and, for general guidance, Stream Layer Best Practices.

Create Output Catalog and Layer Manually

Modify or copy the file 'config/output-catalog.json' and update the following parameters:

  • "id": There's one at the catalog level and one at the layer level.
  • "ttl": The stream layer time to live, i.e., retention time in minutes. Please refer to Stream Layer Settings.
  • "dataOutThroughputMbps": Please refer to Stream Layer Settings.
  • "dataInThroughputMbps": Please refer to Stream Layer Settings.
  • "tags": There's one at the catalog level and one at the layer level.
  • "billingTags": There's one at the catalog level and one at the layer level.

Optionally, you can update the catalog and layer's name, summary, and description fields. Once modified, use the OLP CLI to create the catalog with its layer. Please refer to the OLP CLI documentation.

olp catalog create <catalog id> <catalog name> --json config/output-catalog.json
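
For example, with hypothetical values for the catalog ID and name:

olp catalog create my-output-catalog "My Output Catalog" --json config/output-catalog.json

Note the HRN of the created catalog; you will need it when granting permissions to your group.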

HERE platform pipelines are managed by group. Therefore, please grant access to your group ID so your pipeline can write to the output catalog. For instructions on how to manage groups, please refer to Manage Groups in the Teams and Permissions User Guide. For instructions on how to share your catalog, please refer to Share a Catalog.
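
One way to grant these permissions is with the OLP CLI; a minimal sketch, assuming your CLI version supports the permission grant command (the catalog HRN and group ID are placeholders):

olp catalog permission grant <catalog HRN> --group <group ID> --read --write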

Create Input Catalog and Layer

This example requires you to have a catalog with a stream layer for the input data. For instructions on how to create a catalog, please refer to Create a Catalog. For instructions on how to create a layer, please refer to Create a Layer.

HERE platform pipelines are managed by group. Therefore, please grant read access to your group ID so your pipeline can read from the input catalog. For instructions on how to manage groups, please refer to Manage Groups in the Teams and Permissions User Guide. For instructions on how to share your catalog, please refer to Share a Catalog.

For instructions on how to publish your input data into a stream layer, please refer to the platform documentation. The easiest option is to use the CLI, as sketched below.
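
As an example, a single message file could be published with the OLP CLI's stream put command; the catalog HRN, layer ID, and file path below are placeholders, and the exact flags may vary by CLI version:

olp catalog layer stream put <catalog HRN> <layer ID> --input /path/to/message.json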

Execution

Manually Configure Your Input Catalog Layers

Open the layers.properties file in the config directory and enter your input catalogs and layers as shown in the file's comments section.
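
Since the exact property names are defined by the comments in the shipped file, the following is only a hypothetical illustration of the general idea (two input layers, each identified by a catalog HRN and a layer ID):

# Hypothetical property names; the comments in config/layers.properties define the real format.
input1.catalog = hrn:here:data::myrealm:first-input-catalog
input1.layer = first-input-layer
input2.catalog = hrn:here:data::myrealm:second-input-catalog
input2.layer = second-input-layer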

Running on the HERE platform

In order to deploy and run this pipeline template, you will need the Wizard Deployer. The Wizard Deployer runs interactively, asking questions about the application and expecting you to provide the needed answers. Follow the Wizard's documentation to set up the needed parameters, then follow these steps:

  1. Execute the script as ./wizard.sh
  2. Follow the prompts and provide needed answers

You can use your existing output layer or let the Wizard create a new catalog/layer for you. If using an existing catalog, make sure it is shared with the GROUP_ID that will be used for this deployment.

PLEASE NOTE: In order to process your data in a reasonable time frame, you may need to tune the number of cores used for processing.

You can start with a default configuration of one core and use the Flink Dashboard to monitor memory utilization and data distribution in your running pipeline. You should also monitor the Splunk logs for errors and exceptions; for instance, an OutOfMemoryError most likely indicates that more processing power is needed for the submitted amount of input data. In the current version, when the Wizard deploys this pipeline template, each worker has 1 CPU, 7 GB of RAM, and 8 GB of disk space.

Verification

In the Platform Portal, select the Pipelines tab, where you can see your pipeline deployed and running. Once your pipeline is running and your data is published, you can find your output catalog under the Data tab or query/retrieve your data programmatically, for example with the OLP CLI as sketched below.
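
For example, assuming your OLP CLI version supports reading from stream layers, you could sample messages from the output layer as follows (catalog HRN and layer ID are placeholders):

olp catalog layer stream get <catalog HRN> <layer ID>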

Cost Estimation

Executing this pipeline template will incur the following costs:

Storage-Stream

Cost will depend on the settings of your input Stream layer.

Storage-Blob

Cost will depend on the amount of data published to the output Stream layer during execution.

Data Transfer IO

Cost will depend on the amount of:

  • input data published to a Stream layer (before execution of this pipeline template)
  • the same input data retrieved from the Stream layer (during pipeline template execution)
  • data written out to a Stream layer

Metadata

N/A

Compute Core and Compute RAM

Cost will depend on the amount of data that needs to be processed. More data will require more processing power and will take longer to finish.

Log Search IO

Cost will depend on the log level set for the execution of this pipeline template. To minimize this cost, you can set the log level to WARN.
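
If your OLP CLI version provides pipeline log-level commands, setting the root level might look like the sketch below (pipeline and version IDs are placeholders; verify the exact syntax against your CLI's help output):

olp pipeline version log level set <pipeline ID> <pipeline version ID> --root warn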

Support

If you need support with this pipeline template, please contact us.
