Archiver

Overview

Archiver can be used to efficiently archive streamed messages and easily query them later for analysis or pipeline development. Archiver processes a stream of sensor data messages by indexing each message based on its timestamp and geographical location, then storing it in an Index layer. Messages can be indexed based on:

  • Geographical location:
    • First or last reported location of the recorded path
    • Location of first or last observed event
  • Time:
    • Timestamp of first or last reported location
    • Timestamp of first or last observed event

During deployment, the user must choose an archiving strategy. Available options are:

  • Start of the trip
  • End of the trip
  • First event
  • Last event

The indexing attributes used by Archiver are timeWindow and tileId. For example, if the user chooses the "First event" strategy, the timestamp of the first observed event is used as the value of the timeWindow attribute, and an interpolated coordinate of the same event is used to calculate a HERE tile ID, which becomes the value of the tileId attribute.
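
To make the timeWindow value concrete, here is a minimal sketch (an illustration, not the Data Archiving Library's actual code). It assumes the layer's timewindow attribute has a one-hour duration and that the stored value is the start of the window containing the timestamp; the tileId value is derived analogously from the interpolated coordinate using HERE tile partitioning at the layer's configured zoom level.

public class TimeWindowExample {
    // Truncate an event timestamp (ms since epoch) to the start of its window.
    static long toTimeWindow(long eventTimestampMs, long windowDurationMs) {
        return eventTimestampMs - (eventTimestampMs % windowDurationMs);
    }

    public static void main(String[] args) {
        long oneHourMs = 3_600_000L;
        long eventTs = 1_577_836_860_000L; // 2020-01-01T00:01:00Z
        // Prints 1577836800000, i.e. 2020-01-01T00:00:00Z
        System.out.println(toTimeWindow(eventTs, oneHourMs));
    }
}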

NOTE: Most recorded drives span multiple tiles, but the entire message is stored under a single tile ID (start or end, depending on the selected archiving strategy). The same rule applies to the time attribute: no matter how long the trip took, it still falls into a single time slice. Users should keep this in mind when querying for their data.

Supported data types are Sensoris and SDII. Data may be stored in plain Protobuf or Parquet format. Apache Parquet has the benefits of improved query performance and more efficient storage utilization.
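
To see the practical difference, here is a minimal sketch of inspecting archived Parquet data with Spark once the partitions have been downloaded locally (for example, with the CLI command shown in the "Verification" section). The path is a placeholder; this is illustrative and not part of the template itself.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadArchivedParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-archived-parquet")
            .master("local[*]")
            .getOrCreate();

        // Directory containing the downloaded Parquet partitions.
        Dataset<Row> messages = spark.read().parquet("/path/to/partitions");

        messages.printSchema();
        System.out.println("Archived messages: " + messages.count());
        spark.stop();
    }
}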

Archiver considers several factors for archiving a stream of messages, such as:

  • Number of messages per hour
  • Message size
  • Size of geographic area
  • Distribution (density) of data

Since these factors affect the archiving attributes and are used to calculate the number of cores needed to archive incoming data optimally, it is important that the user provides accurate answers to the corresponding questions; a rough throughput illustration follows.
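
As a back-of-the-envelope illustration of why these answers matter (the numbers below are assumptions, not Archiver's actual sizing formula, which the Wizard computes for you):

public class IngestEstimate {
    public static void main(String[] args) {
        long messagesPerHour = 1_000_000L; // assumed answer to the Wizard
        long avgMessageBytes = 100_000L;   // assumed average message size
        double mbPerSecond =
            (messagesPerHour * avgMessageBytes) / 3600.0 / 1_000_000.0;
        // Prints "~27.8 MB/s of incoming data" for the values above.
        System.out.printf("~%.1f MB/s of incoming data%n", mbPerSecond);
    }
}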

Structure

Figure 1. Application Flow Diagram

Archiver operates as a Stream pipeline that reads data from a Stream input layer, indexes messages based on the index criteria and time window chosen by the user during deployment, and stores them in an Index layer. It uses the Data Archiving Library under the hood.

Prerequisites

  • The user should provide an existing Stream layer with Sensoris or SDII messages.
  • This pipeline template writes output to an Index layer of the catalog. You can use your existing output layer or let the Wizard create a new catalog/layer for you. Please refer to the "Execution" section below for further details.
  • If you are planning to use an existing catalog/layer, please make sure that your output catalog is shared with the same GROUP that will be used to deploy this pipeline template.
  • Confirm that your local credentials (~/.here/credentials.properties) are added to the same group; a sample credentials file layout is shown after this list.
  • Confirm that the same credentials have access to the input catalog.
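
For reference, a credentials.properties file typically has the following layout (placeholder values shown; use the actual file downloaded from the platform for your app):

here.user.id = <your-user-id>
here.client.id = <your-client-id>
here.access.key.id = <your-access-key-id>
here.access.key.secret = <your-access-key-secret>
here.token.endpoint.url = https://account.api.here.com/oauth2/token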

Execution

Running on the HERE platform

To deploy and run Archiver, you will need the Wizard Deployer. The Wizard runs interactively, asking questions about the application and expecting the user to provide the needed answers.

Follow the instructions to set up the Wizard Deployer and configure needed parameters beforehand. Then follow these steps:

  1. Execute the script as ./wizard.sh.
  2. Follow the prompts and provide needed answers.

During deployment, users will have to answer questions about the frequency and average size of the input messages, the density of the data distribution, and the size of the area covered by the data. Accurate answers to these questions help calculate the optimal tile zoom level and the number of machines needed to archive incoming data (see the HERE Tile Partitioning section for more information on zoom levels).

You can use your existing output layer or let the Wizard Deployer create a new catalog/layer for you. If using an existing catalog, make sure it is shared with the GROUP_ID used for this deployment and that the Index layer configuration matches the one in the output-catalog.json file; an illustrative configuration is shown below.
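
For illustration, an Index layer configured with these two attributes might look roughly like the following. The field names follow the HERE Data API index layer schema; the duration, zoom level, and content type are placeholders to adapt to your deployment:

{
  "id": "layer1",
  "name": "Archived sensor data",
  "layerType": "index",
  "contentType": "application/x-parquet",
  "indexProperties": {
    "ttl": "unlimited",
    "indexDefinitions": [
      { "name": "timeWindow", "type": "timewindow", "duration": 3600000 },
      { "name": "tileId", "type": "heretile", "zoomLevel": 8 }
    ]
  }
}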

Verification

In the Platform Portal, select the Pipelines tab, where you will see your pipeline deployed and running. The Data Archiving Library writes out indexed data in chunks, depending on the aggregationWindowSeconds value provided by the user during deployment. For example, if it was set to 1800 seconds (30 minutes), the first batch of data is stored 30 minutes after the pipeline starts running.

After your data is archived in the Index layer, you can query and retrieve it using the index attributes written by Archiver: timeWindow and tileId. The easiest way to verify that your data is being archived as expected is to query it with the following CLI command:

olp catalog layer partition get hrn:here:data:::catalog1 layer1 \
   --filter "timeWindow>1577836860000;timeWindow<1577836970000;tileId==23472834" \
   --output <path_for_partitions>

Please make sure you use a tile ID of the corresponding zoom level. To see which zoom level is being used to archive your data, check the configuration of your output Index layer, as shown below.
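
For example, assuming your OLP CLI version provides the olp catalog layer show command, you can print the layer configuration, including its index definitions and zoom level, with:

olp catalog layer show hrn:here:data:::catalog1 layer1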

Support

If you need support with this pipeline template, please contact us.
