Archiver efficiently archives streamed messages so they can easily be queried later for analysis or pipeline development. It processes a stream of sensor data messages by indexing each message based on its timestamp and geographical location, then storing it in an Index layer. Messages can be indexed based on:
Geographical location:
First or last reported location of the recorded path
Location of first or last observed event
Time:
Timestamp of first or last reported location
Timestamp of first or last observed event
During deployment, the user must choose an archiving strategy. The available options are:
Start of the trip
End of the trip
First event
Last event
The indexing attributes used by Archiver are timeWindow and tileId. For example, if the user chooses the "First event" strategy, the timestamp of the first observed event is used as the value of the timeWindow attribute. An interpolated coordinate of the same event is used to calculate a HERE Tile id, which becomes the value of the tileId attribute.
NOTE: Most recorded drives span multiple tiles, but each message is stored under a single tile id (start or end, depending on the selected archiving strategy). The same rule applies to the time attribute: no matter how long the trip took, it falls into a single time slice. Keep this in mind when querying your data.
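To make the tileId attribute concrete: a HERE Tile id identifies a single quadtree tile at the configured zoom level, which is why one id can pin an entire message to one tile. The snippet below is a sketch, assuming the common convention that a HERE Tile id is the tile's base-4 quadkey prefixed with 1; see the HERE Tile Partitioning documentation for the authoritative definition. The quadkey value is hypothetical.

```bash
# Sketch: convert a zoom-level-8 quadkey to a HERE Tile id, assuming
# the "prefix the base-4 quadkey with 1" convention.
quadkey="12201203"                 # hypothetical quadkey at zoom level 8
echo "ibase=4; 1${quadkey}" | bc   # prints 92259
```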
Supported data types are Sensoris and SDII. Data may be stored in plain Protobuf or in Parquet format. Apache Parquet offers improved query performance and more efficient storage utilization.
Archiver considers several factors when archiving a stream of messages, such as:
Number of messages per hour
Message size
Size of geographic area
Distribution (density) of data
Since the factors above affect the archiving attributes and are used to calculate the number of cores needed to archive incoming data optimally, it is important that the user answers the corresponding questions accurately.
Structure
Figure 1. Application Flow Diagram
Archiver operates as a Stream pipeline: it reads data from a Stream input layer, indexes messages based on the index criteria and time window chosen by the user during deployment, and stores them in an Index layer. It uses the Data Archiving Library under the hood.
Prerequisites
The user should provide an existing Stream layer with Sensoris or SDII messages.
This pipeline template writes output to an Index layer of the catalog. You can use your existing output layer or let the Wizard create a new catalog/layer for you. Please refer to the "Execution" section below for further details.
If you are planning to use an existing catalog/layer, please make sure that your output catalog is shared with the same GROUP that will be used to deploy this pipeline template.
Confirm that your local credentials (~/.here/credentials.properties) are added to the same group.
Confirm that the same credentials have access to the input catalog.
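As a quick sanity check, you can confirm that your credentials can read both catalogs with the OLP CLI. This is a sketch: the catalog HRNs below are placeholders for your input and output catalogs.

```bash
# If these commands succeed, the configured credentials can read the catalogs.
olp catalog show hrn:here:data::myrealm:my-input-catalog
olp catalog show hrn:here:data::myrealm:my-output-catalog
```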
Execution
Running on the HERE platform
To deploy and run Archiver, you will need the Wizard Deployer. The Wizard runs interactively, asking questions about the application and expecting the user to provide the needed answers.
Follow the instructions to set up the Wizard Deployer and configure needed parameters beforehand. Then follow these steps:
Execute the script as ./wizard.sh.
Follow the prompts and provide needed answers.
During deployment, you will be asked about the frequency and average size of the input messages, the density of the data distribution, and the size of the area covered by the data. Accurate answers to these questions help calculate the optimal tile zoom level and the number of machines needed to archive incoming data (see the HERE Tile Partitioning section for more information on zoom levels). An illustrative session is sketched below.
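The actual prompts and their wording depend on your version of the Wizard Deployer; the answers shown here are hypothetical.

```bash
./wizard.sh
# Paraphrased prompts with hypothetical answers:
#   Number of messages per hour?          -> 100000
#   Average message size (KB)?            -> 200
#   Size of the covered geographic area?  -> one country
#   Density of the data distribution?     -> dense (urban)
```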
You can use your existing output layer or let the Wizard Deployer create a new catalog/layer for you. If using an existing catalog, make sure it is shared with the GROUP_ID used for this deployment and that the Index layer configuration matches the one in the output-catalog.json file.
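For orientation, the sketch below shows what the Index layer portion of such a configuration might look like. The attribute names (timeWindow, tileId) follow this template, but the file name, layer id, content type, duration, and zoom level are illustrative assumptions; adapt them to your deployment.

```bash
# Sketch of an Index layer definition (all values are illustrative assumptions).
cat > index-layer-snippet.json <<'EOF'
{
  "id": "archived-sensor-data",
  "layerType": "index",
  "contentType": "application/x-parquet",
  "indexProperties": {
    "indexDefinitions": [
      { "name": "timeWindow", "type": "timewindow", "duration": 3600000 },
      { "name": "tileId", "type": "heretile", "zoomLevel": 8 }
    ]
  }
}
EOF
```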
Verification
In the Platform Portal, select the Pipelines tab, where you will see your pipeline deployed and running. The Data Archiving Library writes out indexed data in chunks, depending on the aggregationWindowSeconds value provided by the user during deployment. For example, if it is set to 30 minutes (1800 seconds), the first batch of data is stored 30 minutes after the pipeline starts running.
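For reference, a 30-minute window corresponds to the value below (a sketch; the parameter is provided as an answer during Wizard deployment, not edited in a file):

```bash
# 30 minutes expressed in seconds.
aggregationWindowSeconds=1800
```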
After your data is archived in the Index layer, you can query and retrieve it for verification. The index attributes used by Archiver are timeWindow and tileId, and the easiest way to verify that your data is being archived as expected is to query it with the OLP CLI.
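The sketch below shows what such a query might look like. The catalog HRN, layer id, tile id, and timeWindow timestamp are placeholders, and the exact OLP CLI syntax for filtering index layers may differ between CLI versions, so check the CLI help for your installation.

```bash
# List archived messages indexed under one tile id and one time slice.
# HRN, layer id, and filter values are placeholders.
olp catalog layer partition list hrn:here:data::myrealm:my-output-catalog \
    archived-sensor-data \
    --filter "tileId==92259;timeWindow==1554724800000"
```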
Make sure you use a tile id of the matching zoom level. To see which zoom level is used to archive your data, check the configuration of your output Index layer.
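One way to inspect the layer configuration is via the OLP CLI (a sketch; the HRN and layer id are placeholders):

```bash
# Shows the layer configuration, including the heretile zoom level.
olp catalog layer show hrn:here:data::myrealm:my-output-catalog archived-sensor-data
```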
Support
If you need support with this pipeline template, please contact us.