Real-Time Anonymizer

Overview

Location data can reveal sensitive information about individuals, which can lead to privacy breaches. Collected location data therefore needs to be anonymized before it is shared, balancing the level of user anonymity against the utility of the anonymized data.

The Real-Time Anonymizer provides a set of algorithms for performing use case specific anonymization of real-time location data, including the following features:

  • Configuring the anonymization strategy (method and accompanying parameters)
  • Loading real-time location data
  • Performing anonymization of location data with configured anonymization strategy
  • Outputting anonymized real-time location data

The Real-Time Anonymizer anonymizes real-time SENSORIS or SDII location data (from a HERE Platform stream layer) for the traffic use case, according to the configured anonymization strategy. The anonymized output is published as SENSORIS data messages to a HERE Platform stream layer.

The Real-Time Anonymizer User Guide provides more information on how to set up and configure the Real-Time Anonymizer pipeline.

Structure

Real-Time Anonymizer operates as a stream pipeline that reads input data from one stream layer, anonymizes the data according to the configured anonymization strategy, and publishes the anonymized data to another stream layer.

Application Flow Diagram

Figure 1. Application Flow Diagram

Input/Output Data Formats

Supported input/output data formats:

  • SENSORIS
  • SDII

Prerequisites

  • This pipeline template expects the user to provide a catalog with one stream layer as the input to the pipeline. This stream layer should contain SENSORIS or SDII data messages.
  • An output data catalog and stream layer can also be created and provided to this pipeline; otherwise, a new catalog and stream layer will be automatically created by the Wizard. A manually created stream layer should be configured for SENSORIS or SDII data messages. [optional]
  • The anonymization strategy should be defined in the config/anonymization-pipeline.config file, with a suitable strategy configured so that the anonymized data achieves the required balance between user anonymity and data utility.
  • Make sure that your input and output catalogs are shared with the GROUP that you are going to use for deployment of this pipeline template.
  • Confirm that your local credentials (~/.here/credentials.properties) are added to the same GROUP; a sketch of this file's typical contents appears below.
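
The credentials.properties file is downloaded from the platform when you create app credentials. Its contents typically look like the following (an illustrative sketch with placeholder values; use the file you downloaded rather than authoring it by hand):

    here.user.id = <your user ID>
    here.client.id = <your client ID>
    here.access.key.id = <your access key ID>
    here.access.key.secret = <your access key secret>
    here.token.endpoint.url = <token endpoint URL>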

Create Input/Output Catalogs and Layers

The Real-Time Anonymizer pipeline works with two separate catalogs and layers:

  1. Input catalog and stream layer
  2. Output catalog and stream layer [optional]

If an existing catalog and stream layer (for the anonymized data to be output to) is not provided to the Real-Time Anonymizer pipeline, then the Wizard Deployer will create them automatically.

Use Catalog

A newly created or existing catalog can be used as input and output [optional] for the pipeline. For instructions on how to manually create a new catalog, please refer to Create a Catalog.
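
For example, a new catalog can be created with the OLP CLI (a sketch; see the OLP CLI documentation for the full set of options):

    olp catalog create <catalog ID> "<catalog name>" \
        --summary "<short summary>" --description "<longer description>"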

HERE platform pipelines are managed by groups. To enable your pipeline to read from and write to your catalogs, you need to grant access to your group ID: read access for the input catalog, and read and write access for the output catalog. For instructions on how to manage groups, see Manage Groups in the Teams and Permissions User Guide; for details on how to share your catalog, see Sharing a Catalog.
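
For example, read and write access can be granted to a group with the OLP CLI (a sketch; verify the exact syntax against the OLP CLI documentation for your version):

    olp catalog permission grant <catalog HRN> --group <GROUP_ID> --read --write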

Create Stream Layer

A new or existing stream layer (within the required catalog) is needed as the input layer and, optionally, as the output layer for the pipeline. These stream layers need to be configured for SENSORIS or SDII data messages. When creating a new stream layer, be mindful of the Stream Layer Settings.
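
For example, a stream layer can be added to an existing catalog with the OLP CLI (a sketch; the content type shown is an assumption based on the protobuf encoding of SENSORIS and SDII, so verify it against the OLP CLI documentation before use):

    olp catalog layer add <catalog HRN> <layer ID> "<layer name>" --stream \
        --summary "<short summary>" --description "<longer description>" \
        --content-type application/x-protobuf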

Create new Output Catalog and Layer using Wizard Deployer

If an existing output catalog and stream layer have not been manually created for the Real-Time Anonymizer pipeline and specified for the Wizard Deployer, the Wizard will create a new catalog and layer. This new catalog and stream layer are created according to the configuration specified in config/output-catalog.json. In this file, a detailed configuration can be specified, including the following parameters:

  • "id": There's one at the catalog level and one at the layer level.
  • "ttl": The stream layer time to live, that is, retention time in minutes. For more information, see Stream Layer Settings.
  • "dataOutThroughputMbps": For more information, see Stream Layer Settings.
  • "dataInThroughputMbps": For more information, see Stream Layer Settings.
  • "tags": There is one tag at the catalog level and one at the layer level.
  • "billingTags": There is one billing tag at the catalog level and one at the layer level.

Real-Time Anonymizer Configuration

The user needs to set the anonymization strategy parameters in the config/anonymization-pipeline.config file, which the pipeline uses at runtime. For more details on the anonymization strategy configuration options, see the Real-Time Anonymizer User Guide.

Note: Anonymization Strategy Values

Carefully choose the anonymization strategy values and review the output data to ensure that you have achieved an acceptable level of user anonymity.

This config file has two parts:

  • use case parameters
  • anonymization strategy parameters

The full anonymization-pipeline.config file is shown below. All required fields must be set for the configuration to be accepted.

# Use Case Parameters:

# Use case type to be used in anonymization algorithm [Required]
pipeline.config.useCase.type=TrafficInformation
# Data Type of input and output data for anonymization [Required]
pipeline.config.useCase.dataType=NearRealTime
# Data Format of input and output data. Supported data formats are `SENSORIS` and `SDII` [Required]
pipeline.config.useCase.dataFormat=
# Minimum number of points required in input trajectory chunk, for anonymization 
# to be applied. Value must be at least 2. Default value is "2" [Optional]
pipeline.config.useCase.minInputPointsCount=
# Minimum number of points required in output trajectory chunk after 
# anonymization is applied. Value must be at least 2. Default value is "2" [Optional]
pipeline.config.useCase.minOutputPointsCount=
# Retention time defines how long information about a trajectory is preserved after 
# the trajectory's chunk is anonymized. Default value is 10 minutes. [Optional]
pipeline.config.useCase.retentionTimeMinutes=

# Anonymization Strategy Parameters:

# Type of anonymization algorithm [Required]
pipeline.config.anonymization.type=SplitAndGap
# Unit of measurement for "subTrajectorySize" (only "seconds" supported) [Required]
pipeline.config.anonymization.subTrajectorySize.unit=seconds
# Min size of anonymized trajectories [Required]
pipeline.config.anonymization.subTrajectorySize.min=
# Max size of anonymized trajectories [Required]
pipeline.config.anonymization.subTrajectorySize.max=
# Unit of measurement for "gapSize" values "min" and "max" (only "seconds" supported) [Required]
pipeline.config.anonymization.gapSize.unit=seconds
# Min size of gaps between anonymized trajectories [Required]
pipeline.config.anonymization.gapSize.min=
# Max size of gaps between anonymized trajectories [Required]
pipeline.config.anonymization.gapSize.max=
# Unit of measurement for "skipFirst.time" values "min" and "max" (only "seconds" supported) [Optional]
pipeline.config.anonymization.skipFirst.time.unit=seconds
# Min amount of data to be removed at the start of the raw trajectory [Optional]
pipeline.config.anonymization.skipFirst.time.min=
# Max amount of data to be removed at the start of the raw trajectory [Optional]
pipeline.config.anonymization.skipFirst.time.max=
# Unit of measurement for "skipFirst.speed" values "min" and "max" (only "km/h" supported) [Optional]
pipeline.config.anonymization.skipFirst.speed.unit=km/h
# At the start of raw trajectory, all data with speed value missing or less than
# configured value will be removed [Optional]
pipeline.config.anonymization.skipFirst.speed.min=
# At the start of raw trajectory, all data with speed value missing or less than
# configured value will be removed [Optional]
pipeline.config.anonymization.skipFirst.speed.max=
# Unit of measurement for "skipFirst.proximity" values "min" and "max" (only "meters" supported) [Optional]
pipeline.config.anonymization.skipFirst.proximity.unit=meters
# At the start of raw trajectory, all data within the configured proximity
# (distance) of the trajectory start will be removed [Optional]
pipeline.config.anonymization.skipFirst.proximity.min=
# At the start of raw trajectory, all data within the configured proximity
# (distance) of the trajectory start will be removed [Optional]
pipeline.config.anonymization.skipFirst.proximity.max=
# 'skipUntil' condition is required when multiple 'skipFirst' conditions are provided (conditions include: proximity, speed or time). 
# Operators supported include 'and' and 'or'. Example 'skipUntil' = '(proximity or speed) and time'. 
# Optional for single conditions. [Optional]
pipeline.config.anonymization.skipFirst.skipUntil=
# Unit of measurement for "samplingRate" values "min" and "max" (only "seconds" supported) [Optional]
pipeline.config.anonymization.samplingRate.unit=seconds
# Min time between adjacent points in anonymized trajectories [Optional]
pipeline.config.anonymization.samplingRate.min=
# Max time between adjacent points in anonymized trajectories [Optional]
pipeline.config.anonymization.samplingRate.max=
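
For reference, a minimal filled-in configuration using only the required fields might look like the following (the numeric values are illustrative assumptions, not recommendations; choose values that achieve your required balance between user anonymity and data utility):

    pipeline.config.useCase.type=TrafficInformation
    pipeline.config.useCase.dataType=NearRealTime
    pipeline.config.useCase.dataFormat=SENSORIS
    pipeline.config.anonymization.type=SplitAndGap
    pipeline.config.anonymization.subTrajectorySize.unit=seconds
    pipeline.config.anonymization.subTrajectorySize.min=120
    pipeline.config.anonymization.subTrajectorySize.max=300
    pipeline.config.anonymization.gapSize.unit=seconds
    pipeline.config.anonymization.gapSize.min=60
    pipeline.config.anonymization.gapSize.max=120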

Execution

Running on the HERE platform

To deploy and run the Real-Time Anonymizer, you can use the Wizard Deployer. The Wizard executes interactively, asking questions about the application and expecting the user to provide the requested information.

Follow the instructions to set up the Wizard Deployer and configure needed parameters beforehand. Then follow these steps:

  1. Execute the script as ./wizard.sh.
  2. Follow the prompts and provide needed answers.

Note: Using existing Output Catalog / Layer

The Wizard Deployer will ask for an output catalog HRN and layer ID. To use an existing catalog and layer, provide their HRN and ID; otherwise, the Wizard will create a new catalog and layer using the HRN and layer ID you enter.

You can use your existing output layer or let the Wizard create a new catalog and layer for you. If using an existing catalog, make sure it is shared with the GROUP_ID that will be used for this deployment.

Note: Pipeline Core Configuration

To process your data in a reasonable time frame, you may need to tune the number of cores used for processing.

You can start with a default configuration of one core and use the Flink Dashboard to monitor memory utilization and data distribution in your running pipeline. You should also monitor the Splunk logs for errors and exceptions; for instance, an OutOfMemoryError most likely indicates that more processing power is needed for the submitted amount of input data.

Publishing Test Input Data

It is important to test that the Real-Time Anonymizer pipeline has been set up correctly. To test the pipeline, input data (SENSORIS or SDII data messages) should be published to the input stream layer. There are a number of ways to publish data messages to a stream layer.

The recommended approach is to use the OLP CLI, as sketched below.
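
For example (a sketch; flags may vary with your OLP CLI version):

    olp catalog layer stream put <input catalog HRN> <input layer ID> \
        --input <path to a file containing SENSORIS or SDII data messages>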

Verification

In the Platform Portal, select the Pipelines tab to see your pipeline deployed and running. After your pipeline starts and your first data is published, you can find your output catalog under the Data tab, or query and retrieve your data programmatically.

The quickest way is to use the OLP CLI with the following command:

    olp catalog layer stream get <output catalog HRN> <output layer ID>

Cost Estimation

Executing this pipeline template will incur the following costs:

Storage-Stream

Cost will depend on the settings of your input stream layer.

Storage-Blob

Cost will depend on the amount of data that is published to a stream layer as output from the execution.

Data Transfer IO

Cost will depend on the amount of:

  • input data published to a stream layer (published before execution of this pipeline template)
  • the same input data retrieved from the stream layer (during pipeline template execution)
  • data written out to a stream layer

Metadata

N/A

Compute Core and Compute RAM

Cost will depend on the amount of data that needs to be processed. More data will require more processing power and will take longer to finish.

Log Search IO

Cost will depend on the log level set for the execution of this pipeline template. To minimize this cost, set the log level to WARN.

Support

If you need support with this pipeline template, please contact us.
