Location data can reveal sensitive information about individuals, which can lead to privacy breaches. Collected location data must be anonymized before sharing, balancing the level of user anonymity against the utility of the anonymized data.
The Real-Time Anonymizer provides a set of algorithms for performing use-case-specific anonymization of real-time location data.
The Real-Time Anonymizer anonymizes real-time SENSORIS or SDII location data (from a HERE platform stream layer) for the traffic use case, according to the configured anonymization strategy. The anonymized output is published as SENSORIS data messages to a HERE platform stream layer.
The Real-Time Anonymizer User Guide provides more information on how to set up and configure the Real-Time Anonymizer pipeline.
The Real-Time Anonymizer operates as a stream pipeline that reads input data from one stream layer, anonymizes the data according to the configured anonymization strategy, and publishes the anonymized data to another stream layer.
Supported input/output data formats:
config/anonymization-pipeline.config file, with a suitable strategy configured, so that the anonymized data achieves the required balance between user anonymity and data utility.
The Real-Time Anonymizer pipeline works with two separate catalogs and layers:
If an existing catalog and stream layer are not provided to the Real-Time Anonymizer pipeline (for the anonymized data to be output to), the Wizard Deployer will create them automatically.
A newly created or existing catalog can be used as the input and [optional] output for the pipeline. For instructions on how to manually create a new catalog, refer to Create a Catalog.
HERE platform pipelines are managed by groups. To enable your pipeline to read from and write to your catalogs, you need to grant access (read access for the input catalog and write access for the output catalog) to your group ID. For instructions on how to manage groups, see Manage Groups in the Teams and Permissions User Guide; for details on how to share your catalog, see Sharing a Catalog.
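As an illustrative sketch only (the HRNs and group ID are placeholders; verify the exact command and flags against the OLP CLI documentation), granting access to a group could look like:

```shell
# Grant the pipeline's group read access to the input catalog:
olp catalog permission grant hrn:here:data::olp-here:my-input-catalog \
    --group MY-GROUP-ID --read

# Grant the group read and write access to the output catalog:
olp catalog permission grant hrn:here:data::olp-here:my-output-catalog \
    --group MY-GROUP-ID --read --write
```

Both catalogs must be shared with the same group that the pipeline will be deployed under.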
New or existing stream layers (within the required catalogs) are needed as the input and [optional] output layers for the pipeline. These stream layers must be configured for SENSORIS or SDII data messages. When creating a new stream layer, be mindful of the Stream Layer Settings.
If an existing output catalog and stream layer have not been manually created for the Real-Time Anonymizer pipeline and specified for the Wizard Deployer, the Wizard will create a new catalog and layer according to the configuration specified in config/output-catalog.json. In this file, a detailed configuration can be specified, including the following parameters:
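For illustration only — all identifiers below are placeholders, and the exact schema accepted by the Wizard Deployer should be checked against the Real-Time Anonymizer User Guide — a minimal config/output-catalog.json could look like:

```json
{
  "id": "my-anonymized-output",
  "name": "My Anonymized Output",
  "summary": "Anonymized output of the Real-Time Anonymizer",
  "description": "Catalog holding anonymized SENSORIS location data",
  "layers": [
    {
      "id": "anonymized-stream",
      "name": "Anonymized Stream",
      "summary": "Anonymized SENSORIS data messages",
      "description": "Output stream layer for anonymized data",
      "layerType": "stream",
      "contentType": "application/x-protobuf"
    }
  ]
}
```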
The user needs to set the anonymization strategy parameters in the config/anonymization-pipeline.config file, which the pipeline uses at runtime. For more details on the anonymization strategy configuration options, see the Real-Time Anonymizer User Guide.
Carefully choose the anonymization strategy values and review the output data to ensure that you have achieved an acceptable level of user anonymity.
This config file has two parts:
An example anonymization-pipeline.config file is shown below. All required fields must be set before the configuration will be accepted.
```properties
# Use Case Parameters:

# Use case type to be used in anonymization algorithm [Required]
pipeline.config.useCase.type=TrafficInformation

# Data type of input and output data for anonymization [Required]
pipeline.config.useCase.dataType=NearRealTime

# Data format of input and output data. Supported data formats are `SENSORIS` and `SDII` [Required]
pipeline.config.useCase.dataFormat=

# Minimum number of points required in an input trajectory chunk for anonymization
# to be applied. Value must be greater than 2. Default value is "2" [Optional]
pipeline.config.useCase.minInputPointsCount=

# Minimum number of points required in an output trajectory chunk for anonymization
# to be applied. Value must be greater than 2. Default value is "2" [Optional]
pipeline.config.useCase.minOutputPointsCount=

# Retention time defines how long information about a trajectory is preserved after
# the trajectory's chunk is anonymized. Default value is 10 minutes. [Optional]
pipeline.config.useCase.retentionTimeMinutes=

# Anonymization Strategy Parameters:

# Type of anonymization algorithm [Required]
pipeline.config.anonymization.type=SplitAndGap

# Unit of measurement for "subTrajectorySize" (only "seconds" supported) [Required]
pipeline.config.anonymization.subTrajectorySize.unit=seconds

# Min size of anonymized trajectories [Required]
pipeline.config.anonymization.subTrajectorySize.min=

# Max size of anonymized trajectories [Required]
pipeline.config.anonymization.subTrajectorySize.max=

# Unit of measurement for "gapSize" values "min" and "max" (only "seconds" supported) [Required]
pipeline.config.anonymization.gapSize.unit=seconds

# Min size of gaps between anonymized trajectories [Required]
pipeline.config.anonymization.gapSize.min=

# Max size of gaps between anonymized trajectories [Required]
pipeline.config.anonymization.gapSize.max=

# Unit of measurement for "skipFirst.time" values "min" and "max" (only "seconds" supported) [Optional]
pipeline.config.anonymization.skipFirst.time.unit=seconds

# Min amount of data to be removed at the start of the raw trajectory [Optional]
pipeline.config.anonymization.skipFirst.time.min=

# Max amount of data to be removed at the start of the raw trajectory [Optional]
pipeline.config.anonymization.skipFirst.time.max=

# Unit of measurement for "min" and "max" speed values (only "km/h" supported) [Optional]
pipeline.config.anonymization.skipFirst.speed.unit=km/h

# All data with the speed value missing or less than the configured value will be removed [Optional]
pipeline.config.anonymization.skipFirst.speed.min=

# At the start of the raw trajectory, all data with the speed value missing or less than
# the configured value will be removed [Optional]
pipeline.config.anonymization.skipFirst.speed.max=

# Unit of measurement for "skipFirst.proximity" values "min" and "max" (only "meters" supported) [Optional]
pipeline.config.anonymization.skipFirst.proximity.unit=meters

# Min proximity to the start of the raw trajectory within which data will be removed [Optional]
pipeline.config.anonymization.skipFirst.proximity.min=

# Max proximity to the start of the raw trajectory within which data will be removed [Optional]
pipeline.config.anonymization.skipFirst.proximity.max=

# The 'skipUntil' condition is required when multiple 'skipFirst' conditions are provided
# (conditions include: proximity, speed, or time). Supported operators are 'and' and 'or'.
# Example: 'skipUntil' = '(proximity or speed) and time'. Optional for a single condition. [Optional]
pipeline.config.anonymization.skipFirst.skipUntil=

# Unit of measurement for the sampling rate values (only "seconds" supported) [Optional]
pipeline.config.anonymization.samplingRate.unit=seconds

# Min time between adjacent points in anonymized trajectories [Optional]
pipeline.config.anonymization.samplingRate.min=

# Max time between adjacent points in anonymized trajectories [Optional]
pipeline.config.anonymization.samplingRate.max=
```
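To make the shape of a complete configuration concrete, here is an illustrative set of the required values only. The numbers are placeholders, not recommendations; choose values appropriate to your anonymity and data utility requirements and review the output, as advised above.

```properties
pipeline.config.useCase.type=TrafficInformation
pipeline.config.useCase.dataType=NearRealTime
pipeline.config.useCase.dataFormat=SENSORIS
pipeline.config.anonymization.type=SplitAndGap
pipeline.config.anonymization.subTrajectorySize.unit=seconds
pipeline.config.anonymization.subTrajectorySize.min=120
pipeline.config.anonymization.subTrajectorySize.max=300
pipeline.config.anonymization.gapSize.unit=seconds
pipeline.config.anonymization.gapSize.min=60
pipeline.config.anonymization.gapSize.max=120
```

Larger gap sizes relative to sub-trajectory sizes generally increase anonymity at the cost of data utility.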
To deploy and run the Real-Time Anonymizer, you can use the Wizard Deployer. The Wizard runs interactively, asking questions about the application and expecting the user to provide the requested information.
Follow the instructions to set up the Wizard Deployer and configure needed parameters beforehand. Then follow these steps:
The Wizard Deployer will ask for an output catalog HRN and layer ID. To use an existing catalog and layer, provide their HRN and ID. Otherwise, a new catalog and layer will be created with the HRN and layer ID you enter.
You can use your existing output layer or let the Wizard create a new catalog/layer for you. If using an existing catalog, make sure it is shared with the GROUP_ID that will be used for this deployment.
To process your data in a reasonable time frame, you may need to tune the number of cores used for processing.
You can start with a default configuration of one core and use the Flink Dashboard to monitor memory utilization and data distribution in your running pipeline. You should also monitor the Splunk logs for errors and exceptions; for instance, an OutOfMemoryError exception most likely indicates that more processing power is needed for the amount of input data submitted.
It is important to test that the Real-Time Anonymizer pipeline has been set up correctly. To test the pipeline, input data (SENSORIS or SDII data messages) should be published to the input stream layer. There are a number of options for publishing data messages to a stream layer:
The recommended approach is using OLP CLI.
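As a sketch of the recommended approach (the HRN, layer ID, and file name are placeholders; verify the exact command and flags against the OLP CLI documentation), a local protobuf-encoded SENSORIS message could be published to the input stream layer like this:

```shell
# Publish a local SENSORIS data message to the pipeline's input stream layer:
olp catalog layer stream put hrn:here:data::olp-here:my-input-catalog input-stream \
    --input ./sample-sensoris-message.bin
```

After publishing, the anonymized result should appear in the output stream layer once the pipeline has processed it.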
In the Platform Portal, select the Pipelines tab, where you can see your pipeline deployed and running. After your pipeline starts and your first data is published, you can find your output catalog under the Data tab, or query/retrieve your data programmatically using one of the following options:
This can be quickly achieved using OLP CLI and the following command:
```shell
olp catalog layer stream get <output catalog HRN> <output layer ID>
```
Executing this pipeline template will incur the following costs:
Cost will depend on the settings of your input stream layer.
Cost will depend on the amount of data that will be published to a stream layer as an output from execution.
Cost will depend on the amount of:
Cost will depend on the amount of data that needs to be processed. More data will require more processing power and will take longer to finish.
Cost will depend on the log level set for the execution of this pipeline template. To minimize this cost, set the log level to WARN.
If you need support with this pipeline template, please contact us.