Data API best practices

The Data API is a REST interface that provides access to data and data management functions. To simultaneously provide robustness, scalability, efficient storage capabilities and flexible query possibilities, the Data API design is based on the principle of separating data (blobs) from metadata (partitions).

The following shows the standard flows for consuming and producing data with the Data API:

Data consumer flow:

  • Discover metadata (partitions)
  • Fetch blobs (data) for each partition

Data producer flow:

  • Publish data, collect metadata
  • Publish metadata

Uploading or retrieving data is typically a two-step process. To retrieve data, applications first need to discover blob IDs (dataHandle) by querying the Metadata API, Query API or Index API and then fetch data referenced by these dataHandles via the Blob API or Volatile Blob API.

For uploading data to the Data API, the process is reversed. The application first uploads data to the Blob API or Volatile Blob API, collecting metadata as it goes, and then uploads the metadata (in batches) via the Publish API.
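The two flows can be sketched as follows. This is a minimal illustration under stated assumptions, not HERE client code: `discover_partitions`, `fetch_blob`, `upload_blob`, and `publish_metadata` are hypothetical callables standing in for requests to the metadata-type APIs, the Blob API, and the Publish API. The metadata field names and the batch size are also assumptions.

```python
from typing import Callable, Dict, List


def consume(discover_partitions: Callable[[], Dict[str, str]],
            fetch_blob: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Consumer flow: discover metadata first, then fetch each blob."""
    # Step 1: partition ID -> dataHandle (hypothetical metadata lookup).
    handles = discover_partitions()
    # Step 2: fetch the data that each dataHandle refers to.
    return {pid: fetch_blob(handle) for pid, handle in handles.items()}


def produce(payloads: Dict[str, bytes],
            upload_blob: Callable[[bytes], str],
            publish_metadata: Callable[[List[dict]], None],
            batch_size: int = 100) -> None:
    """Producer flow: upload data, collect metadata, publish it in batches."""
    batch: List[dict] = []
    for pid, data in payloads.items():
        handle = upload_blob(data)
        batch.append({"partition": pid, "dataHandle": handle})
        if len(batch) == batch_size:
            publish_metadata(batch)
            batch = []
    if batch:
        # Publish whatever is left over after the last full batch.
        publish_metadata(batch)
```

Passing the API calls in as functions keeps the ordering of the two steps visible and makes the flow easy to exercise without network access.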

To achieve better performance in some instances, it is possible to combine metadata and data and pass them as a single message, as in stream processing, or to skip the discover metadata step by preloading or caching the required working set.

API lookup

In the HERE platform, the base URLs for a REST API are unique for each catalog and may differ between catalogs. For example, the Blob API of one catalog can have a different base URL than the Blob API of another catalog. Use the API Lookup service to get the actual base URLs for the Data API.

Once you have the base URLs for a specific catalog, you can cache them according to the Cache-Control HTTP header, typically for an hour. There is no need to perform an API lookup request each time you make a request to a Data API.
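A minimal sketch of such a cache is shown here. It assumes a hypothetical `lookup` function that returns the base URLs for a catalog together with the value of the response's Cache-Control header. Only the `max-age` directive is honored, and the URL in the usage example is a placeholder.

```python
import re
import time
from typing import Callable, Dict, Optional, Tuple


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age directive of a Cache-Control header, if present."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


class BaseUrlCache:
    """Caches lookup results per catalog until their max-age expires."""

    def __init__(self, lookup: Callable[[str], Tuple[Dict[str, str], str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # `lookup` is hypothetical: catalog ID -> (base URLs, Cache-Control value).
        self._lookup = lookup
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Dict[str, str]]] = {}

    def get(self, catalog: str) -> Dict[str, str]:
        entry = self._entries.get(catalog)
        if entry and self._clock() < entry[0]:
            return entry[1]  # Still fresh: no lookup request needed.
        urls, cache_control = self._lookup(catalog)
        # Without a max-age directive the entry expires immediately.
        ttl = max_age_seconds(cache_control) or 0
        self._entries[catalog] = (self._clock() + ttl, urls)
        return urls
```

The injectable `clock` is there only so that expiry can be tested without waiting.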

Config API

The Config API provides basic catalog management operations. As with the API Lookup service, Config API responses are cacheable. HERE recommends requesting the catalog configuration only once per instance or Spark or Flink worker node, and reusing the cached catalog configuration objects as needed.

The API Lookup and Config API services do not require any special performance considerations due to the nature of their low request rates and small response sizes.

Note

The /config/v1/status/{status_token} endpoint is used to determine if the request succeeded, failed, or is still being processed (pending). While in the pending state, the request should not be retried.
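A polling loop along these lines avoids re-submitting a pending request. It is a sketch under assumptions: `get_status` is a hypothetical callable that queries the status endpoint for one token, the status strings `"pending"`, `"succeeded"`, and `"failed"` are illustrative names rather than confirmed response values, and the interval and poll limit are arbitrary.

```python
import time
from typing import Callable


def wait_for_completion(get_status: Callable[[], str],
                        poll_interval: float = 2.0,
                        max_polls: int = 150,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the request leaves the pending state; never re-submit it."""
    for _ in range(max_polls):
        status = get_status()
        if status != "pending":
            return status  # Final state, for example succeeded or failed.
        sleep(poll_interval)
    raise TimeoutError("request still pending after %d polls" % max_polls)
```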

For consecutive delete and create requests, the recommendation is to wait 60 seconds due to the nature of the system.

Metadata and APIs

The Metadata API, Query API, and Index API provide flexible ways of discovering metadata (partitions) optimized for specific use cases.

Metadata API and Query API

When working with versioned layers, the Metadata API supports consolidated views of partitions representing the latest logically consistent version of the catalog. It also supports discovering changes between multiple catalog versions, which enables incremental compilations of map content.

The Query API provides very similar functionality to the Metadata API, but is optimized for map-panning and interactive UI use cases that require queries on quadtree levels.

Note

For the Metadata and Query APIs, queries should not exceed 500 requests/sec per client appId.
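One way to stay under this limit on the client side is a token bucket, sketched below. The class and its parameters are illustrative and not part of any HERE library. Note that a bucket with capacity 500 still allows a burst of up to 500 requests at once, so a smaller capacity may be preferable.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side limiter allowing `rate` requests per second on average."""

    def __init__(self, rate: float = 500.0, capacity: float = 500.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; False means the caller should wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A caller checks `try_acquire()` before each Metadata API or Query API request and sleeps briefly when it returns `False`.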

Index API

The Index API supports RSQL queries to retrieve matching partitions from an index layer. You can significantly improve query time by optimizing your data model. HERE recommends that you evaluate your workload, for example, how the data will be queried, read/write ratio, and other similar factors before designing your data model. The Index API supports defining up to four custom indices that allow for very fine-grained data retrieval workflows.

The Index API, in combination with the Data Archiving Library (part of the HERE Data SDK for Java & Scala), provides a way to create an archive of messages ingested via a stream layer.

Note

The Index Layer Service can return HTTP 429/503 when service utilization is high. Throttling should last for only a short period of time.

The client should retry the request using an exponential backoff algorithm while the service adapts to the high utilization.

If the service still returns HTTP 429/503 after an extended period of throttling and retries, contact HERE technical support.

Data and APIs

Blob API, Volatile Blob API, and Interactive API provide APIs to retrieve actual data.

Blob API and Volatile Blob API

The Blob API and Volatile Blob API are optimized for high-load and high-throughput workflows. To achieve the best performance, applications should use multiple connections to retrieve or upload data in parallel. HERE recommends using a pool of HTTP connections and re-using each connection for multiple requests.

Note

To maintain reasonable quality of service (QoS), the Blob API will reject slow write requests with the HTTP 408 Request timeout status code if the average upload speed is lower than 50 kB/sec. This can be an issue when working with Data API in the China region.
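The minimum speed translates into a time budget per upload, which can be estimated before sending. The helper names below are illustrative, and the calculation assumes 1 kB = 1,000 bytes.

```python
MIN_UPLOAD_SPEED_BYTES_PER_SEC = 50_000  # 50 kB/sec, assuming 1 kB = 1,000 bytes


def max_upload_seconds(payload_bytes: int) -> float:
    """Longest upload duration that keeps the average speed at the minimum."""
    return payload_bytes / MIN_UPLOAD_SPEED_BYTES_PER_SEC


def is_upload_fast_enough(payload_bytes: int, elapsed_seconds: float) -> bool:
    """True if the average speed was at least the minimum upload speed."""
    return elapsed_seconds <= max_upload_seconds(payload_bytes)
```

For example, `max_upload_seconds(10_000_000)` gives 200.0, so a 10 MB blob has to finish uploading within about 200 seconds.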

Additionally, when dealing with batch and streaming workloads, it is hard to anticipate a “traffic spike”, a sudden surge in demand that typically doubles the existing traffic levels in a very short period of time. To continue functioning and meet the existing service level agreement (SLA), the system can return HTTP 429/503 status codes for short periods of time, usually up to 10 minutes, and throttle requests from one or more users while it adapts to new request rates.

Interactive API

The Interactive API supports queries by property and spatial queries. If you expect more than 100 MB of data in a response, you can iterate over the result data to retrieve portions in subsequent requests. You can use the header ‘accept-encoding: gzip’ to reduce network traffic.

You can improve the query time by adding relevant properties to the searchableProperties list when configuring your layer. These properties are indexed.

Note

When the layer contains fewer than 10,000 features, all properties are searchable.

Indexing happens continuously, and no cost for data I/O is incurred. However, indexes are added to the data stored.

The maximum number of searchable properties is eight. This is the maximum number of user-added and automatically indexed properties combined.

Streams and APIs

The Data API provides two options for working with stream layers via REST APIs: the Ingest API for publishing and the Stream API for consumption. Direct Kafka access using the binary Kafka protocol is also available.

Stream API

To have full control and achieve the best possible performance, HERE recommends using direct Kafka access and falling back to the REST APIs only when direct Kafka access is not available, such as when using proxy settings or firewalls.

The maximum throughput and parallelization for stream layers is set during stream layer creation. You can specify the maximum throughput for data going into the layer and, separately, the maximum throughput for data going out of the layer.

The service begins throttling inbound messages when the inbound rate exceeds the inbound throughput. The service begins throttling outbound messages when the total outbound rate to all consumers exceeds the outbound throughput. When throttling occurs, the service response is delayed, but no messages are dropped.

The maximum message size for a stream layer is 1 MB. For messages larger than 1 MB, HERE recommends that you upload the data to the Blob API first and pass the message in the stream by reference (data handle). If you are using the HERE Data SDK for Java & Scala, this is done automatically for you.
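When not using the SDK, the by-reference pattern could look like the following sketch. It assumes a hypothetical `upload_blob` callable that stores the payload via the Blob API and returns its data handle. The envelope keys and the exact byte value used for 1 MB are also assumptions.

```python
from typing import Callable, Dict, Union

MAX_STREAM_MESSAGE_BYTES = 1024 * 1024  # 1 MB; the exact byte count is an assumption


def to_stream_message(payload: bytes,
                      upload_blob: Callable[[bytes], str]
                      ) -> Dict[str, Union[bytes, str]]:
    """Send small payloads inline and large payloads by reference."""
    if len(payload) <= MAX_STREAM_MESSAGE_BYTES:
        return {"data": payload}
    # Too large for the stream: store the data as a blob and pass its handle.
    return {"dataHandle": upload_blob(payload)}
```

A consumer then checks which key is present and fetches the blob only when it receives a data handle.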

Design principles

The following guidelines help optimize performance when building applications that upload and retrieve data from the Data API:

Reduce chatty interactions

Avoid designing interactions where an application must make multiple calls to the Data API (each of which returns a small amount of data). Instead, combine several related operations into a single request to reduce the number of round trips and resource locking.

Additionally, select lower zoom levels when producing map tiles, strive for a higher data-to-metadata ratio, and apply the adaptive leveling feature of the Data Processing Library.

Request parallelization for high throughput

The Data API is a large distributed system. To help take advantage of its scale, we encourage you to horizontally scale parallel requests to the Data API service endpoints. For high-throughput transfer applications, you should use multiple connections to retrieve or upload data in parallel. If your application issues requests directly to the Data API using the REST API, we recommend using a pool of HTTP connections and re-using each connection for multiple requests. Avoiding per-request connection setup removes the need to perform TCP slow-start and Secure Sockets Layer (SSL) handshakes on each request.
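Parallel retrieval can be sketched with a bounded thread pool from the Python standard library. Here `fetch_blob` is a hypothetical callable; in a real application it would issue its requests through one shared, pooled HTTP client so that connections are reused. The default worker count is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(data_handles: Iterable[str],
              fetch_blob: Callable[[str], bytes],
              max_workers: int = 16) -> Dict[str, bytes]:
    """Fetch many blobs concurrently with a bounded number of worker threads."""
    handles = list(data_handles)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `handles`.
        blobs = list(pool.map(fetch_blob, handles))
    return dict(zip(handles, blobs))
```

Bounding `max_workers` keeps the number of simultaneous connections predictable, which makes it easier to tune against the measured bottleneck.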

Performance profiling and load testing

Do performance profiling and load testing during development, as part of test routines, and before final release to ensure the application performs and scales as required. When optimizing performance, look at network throughput, CPU, and RAM requirements.

Measuring performance is important when you tune the number of requests to issue to the Data API concurrently. Measure the network bandwidth achieved by a single request and the other resources that your application uses to process the data. You can then identify the bottleneck resource (that is, the resource with the highest usage), and hence the number of concurrent requests that is likely to be useful. Even a small number of concurrent requests can saturate a 10 Gb/s network interface card (NIC): for example, 20 concurrent requests at 50-80 MB/s each. Too little parallelism leaves resources underutilized, while too much causes resource congestion.
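The NIC arithmetic can be checked with a small helper. The function name is illustrative, and the conversion assumes decimal units (1 Gb/s = 1,000 Mb/s = 125 MB/s).

```python
import math


def requests_to_saturate(nic_gbit_per_sec: float,
                         per_request_mbyte_per_sec: float) -> int:
    """Number of concurrent requests whose combined throughput fills the NIC."""
    nic_mbyte_per_sec = nic_gbit_per_sec * 1000 / 8  # 10 Gb/s -> 1,250 MB/s
    return math.ceil(nic_mbyte_per_sec / per_request_mbyte_per_sec)
```

For a 10 Gb/s NIC this gives 25 requests at 50 MB/s each and 16 requests at 80 MB/s each, which brackets the figure of 20 concurrent requests.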

Timeouts and retries

There are certain situations where an application receives a response from the Data API indicating that a retry is necessary. Responses with HTTP status codes 408, 429, 500, 502, 503, and 504 are retriable. If an application generates high request rates, it might receive such responses. If these errors occur, the HERE Data SDK for Java & Scala automatically retries using exponential backoff. If you are not using the HERE Data SDK for Java & Scala, implement similar retry logic when receiving one of these errors.
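A retry loop with exponential backoff might look like the sketch below. It is not the SDK's implementation: `send` is a hypothetical callable that performs one request and returns a status code and body, the added jitter is a common refinement rather than something this guide prescribes, and the retry count and delays are placeholders to tune per workload.

```python
import random
import time
from typing import Callable, Tuple

RETRIABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}


def call_with_retries(send: Callable[[], Tuple[int, bytes]],
                      max_retries: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Tuple[int, bytes]:
    """Call `send` until it returns a non-retriable status or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRIABLE_STATUS_CODES or attempt == max_retries:
            return status, body
        # Exponential backoff with jitter: wait up to base_delay * 2^attempt
        # seconds, capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
    raise ValueError("max_retries must be >= 0")
```

Batch jobs can raise `max_retries` and `max_delay`, while latency-sensitive callers can lower them.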

The Data API automatically scales in response to sustained new request rates, dynamically optimizing performance. While Data API is internally optimizing for a new request rate, you will temporarily receive HTTP error responses until the optimization completes.

For batch processing, use longer retry times or increase the maximum number of retries so that intermittent network errors or spikes of HTTP errors do not affect multi-hour batch processing jobs.

For latency-sensitive applications, it is advisable to use shorter timeouts and retry slow operations. When you retry a request, HERE recommends using a new connection to the Data API and potentially performing a fresh DNS lookup.

Compress data or use an efficient binary format

The largest volume of data in an application is often the HTTP responses to client requests generated by the application and passed over the network. Minimizing the response size reduces the load on the network and optimizes storage size and transfer I/O. Enabling layer compression can considerably reduce response sizes.

Note

You cannot update the compression attribute once the layer is created.

If you are using the HERE Data SDK for Java & Scala to read or write data from a compressed layer in the Data API, compression and decompression are handled automatically.

Some formats, especially textual formats such as text, XML, JSON, and GeoJSON, have very good compression rates. Other data formats are already compressed, such as JPEG or PNG images, so compressing them again with gzip will not result in reduced sizes. Often, compressing them again will increase the size of the payload. For general-purpose binary formats such as protobuf, compression rates depend on the actual content and message size. Layer compression should not be used for Parquet, as it breaks random access to blob data, which is necessary to efficiently read data in Parquet.

Data compression can reduce the volume of data transmitted and minimize transfer time and costs. However, the compression and decompression processes incur overhead. Compression should only be used when there is a demonstrable gain in performance.
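A quick measurement along these lines can show whether compression pays off for a given payload type. The sample payloads are synthetic: the JSON list stands in for textual data, and random bytes stand in for an already-compressed format such as JPEG.

```python
import gzip
import json
import os


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size; below 1.0 means gzip helped."""
    return len(gzip.compress(payload)) / len(payload)


# Repetitive textual data, similar in spirit to JSON or GeoJSON content.
text_payload = json.dumps(
    [{"id": i, "type": "Feature"} for i in range(1000)]
).encode("utf-8")
# Random bytes have no redundancy left, like already-compressed data.
random_payload = os.urandom(10_000)

print("JSON ratio:   %.2f" % compression_ratio(text_payload))
print("random ratio: %.2f" % compression_ratio(random_payload))
```

Running the same measurement on representative samples of your own data, together with timing the compress and decompress steps, gives a basis for deciding whether to enable layer compression.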

Use caching for frequently accessed content

Many applications that store data in the Data API work with location-centric or geospatial data, usually serving “hot areas” (city centers, industrial areas, and so on). These hot areas are repeatedly requested by users and are the best candidates for caching. Applications that use caching also send fewer direct requests to the Data API, which can help reduce transfer I/O costs.
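An in-memory cache for hot areas can be as small as the sketch below, which wraps a hypothetical `fetch_blob` callable in the standard library's LRU cache. It assumes that the content behind a data handle does not change while it is cached, which may not hold for every layer type.

```python
from functools import lru_cache
from typing import Callable


def cached_fetcher(fetch_blob: Callable[[str], bytes], max_entries: int = 1024):
    """Wrap a blob fetcher in an in-memory LRU cache keyed by data handle."""
    # Least recently used entries are evicted once max_entries is reached.
    return lru_cache(maxsize=max_entries)(fetch_blob)
```

The returned function exposes `cache_info()`, so the hit rate can be monitored to size `max_entries` for the actual working set.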

Applications working with the Data API should also respect the Cache-Control HTTP Header, which contains directives (instructions) for caching in both requests and responses.

Use multipart uploads

You can improve the upload experience for larger data blobs (50 MB+) by using the Data API multipart uploads feature. This feature uploads separate parts of a large blob independently, in any order and in parallel.
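The splitting and parallel upload of parts can be sketched as follows. `upload_part` is a hypothetical callable that sends one numbered part and returns an identifier for it. The part size and worker count are assumptions, and the calls that start and complete a multipart upload are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_into_parts(blob: bytes, part_size: int) -> List[Tuple[int, bytes]]:
    """Split a blob into (part_number, chunk) pairs, numbered from 1."""
    return [(index + 1, blob[offset:offset + part_size])
            for index, offset in enumerate(range(0, len(blob), part_size))]


def upload_in_parts(blob: bytes,
                    upload_part: Callable[[int, bytes], str],
                    part_size: int = 5 * 1024 * 1024,
                    max_workers: int = 4) -> List[str]:
    """Upload all parts concurrently; return their identifiers in part order."""
    parts = split_into_parts(blob, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Parts may finish in any order; map() returns results in part order.
        return list(pool.map(lambda part: upload_part(*part), parts))
```

Because each part is independent, a failed part can be retried on its own without re-sending the rest of the blob.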

Use byte-range fetches

The Data API supports retrieving data or metadata using the Range HTTP header where appropriate. You can fetch a byte-range from an object, transferring only the specified portion. Using Range HTTP Header allows your application to improve retry times when requests are interrupted.
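The header values follow the standard HTTP byte-range syntax, in which ranges are inclusive and an open-ended range omits the end position. The helper names are illustrative.

```python
from typing import Dict, Optional


def range_header(start: int, end: Optional[int] = None) -> Dict[str, str]:
    """Build a Range header for bytes start..end (inclusive).

    The range is open-ended when `end` is None.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    suffix = "" if end is None else str(end)
    return {"Range": "bytes=%d-%s" % (start, suffix)}


def resume_header(bytes_already_received: int) -> Dict[str, str]:
    """Header for continuing an interrupted download from where it stopped."""
    return range_header(bytes_already_received)
```

For example, `range_header(0, 1023)` requests the first 1,024 bytes, and `resume_header(4096)` yields `{"Range": "bytes=4096-"}` to fetch everything after the first 4,096 bytes.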

Use the latest version of the HERE Data SDK for Java & Scala

The HERE Data SDK for Java & Scala provides built-in support for many of the recommended guidelines for optimizing Data API performance.

The HERE Data SDK for Java & Scala provides a simpler API for taking advantage of the Data API from within an application, and is regularly updated to follow the latest best practices. For example, the Data SDK includes logic to automatically retry requests on intermittent network issues and HTTP 5xx errors. It also provides functionality that automates horizontal scaling of connections to achieve thousands of requests per second, using byte-range requests where appropriate. It is important to use the latest version of the HERE Data SDK for Java & Scala to obtain the latest performance optimization features.

You can also optimize performance when you are using HTTP REST API requests. When using the REST API, follow the same best practices that are outlined in this section.

Considerations when designing data models for versioned layers

The Data API provides access to partitioned data. However, you must decide on the partitioning scheme that best suits your use-case. Well-partitioned data can reduce costs and improve the performance of applications you build on top of the Data API. When deciding on a partitioning approach, consider the following:

  • End users of your application (end user application, map compilation system or other)
  • Cost considerations for large numbers of partitions
  • Interdependencies between layers

Most applications benefit from partitioning approaches that produce partitions of homogeneous size. This gives your applications more predictable performance when processing the data, which in turn leads to less downtime between your compilation stages. For the HERE Tile partitioning scheme, this may mean using multiple zoom levels for different map regions, with denser tiling in areas with higher amounts of data. For more information, see Partitions.

Extremely fine-grained partitioning approaches may lead to additional costs for both data access and data storage of your metadata, and may lead to performance degradation. HERE recommends lower zoom levels (larger tiles) whenever possible.

HERE recommends keeping your partitions below 100 MB in size, since the tail latency of requests for larger partitions may negatively affect user experiences on the portal and devices.

If your application references multiple layers, consider the partitioning approach between layers, as having the same partitioning approach in layers that are frequently consumed together is beneficial.
