Compact Index Layer

Why Compact an Index Layer

The small files problem is well known in the big data domain: when data is split across a large number of small or very small files, processing them becomes very inefficient. In HERE platform indexing, excessive partitioning can cause this problem. In an index layer, the more index attributes and attribute values there are, the more likely the small files problem is to appear.

The HERE platform provides a simple and scalable solution that compacts files sharing the same index attribute values into one or more larger files, based on your configuration. This reduces index layer storage cost, improves query performance, and makes subsequent data processing more efficient.

How to Use the Index Compaction Library

Step 1: Create a Compaction Application

The easiest way to create a compaction application is to start with one of the Reference Examples. These examples show how to use the Index Compaction Library to compact data.

To create your own compaction application, you should do the following:

1a: Implement a User-Defined Function

The Index Compaction Library requires you to implement the merge API:

  • merge(keys, files) should merge all files of a group in a user-defined format and return the result as an array of bytes
    • keys - the index attribute values shared by the group of files to be compacted
    • files - the indexed files to be merged for query efficiency

If you choose to handle errors yourself, make sure that merge returns a null value; any non-null return value is not treated as an error. For example, if you return an empty byte array, the Index Compaction Library does not consider this an error scenario. When the library receives null from the merge function, it logs the failure and continues compaction for the remaining records with different grouping keys.
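As a minimal, self-contained sketch of this contract (the class and method names here are illustrative, not the library's actual interface; see the Index Compaction Library API Reference for the real signature), a merge implementation for newline-delimited records could concatenate the input files and return null on failure:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical merge UDF sketch: concatenates newline-delimited record
// files for one grouping key into a single byte array.
public class SimpleMerger {

    public static byte[] merge(List<String> keys, List<byte[]> files) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (byte[] file : files) {
                out.write(file);
            }
            // A non-null result (even an empty array) is treated as success.
            return out.toByteArray();
        } catch (Exception e) {
            // Returning null signals an error for this group; the library
            // logs it and continues with the next grouping key.
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] a = "rec1\n".getBytes();
        byte[] b = "rec2\n".getBytes();
        byte[] merged = merge(List.of("tileId=123"), List.of(a, b));
        System.out.print(new String(merged)); // the merged records
    }
}
```

The key point of the sketch is the error convention: only a null return value marks the group as failed.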

For complete details, see Index Compaction Library API Reference.

1b: Provide Configuration for the Compaction Application

You should provide application-specific configuration in the application.conf file. For more details about the process, see Configuration.
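For orientation only, an application.conf in HOCON format might group settings like the following. Every key name below is a hypothetical placeholder; the actual schema is defined on the Configuration page.

```
// application.conf sketch — all key names are hypothetical placeholders;
// consult the Configuration page for the real keys and values.
compaction {
  input-catalog-hrn  = "hrn:here:data::myrealm:input-catalog"   // hypothetical key
  output-catalog-hrn = "hrn:here:data::myrealm:output-catalog"  // hypothetical key
  layer-id           = "my-index-layer"                          // hypothetical key
}
```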

1c: Package the Application into a Fat JAR File

You should package an executable JAR that includes the user-defined function implementation, the configuration, and all transitive dependencies, except for Spark and logging dependencies.

For the mvn build tool, this can be accomplished with the following command:

mvn clean package
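One common way to produce such a fat JAR with Maven is the maven-shade-plugin, with Spark dependencies marked as provided so they are excluded from the package. The fragment below is a generic sketch, not the project's actual pom.xml:

```
<!-- pom.xml fragment (illustrative). Spark is scoped "provided"
     so it is left out of the shaded fat JAR. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <scope>provided</scope>
</dependency>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

With this in place, `mvn clean package` produces the shaded JAR in the target directory.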

Step 2: Set Permissions

The compaction pipeline must have read and write access to input and output catalogs. For instructions on how to grant access, see Share a Catalog.

Step 3: Deploy a Pipeline

To run the application, you should create a pipeline in the HERE Workspace. For instructions, see Pipelines Developer's Guide.

Step 4: Monitor a Pipeline

On the Pipelines page, find your pipeline and ensure that it is in the Running state. For additional information on monitoring pipelines, see Pipeline Monitoring.

Step 5: Query Index Layer

Once the compaction pipeline has completed, you can query the compacted data in the index layer.

Reference Examples

There are reference examples for compacting data in Parquet and Protobuf formats. You can find them in the HERE Data SDK for Java & Scala. Each example contains a file with instructions on how to run it.
