The small files problem is well known in the big data domain. When data is broken down into a large number of small or very small files, processing becomes very inefficient. In HERE platform indexing, excessive partitioning can cause this problem: in an index layer, the more attributes and attribute values there are, the more likely the small files problem is to appear.
The HERE platform provides a simple and scalable solution to compact files with the same index attribute values into one or more files based on the configuration. This solution reduces the index layer storage cost, improves query performance, and also makes subsequent data processing more efficient.
The easiest way to create a compaction application is to start with one of the Reference Examples. These examples show how to use the Index Compaction Library to compact data.
To create your own compaction application, you should do the following:
The Index Compaction Library requires you to implement the merge function, which takes the following parameters:
keys - a collection containing the index attribute values shared by a group of files to be compacted
files - a collection containing the indexed files to be merged for query efficiency
If you choose to handle errors yourself, ensure that the merge function returns a null value; otherwise, the result is not treated as an error. For example, returning an empty byte array is not an error scenario for the Index Compaction Library. The library checks for a null value from the merge function, logs the failure, and then continues compaction for the remaining records with different grouping keys.
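The merge and error-handling contract described above can be sketched as follows. This is a minimal, self-contained illustration, not the library's actual interface: the merge signature, its parameter types, and the ConcatMergeExample class are assumptions made for this sketch, so consult the Index Compaction Library API Reference for the real user-defined function contract. The sketch concatenates the grouped files into one payload and returns null on failure, matching the null-means-error convention described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for the library's merge UDF; names and types are
// illustrative only -- see the Index Compaction Library API Reference.
public class ConcatMergeExample {

    // keys:  the index attribute values shared by this group of files
    // files: the indexed files (as raw bytes) to be compacted into one
    static byte[] merge(Map<String, Object> keys, Iterator<byte[]> files) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            while (files.hasNext()) {
                out.write(files.next());
            }
            return out.toByteArray(); // one compacted payload for the group
        } catch (IOException e) {
            // Returning null marks this group as failed; the library logs the
            // failure and continues with the remaining grouping keys.
            return null;
        }
    }

    public static void main(String[] args) {
        Iterator<byte[]> files =
                Arrays.asList("part1;".getBytes(), "part2;".getBytes()).iterator();
        byte[] merged = merge(Map.of("tileId", 12345), files);
        System.out.println(new String(merged)); // prints "part1;part2;"
    }
}
```

Concatenation is only a placeholder merge strategy; a real implementation would combine the files according to their format, for example by merging Parquet row groups or appending Protobuf records.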
For complete details, see Index Compaction Library API Reference.
You should provide application-specific configuration in the application.conf file. For more details about the process, see Configuration.
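As a rough illustration of shape only, an application.conf might look like the sketch below. Every key name here is a hypothetical placeholder, not the library's actual configuration schema; see Configuration for the real keys and values.

```hocon
# Illustrative only -- these key names are hypothetical placeholders,
# not the actual Index Compaction Library schema (see Configuration).
compaction-app {
  input-catalog-hrn  = "hrn:here:data::example:input-catalog"
  index-layer-id     = "index-layer"
  output-catalog-hrn = "hrn:here:data::example:output-catalog"
}
```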
You should package an executable jar that includes the user-defined function implementation, the configuration, and all transitive dependencies, except the Spark and logging dependencies.
If you use the mvn build tool, this can be accomplished with the following command:
mvn clean package
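One common way to produce such a jar with Maven is the maven-shade-plugin. The pom.xml fragment below is a starting-point sketch: the artifact patterns under excludes are illustrative, and the exact exclusion list depends on your project's dependency tree.

```xml
<!-- Sketch: shade all transitive dependencies into one executable jar,
     excluding Spark and logging artifacts (exclusion list is illustrative). -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <excludes>
            <exclude>org.apache.spark:*</exclude>
            <exclude>org.slf4j:*</exclude>
            <exclude>log4j:*</exclude>
          </excludes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this plugin bound to the package phase, the `mvn clean package` command above produces the shaded jar in the target directory.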
The compaction pipeline must have read and write access to input and output catalogs. For instructions on how to grant access, see Share a Catalog.
To run the application, you should create a pipeline in the HERE Workspace. For instructions, see Pipelines Developer's Guide.
Once the compaction pipeline has completed, you can query the compacted data.
Reference examples are available for compacting data in the Parquet and Protobuf formats. They can be found in the HERE Data SDK for Java & Scala. Each example contains a README.md file with instructions for running it.