Read Index Layer Data
The Data Client Library provides the class LayerDataFrameReader, a custom Spark DataFrameReader for creating DataFrames that contain the data of all supported layer types, including index layers.
All the formats supported by DataFrameReader are also supported by the LayerDataFrameReader. Additionally, it supports formats such as Apache Avro, Apache Parquet, Protobuf, and raw byte arrays (octet-stream).
When reading from an index layer, certain restrictions apply. See Get Data from an Index Layer for information about restrictions and known limitations.
Read process
The read operation works according to the following steps:
- The Spark connector first communicates with the server to retrieve information about the layer, such as its type, schema, and encoding format.
- Partitions within the layer are filtered using the provided filter query. If no query is provided, the default value "timestamp=ge=0" is used, which matches all partitions (see the sketch after this list).
- At this stage, the layer format is known, so the connector creates the corresponding Spark file format and, from the partition data, obtains an iterator of rows (records).
- Some implicit columns are added to each row, depending on the layer type and partition metadata.
- The resulting rows are handed over to the Spark framework, which returns the finalized DataFrame.
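As a minimal sketch of the default behavior (catalogHrn and layerId are placeholders for your own values):
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.DataFrame

// No .query(...) is set, so the connector falls back to the default filter
// "timestamp=ge=0" and matches all partitions in the index layer.
val allPartitions: DataFrame = sparkSession
  .readLayer(catalogHrn, layerId)
  .load()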
Spark supports querying index data in parallel: define the desired number of parts and add the olp.connector.query-parallelism option to the reader.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.{DataFrame, SparkSession}

val reader = sparkSession
  .readLayer(catalogHrn, layerId)
  .format("raw")
  .query(
    "tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
  .option("olp.connector.metadata-columns", true)
  .option("olp.connector.query-parallelism", 100)
// For compressed payloads, allow more time for decompression (in milliseconds).
if (compressed)
  reader.option("olp.connector.data-decompression-timeout", 1200000)
val df: DataFrame = reader.load()
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

Dataset<Row> df =
    JavaLayerDataFrameReader.create(sparkSession)
        .readLayer(catalogHrn, layerId)
        .format("raw")
        .query(
            "tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
        .option("olp.connector.metadata-columns", true)
        .option("olp.connector.query-parallelism", 100)
        .load();
DataFrame columns
Besides the user-defined columns that derive from the partition data, the Spark connector provides additional columns that represent the data partitioning information and the partition payload attributes.
Data columns
Correspond to the user-defined columns and derive from the partition data.
Layer partitioning columns
Correspond to the user-defined index layer partitioning columns. They have the same names as in the layer definition, but with the idx_ prefix, and follow the type conversions defined below:
| Index type | Data Type |
|------------|-----------|
| bool       | Boolean   |
| int        | Long      |
| string     | String    |
| HERETile   | Long      |
| HERETime   | Long      |
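For example, assuming a layer whose index definition contains the attributes eventType (string) and tileId (HERETile), the resulting DataFrame exposes them as idx_eventType (String) and idx_tileId (Long). A minimal sketch:
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt

// Sketch assuming index attributes eventType (string) and tileId (HERETile);
// after type conversion they appear as idx_eventType: String and idx_tileId: Long.
val df = sparkSession
  .readLayer(catalogHrn, layerId)
  .query("eventType==SignRecognition")
  .load()
df.select("idx_eventType", "idx_tileId").show()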
Partition payload attribute columns
| Column name | Data Type | Meaning |
|-------------|-----------|---------|
| mt_metadata | Map[String, String] | Metadata of the partition |
| mt_timestamp | Long | Timestamp of creation (UTC) |
| mt_checksum | String | Checksum of the payload |
| mt_crc | String | CRC of the payload |
| mt_dataSize | Long | Size of the payload |
| mt_compressedDataSize | Long | Compressed size of the payload |
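These mt_ columns can be used like any other column. The following sketch assumes they are present on the DataFrame because the reader requests them via the olp.connector.metadata-columns option, as in the snippets in this guide:
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt

// Sketch: filter partitions by payload attributes; assumes the mt_ columns are
// exposed because metadata columns were requested on the reader.
val dfWithMeta = sparkSession
  .readLayer(catalogHrn, layerId)
  .option("olp.connector.metadata-columns", true)
  .load()
dfWithMeta
  .select("mt_timestamp", "mt_checksum", "mt_dataSize")
  .where("mt_dataSize > 1048576") // payloads larger than 1 MiB
  .show()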
Project Dependencies
If you want to create an application that uses the HERE platform Spark Connector to read data from an index layer, add the required dependencies to your project as described in the chapter Dependencies for Spark Connector.
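For orientation only, a corresponding sbt declaration might look like the sketch below; the group ID, artifact name, and version placeholder are assumptions here, so take the authoritative coordinates from that chapter:
// Hypothetical coordinates for illustration only; verify the group, artifact,
// and version in the chapter Dependencies for Spark Connector.
libraryDependencies += "com.here.platform.data.client" %% "spark-support" % "<version>"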
Read Parquet-Encoded Data
The following snippet demonstrates how to access a Parquet-encoded DataFrame from an index layer of a catalog. The Parquet schema is expected to be bundled with the data, so you don't need to specify the format explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import com.here.platform.pipeline.PipelineContext
import org.apache.spark.sql.SparkSession
val reader = sparkSession
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.metadata-columns", true)
.option("olp.connector.query-parallelism", 100)
if (compressed)
reader.option("olp.connector.data-decompression-timeout", 1200000)
val df = reader.load()
df.printSchema()
df.show()
val messagesWithAtLeastOneSignRecognition = df
.select("pathEvents.signRecognition")
.where("size(pathEvents.signRecognition) > 0")
val count = messagesWithAtLeastOneSignRecognition.count()
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
Dataset<Row> df =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.query-parallelism", 100)
.load();
long messagesWithAtLeastOneSignRecognitionCount =
df.select("pathEvents.signRecognition")
.where("size(pathEvents.signRecognition) > 0")
.count();
Read Avro-Encoded Data
The following snippet demonstrates how to access an Avro-encoded DataFrame from an index layer of a catalog. The Avro schema is expected to be bundled with the data, so you don't need to specify the format explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import com.here.platform.pipeline.PipelineContext
import org.apache.spark.sql.{DataFrame, SparkSession}
val reader = sparkSession
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.query-parallelism", 100)
if (compressed)
reader.option("olp.connector.data-decompression-timeout", 1200000)
val df: DataFrame = reader.load()
val messagesWithAtLeastOneSignRecognition = df
.select("pathEvents.signRecognition")
.where("size(pathEvents.signRecognition) > 0")
val count = messagesWithAtLeastOneSignRecognition.count()
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
Dataset<Row> df =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.query-parallelism", 100)
.load();
Dataset<Row> messagesWithAtLeastOneSignRecognition =
df.select("pathEvents.signRecognition").where("size(pathEvents.signRecognition) > 0");
long count = messagesWithAtLeastOneSignRecognition.count();
Read Protobuf-Encoded Data
The following snippet demonstrates how to access a Protobuf-encoded DataFrame from an index layer of a catalog. The Protobuf schema is expected to be referenced from the layer configuration, so you don't need to specify the format explicitly.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.size
val reader = sparkSession
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.metadata-columns", true)
.option("olp.connector.query-parallelism", 100)
if (compressed)
reader.option("olp.connector.data-decompression-timeout", 1200000)
val df = reader.load()
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
val messagesWithAtLeastOneSignRecognition = df
.select("mt_dataHandle", "message.pathEvents.signRecognition")
.where("size(message.pathEvents.signRecognition) > 0")
val dataHandle = messagesWithAtLeastOneSignRecognition
.select("mt_dataHandle")
.head()
.getString(0)
val count: Int = messagesWithAtLeastOneSignRecognition
.select(size($"signRecognition"))
.head()
.getInt(0)
import static org.apache.spark.sql.functions.size;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
Dataset<Row> df =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.metadata-columns", true)
.option("olp.connector.query-parallelism", 100)
.load();
Dataset<Row> messagesWithAtLeastOneSignRecognition =
df.select("mt_dataHandle", "message.pathEvents.signRecognition")
.where("size(message.pathEvents.signRecognition) > 0");
String dataHandle =
messagesWithAtLeastOneSignRecognition.select("mt_dataHandle").head().getString(0);
int count =
messagesWithAtLeastOneSignRecognition
.select(size(new Column("signRecognition")))
.head()
.getInt(0);
Note that to read Protobuf data from a layer, the schema must be specified in the layer configuration and must be available on the Artifact Service. Furthermore, the schema must have a ds variant. For more information on how to maintain schemas, see the Archetypes Developer's Guide.
Read CSV-Encoded Data
The following snippet demonstrates how to access a CSV-encoded DataFrame from an index layer of a catalog. In this example, the CSV row contains the column field1 as integer and the column field2 as string.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
val df = sparkSession
.readLayer(catalogHrn, layerId)
.query("eventType==SignRecognition")
.load()
df.select("idx_eventType", "field1").where("field1 > 0").show()
df.printSchema()
import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
Dataset<Row> dataFrame =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query("eventId=in=(1,2,3)")
.load();
dataFrame.select("idx_eventId", "field1").where("field1 > 0").show();
dataFrame.printSchema();
Read Text-Encoded Data
The following snippet demonstrates how to access a text-encoded DataFrame from an index layer of a catalog. In this example, each row contains the line's data as a string.
Note
Restrictions
When reading text data, each line becomes a row with a single string column named value. The text data source therefore provides only this one value column per row.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
val df = sparkSession
.readLayer(catalogHrn, layerId)
.query("eventType==SignRecognition")
.load()
df.select("idx_eventType", "value").show()
df.printSchema()
import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
Dataset<Row> dataFrame =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query("eventId=in=(1,2,3)")
.load();
dataFrame.select("idx_eventId", "value").show();
dataFrame.printSchema();
Read JSON-Encoded Data
The following snippet demonstrates how to access a JSON-encoded DataFrame from an index layer of a catalog. In this example, the JSON object contains the property intVal as integer and the property strVal as string.
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt
import org.apache.spark.sql.SparkSession
val df = sparkSession
.readLayer(catalogHrn, layerId)
.query("eventType==SignRecognition")
.load()
df.select("idx_eventType", "intVal").where("intVal > 0").show()
df.printSchema()
import static org.apache.spark.sql.functions.*;
import com.here.hrn.HRN;
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;
Dataset<Row> dataFrame =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.query("eventId=in=(1,2,3)")
.load();
dataFrame.select("idx_eventId", "intVal").where("intVal > 0").show();
dataFrame.printSchema();
Read Raw Data
The following snippet demonstrates how to access data in an arbitrary format from an index layer of a catalog:
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import scala.collection.JavaConverters._
import scala.collection.mutable
val schema: StructType = new StructType(
Array[StructField](
StructField("mt_dataHandle", DataTypes.StringType, nullable = false, Metadata.empty),
StructField("signRecognitionCount", DataTypes.IntegerType, nullable = false, Metadata.empty)
))
val reader = sparkSession
.readLayer(catalogHrn, layerId)
.format("raw")
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.metadata-columns", true)
.option("olp.connector.query-parallelism", 100)
if (compressed)
reader.option("olp.connector.data-decompression-timeout", 1200000)
val df: DataFrame = reader.load()
val dfSignRecognitionCount: DataFrame = df.flatMap { row: Row =>
val messageList: mutable.Buffer[SdiiMessage.Message] =
SdiiMessageList.MessageList.parseFrom(row.getAs[Array[Byte]]("data")).getMessageList.asScala
messageList.map { message =>
RowFactory.create(row.getAs[Object]("mt_dataHandle"),
message.getPathEvents.getSignRecognitionCount.asInstanceOf[Object])
}
}(RowEncoder(schema))
val messagesWithAtLeastOneSignRecognition = dfSignRecognitionCount
.select("mt_dataHandle", "signRecognitionCount")
.where("signRecognitionCount > 0")
val dataHandles = messagesWithAtLeastOneSignRecognition
.map[String]((r: Row) => r.getAs[String]("mt_dataHandle"))(Encoders.STRING)
.dropDuplicates()
.collectAsList()
val count = messagesWithAtLeastOneSignRecognition.count()
import com.here.platform.data.client.spark.javadsl.JavaLayerDataFrameReader;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
StructType schema =
new StructType(
new StructField[] {
new StructField("mt_dataHandle", DataTypes.StringType, false, Metadata.empty()),
new StructField(
"signRecognitionCount", DataTypes.IntegerType, false, Metadata.empty())
});
Dataset<Row> df =
JavaLayerDataFrameReader.create(sparkSession)
.readLayer(catalogHrn, layerId)
.format("raw")
.query(
"tileId=INBOUNDINGBOX=(23.648524, 22.689013, 62.284241, 60.218811) and eventType==SignRecognition")
.option("olp.connector.metadata-columns", true)
.option("olp.connector.query-parallelism", 100)
.load();
Dataset<Row> dfSignRecognitionCount =
df.flatMap(
(FlatMapFunction<Row, Row>)
row ->
SdiiMessageList.MessageList.parseFrom(row.<byte[]>getAs("data"))
.getMessageList().stream()
.map(
m ->
RowFactory.create(
row.getAs("mt_dataHandle"),
m.getPathEvents().getSignRecognitionCount()))
.iterator(),
RowEncoder.apply(schema));
Dataset<Row> messagesWithAtLeastOneSignRecognition =
dfSignRecognitionCount
.select("mt_dataHandle", "signRecognitionCount")
.where("signRecognitionCount > 0");
List<String> dataHandles =
messagesWithAtLeastOneSignRecognition
.map((MapFunction<Row, String>) row -> row.getAs("mt_dataHandle"), Encoders.STRING())
.dropDuplicates()
.collectAsList();
long count = messagesWithAtLeastOneSignRecognition.count();
Known issues:
- The DataFrame contains the columns representing the index layer structure definition, but these columns are located at the very end, whereas they should be located before the metadata columns.
- If these column values are not present, some defaults are used instead of NULL values.
- These columns should be located at the very end of the row, but they are located right after the payload columns.
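To check how these columns are ordered in your version of the connector, printing the schema is enough; a minimal sketch:
import com.here.platform.data.client.spark.LayerDataFrameReader.SparkSessionExt

// Sketch: inspect the actual ordering of data, idx_, and mt_ columns.
val df = sparkSession
  .readLayer(catalogHrn, layerId)
  .option("olp.connector.metadata-columns", true)
  .load()
df.printSchema()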
Note
- The raw format refers to application/octet-stream in the layer configuration and should not be confused with the raw layer configuration.
- For information on RSQL, see RSQL.