Hadoop FS Support

Hadoop FS Support is an implementation of the Apache Hadoop FileSystem interface. It opens several opportunities for you to work with the platform and other industry-standard big data tools while reducing the amount of custom code you need to write. A few examples include:

  • Read from and write to an object store layer using Hadoop FS Support in Spark (see the tutorial)
  • Bring your data into the HERE platform using an object store layer (see the tutorial)
  • Connect an object store layer to open-source processing and analytics tools outside the platform, such as Apache Spark, Apache Drill, Presto, AWS EMR, and others

Layer support

Hadoop FS Support is only available for the object store layer type.

Hadoop FileSystem interface support

The following Hadoop FileSystem methods are supported (see the usage sketch after this list):

  • String getScheme() returns the scheme, which is blobfs.
  • URI getUri() returns the URI, for example blobfs://hrn:here:data::olp-here:blobfs-test:test-data.
  • void initialize(URI name, Configuration conf) initializes the BlobFs FileSystem.
  • FSDataInputStream open(Path f, int bufferSize) provides an InputStream to read data from an object stored in the Object Store layer.
  • FileStatus getFileStatus(Path f) provides information about a file.
  • FileStatus[] listStatus(Path f) provides a list of files along with their respective information.
  • FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) provides an OutputStream to write data to. The parameters progress, permission, and replication are not implemented.
  • boolean rename(Path src, Path dst) renames the path from src to dst.
  • boolean delete(Path f, boolean recursive) deletes a path; if the recursive flag is set to true, all subdirectories and files are also deleted.
  • boolean mkdirs(Path f, FsPermission permission) creates directories for the given path. The parameter permission is not implemented.
  • void close() closes the file system.
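
These methods are reached through the standard Hadoop FileSystem API. The following is a minimal sketch, assuming the example catalog HRN and layer ID used throughout this page; the directory name is a placeholder:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Example HRN and layer ID from this page -- substitute your own.
val uri = new URI("blobfs://hrn:here:data::olp-here:blobfs-test:test-data")

val conf = new Configuration()
conf.set("fs.blobfs.impl", "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem")

val fs = FileSystem.get(uri, conf) // invokes initialize(uri, conf) on the BlobFs instance
val dir = new Path("/directory1")  // placeholder directory
fs.mkdirs(dir)                                      // create a directory
fs.listStatus(dir).foreach(s => println(s.getPath)) // list files with their information
fs.close()                                          // close the file system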

Usage

For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID test-data, the URL to be used in Hadoop/Spark/Drill is blobfs://hrn:here:data::olp-here:blobfs-test:test-data.

Spark

Your Spark application needs a dependency on the hadoop-fs-support package.

// Construct blobfs paths from the catalog HRN and layer ID.
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId = "test-data"

val sourcePath = s"blobfs://$catalogHrn:$layerId/source"
val destinationPath = s"blobfs://$catalogHrn:$layerId/destination"

// Read text data from the Object Store layer and write it back under another prefix.
val sourceRdd = sparkContext.textFile(sourcePath)
sourceRdd.saveAsTextFile(destinationPath)

Hadoop FS Shell

You can use the Hadoop File System Shell to explore the contents of the Object Store layer. The following Hadoop File System Shell operations are supported:

  • cat
  • copyFromLocal
  • copyToLocal
  • count
  • cp
  • ls
  • mkdir
  • moveFromLocal
  • moveToLocal
  • mv
  • put
  • rm
  • rmdir
  • rmr
  • test
  • text
  • touchz
  • truncate
  • usage

For example:

export HADOOP_CLASSPATH="hadoop-fs-support_2.12-${VERSION}-assembly.jar"
hadoop fs -mkdir blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -cp file.txt blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -ls blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1

Hadoop configurations

BlobFs supports the following Hadoop configurations:

  • fs.blobfs.multipart.part-upload-parallelism BlobFs uploads an object by splitting it into parts and using the multi-part upload functionality of Object Store. This configuration defines how many parts of a single object can be uploaded simultaneously. The minimum allowed parallelism is 1. The default value is 2. Upload speed can increase with higher parallelism, but higher parallelism also uses more memory, because each part being uploaded is buffered in memory.
  • fs.blobfs.multipart.part-size The size, in bytes, of each uploaded part of the object. The minimum allowed part size is 5242880 (5 MiB). The maximum allowed part size is 100663296 (96 MiB), which is also the default value. An example of setting both properties follows this list.
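
For example, in a Spark application these properties can be set on the Hadoop configuration. This is a minimal sketch; the property names come from this page, while the application name and the parallelism and part-size values are only illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("blobfs-tuning-example") // hypothetical application name
  .getOrCreate()

// Tune BlobFs multi-part uploads: 4 concurrent parts of 32 MiB each.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setInt("fs.blobfs.multipart.part-upload-parallelism", 4)
hadoopConf.setLong("fs.blobfs.multipart.part-size", 33554432) // 32 MiB, within the 5-96 MiB bounds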

Authentication

For instructions on how to set up HERE credentials, see Get Your Credentials.

Alternatively, HERE credentials can be passed as Hadoop configuration properties (see the example after this list):

  • fs.blobfs.accessKeyId HERE access key ID. This configuration is the same as here.access.key.id in the credentials.properties file.
  • fs.blobfs.accessClientId HERE client ID. This configuration is the same as here.client.id in the credentials.properties file.
  • fs.blobfs.accessKeySecret HERE access key secret. This configuration is the same as here.access.key.secret in the credentials.properties file.
  • fs.blobfs.accessEndpointUrl HERE token endpoint URL. This configuration is the same as here.token.endpoint.url in the credentials.properties file.
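
For example, assuming the Spark session from the earlier sketch, the credentials can be set programmatically before the file system is used. The values below are placeholders; copy the real values from your credentials.properties file and avoid hard-coding secrets in production code:

// Placeholder credential values -- replace with the corresponding
// entries from your credentials.properties file.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.blobfs.accessKeyId", "<here.access.key.id>")
hadoopConf.set("fs.blobfs.accessClientId", "<here.client.id>")
hadoopConf.set("fs.blobfs.accessKeySecret", "<here.access.key.secret>")
hadoopConf.set("fs.blobfs.accessEndpointUrl", "<here.token.endpoint.url>")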

Configuring Hadoop installations

EMR

To run on EMR, you need to create an EMR cluster with an additional --configurations parameter, for example:

aws emr create-cluster \
  --name "$cluster_name" \
  --release-label "emr-5.17.0" \
  --applications Name="Spark" \
  --region ${region} \
  --log-uri ${log_location} \
  --instance-type "m4.large" \
  --instance-count 4 \
  --service-role "EMR_DefaultRole" \
  --ec2-attributes KeyName="some-key",InstanceProfile="EMR_EC2_DefaultRole",AdditionalMasterSecurityGroups="sg-xxxx",AdditionalSlaveSecurityGroups="sg-xxxxx",SubnetId="subnet-xxxx" \
  --configurations file://emr-config.json

The content of the emr-config.json file is the following:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.blobfs.impl": "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem"
    }
  }
]

Standalone Hadoop installations

  • The BlobFs fat jar needs to be included on the Hadoop classpath.
  • In some cases, the following property needs to be added to the core-site.xml file:
  <property>
    <name>fs.blobfs.impl</name>
    <value>com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem</value>
  </property>

Drill

The BlobFs fat jar needs to be included on the Hadoop classpath.

Notes

Object Store is not a true file system. It is a distributed key-value store, and as such it does not behave exactly like a file system. A file system expects operations such as delete or rename to be atomic; in Object Store, these operations complete only eventually. A file system also expects that, while a file is being read or written, its content is not changed and the file is not deleted; Object Store does not guarantee this behavior. You need to take precautions against these behavioral differences yourself.

Hadoop FS Support implements the Hadoop FileSystem interface of version 2.7.3. It is highly likely that BlobFs will work with Hadoop versions up to 2.9.x, but this is not guaranteed. Hadoop version 3.x.x is not yet supported.

Compatibility issues with HRNs. BlobFs requires the HRN as the authority of the URI. Some tools create incorrect URIs for HRNs. To work around such cases, you can pass the hex-encoded HRN as the authority of the URI instead. For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID test-data, this looks as follows:

import org.apache.commons.codec.binary.Hex

val hrnHex = Hex.encodeHexString("hrn:here:data::olp-here:blobfs-test:test-data".getBytes)
val blobFsUri = s"blobfs://$hrnHex" // note the s interpolator

Hadoop FS Support does not support Apache Flink.
