Build Your Own Docker Image
The Data SDK for Python can also be installed using a Dockerfile, a text document that contains all the commands you would otherwise run on the command line to assemble an image. Using docker build, you can create an automated build that executes these command-line instructions in succession.
Prerequisites
Setup Files
To begin, download the Docker files archive (docker-files.zip), unzip it, and open a terminal in the extracted folder:
For Linux/MacOS:
unzip docker-files.zip
cd docker-files/
For Windows (after extracting docker-files.zip, for example via File Explorer):
cd docker-files\
Note
This software has Open Source Software dependencies, which will be downloaded and installed when you execute the installation commands. For more information, see the Dockerfile, which is part of the zip file.
Copy the credential files (credentials.properties, hls_credentials.properties, and settings.xml) into the current directory:
For Linux/MacOS:
cp ~/.here/credentials.properties .
cp ~/.here/hls_credentials.properties .
cp ~/.m2/settings.xml .
For Windows:
copy %USERPROFILE%\.here\credentials.properties .
copy %USERPROFILE%\.here\hls_credentials.properties .
copy %USERPROFILE%\.m2\settings.xml .
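Before building, you can optionally verify that all three files are present in the current directory (Linux/MacOS shown):
# All three files should be listed before you build the image
ls -l credentials.properties hls_credentials.properties settings.xml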
Update Configuration File
The Sparkmagic configuration file (spark-conf-files.zip) includes Data SDK JARs for version 2.11.7. The latest version of the Data SDK JARs can be identified via the link in the Include BOMs sub-section. To obtain the latest Data SDK JARs, execute the config_file_updater.py script with the following command:
python config_file_updater.py --version <version_to_upgrade_to>
Note
This script requires Python 3.7+ on your local machine.
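For example, if the latest version listed in the Include BOMs sub-section were 2.12.0 (a hypothetical value; substitute the version you actually find there), the command would be:
python config_file_updater.py --version 2.12.0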
Build Image
Build the Docker image:
docker build -t olp-sdk-for-python-1.12 --rm .
Note
The default Docker image name is olp-sdk-for-python-1.12. If you want to change it when creating or updating an image, specify the name in the command:
```bash
docker build -t <yourimagename> --rm .
```
Execute Image
To execute the image in a container:
docker run -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12
Note
- Once you exit or restart the container, all your changes are lost. The most common way to retain changes is to use a Docker volume mount to mount a host directory into your container (see the combined example after this list):
docker run -v <host_src>:<container_directory_to_mount> -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12
- For example, to retain the Ivy cache JARs for local Spark:
For Linux:
docker run -v ~/.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12
For Windows:
docker run -v %USERPROFILE%\.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12
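As a combined illustration, the following mounts both a host working folder for notebooks and the Ivy cache; the ~/olp-notebooks host path and the /home/here/notebooks container path are assumptions, so adjust them to your setup:
docker run -v ~/olp-notebooks:/home/here/notebooks -v ~/.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12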
Tutorial Notebooks
Open the Jupyter URL from the terminal output in your browser and execute the sample notebooks.
The tutorial notebooks included with the SDK are located in the folder $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/python.
We recommend reading the Getting Started notebook to get an overview of all of the tutorial notebooks:
$HOME/olp-sdk-for-python-1.12/tutorial-notebooks/GettingStarted.ipynb
API Reference
Explore the Data SDK for Python API reference by opening the HTML docs located at $HOME/olp-sdk-for-python-1.12/documentation/Data SDK for Python API Reference.html.
Note
We recommend opening this documentation directly in Chrome or Firefox rather than in Jupyter or Internet Explorer.
Customized Execution of Image
To execute the image in a container, use this command:
docker run -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12 /bin/bash
Note
- Once you exit or restart the container, all your changes are lost. The most common way to retain changes is to use a Docker volume mount to mount a host directory into your container:
docker run -v <host_src>:<container_directory_to_mount> -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12 /bin/bash
- For example, to retain the Ivy cache JARs for local Spark:
For Linux:
docker run -v ~/.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12 /bin/bash
For Windows:
docker run -v %USERPROFILE%\.ivy2:/home/here/.ivy2 -p 8080:8080 -p 8998:8998 -it olp-sdk-for-python-1.12 /bin/bash
Activate the conda environment:
source activate olp-sdk-for-python-1.12-env
Navigate to the home directory and start Jupyter:
cd ~/
jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0 --port=8080
JupyterLab
If you work with the JupyterLab "desktop" instead of the "classic" Jupyter notebooks, use this command to start Jupyter:
cd ~/
jupyter lab --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0 --port=8080
With JupyterLab, you will benefit from installing a few additional JupyterLab extensions. These render files in frequently used formats (e.g. HTML or GeoJSON) or computed output (such as Leaflet map cells) directly inside JupyterLab:
jupyter labextension install @mflevine/jupyterlab_html
jupyter labextension install @jupyterlab/geojson-extension
jupyter labextension install jupyter-leaflet
jupyter labextension install @jupyter-widgets/jupyterlab-manager
You might also be able to install these inside JupyterLab using its interactive Extension Manager.
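To verify which extensions are installed and enabled, you can list them from a terminal inside the container:
jupyter labextension list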
Docker with Spark (DEPRECATED)
You can start the Livy server using the following command:
~/livy/bin/livy-server start
The Livy server runs by default on localhost:8998. You can stop it by running the following:
~/livy/bin/livy-server stop
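To confirm the Livy server is running, you can query its REST API from inside the container (assuming curl is available in the image):
# Returns a JSON list of active sessions, for example {"from":0,"total":0,"sessions":[]}
curl http://localhost:8998/sessions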
Tutorial Notebooks
The tutorial notebooks for Spark are located in the folder $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/spark.
EMR Spark Cluster (DEPRECATED)
Edit the emr.env file, providing your AWS credentials and your HERE Maven repository credentials:
vi ~/.here/emr/emr.env
export DEFAULT_AWS_ACCESS_KEY="your AWS access key"
export DEFAULT_AWS_ACCESS_KEY_SECRET="your AWS access key secret"
export DEFAULT_HERE_USER="your HERE maven repository user"
export DEFAULT_HERE_PASSWORD="your HERE maven repository password"
export DEFAULT_EMR_CORES="2"
export DEFAULT_EMR_VERSION="emr-5.24.0"
export DEFAULT_EMR_MASTER_TYPE="m4.large"
export DEFAULT_EMR_WORKER_TYPE="m4.2xlarge"
export DEFAULT_TAG_TEAM="My Team"
export DEFAULT_TAG_PROJECT="My Project"
export DEFAULT_TAG_OWNER="Me"
export DEFAULT_TAG_ENV="PoC"
export DEFAULT_AWS_REGION="us-east-2"
Provision the EMR cluster:
emr-provision -ns <custom-single-word>
Note
<custom-single-word> is a suffix added to AWS resource names to avoid collisions. It should contain only alphanumeric characters and hyphens.
- Make sure to deprovision the cluster before exiting the Docker container to prevent being charged for unused infrastructure.
- Once you exit the container, its state is lost, so it won't be possible to deprovision after you exit and rerun. If you run into issues, you can delete the AWS resources using the AWS console.
After successful provisioning, you should see a message similar to:
Apply complete! Resources: 20 added, 0 changed, 0 destroyed.
Outputs:
emr_master_public_dns = ec2-3-16-25-189.us-east-2.compute.amazonaws.com
Environment up and running, fully operational!
Access your Livy session list here:
>> http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8998
Access the YARN Resource Manager here:
>> http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8088
You can use this bucket to upload and process data
>> s3://spark-emrlab-bucket-lab
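For example, you could upload input data to that bucket with the AWS CLI (assuming the AWS CLI is installed and configured with suitable credentials; the local file name and object key are hypothetical):
aws s3 cp my-input-data.csv s3://spark-emrlab-bucket-lab/input/my-input-data.csv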
Within Jupyter, create a notebook, then select one of the Python 3 kernels and add the following cells:
Cell 1
%load_ext sparkmagic.magics
Cell 2
%%spark config
{
"driverMemory": "2G",
"executorMemory": "4G",
"executorCores": 2,
"conf": {
"spark.scheduler.mode": "FAIR",
"spark.executor.instances": 2,
"spark.dynamicAllocation.enabled": "true",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.executorIdleTimeout": "60s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "60s",
"spark.dynamicAllocation.minExecutors": 1,
"spark.dynamicAllocation.maxExecutors": 4,
"spark.dynamicAllocation.initialExecutors": 1,
"spark.jars.ivySettings": "/var/lib/spark/.here/ivy.settings.xml",
"spark.driver.userClassPathFirst": "false",
"spark.executor.userClassPathFirst": "false",
"spark.jars.packages": "com.here.olp.util:mapquad:4.0.13,com.here.platform.location:location-compilation-core_2.11:0.11.156,com.here.platform.location:location-core_2.11:0.11.156,com.here.platform.location:location-inmemory_2.11:0.11.156,com.here.platform.location:location-integration-here-commons_2.11:0.11.156,com.here.platform.location:location-integration-optimized-map_2.11:0.11.156,com.here.platform.location:location-data-loader-standalone_2.11:0.11.156,com.here.platform.location:location-spark_2.11:0.11.156,com.here.platform.location:location-compilation-here-map-content_2.11:0.11.156,com.here.platform.location:location-examples-utils_2.11:0.4.115,com.here.schema.sdii:sdii_archive_v1_java:1.0.0-20171005-1,com.here.sdii:sdii_message_v3_java:3.3.2,com.here.schema.rib:lane-attributes_v2_scala:2.8.0,com.here.schema.rib:road-traffic-pattern-attributes_v2_scala:2.8.0,com.here.schema.rib:advanced-navigation-attributes_v2_scala:2.8.0,com.here.schema.rib:cartography_v2_scala:2.8.0,com.here.schema.rib:adas-attributes_v2_scala:2.8.0,com.typesafe.akka:akka-actor_2.11:2.5.11,com.beachape:enumeratum_2.11:1.5.13,com.github.ben-manes.caffeine:caffeine:2.6.2,com.github.cb372:scalacache-caffeine_2.11:0.24.3,com.github.cb372:scalacache-core_2.11:0.24.3,com.github.os72:protoc-jar:3.6.0,com.google.protobuf:protobuf-java:3.6.1,com.here.platform.data.client:blobstore-client_2.11:0.1.833,com.here.platform.data.client:spark-support_2.11:0.1.833,com.iheart:ficus_2.11:1.4.3,com.typesafe:config:1.3.3,org.apache.logging.log4j:log4j-api-scala_2.11:11.0,org.typelevel:cats-core_2.11:1.4.0,org.typelevel:cats-kernel_2.11:1.4.0,org.apache.logging.log4j:log4j-api:2.8.2,com.here.platform.data.client:data-client_2.11:0.1.833,com.here.platform.data.client:client-core_2.11:0.1.833,com.here.platform.data.client:hrn_2.11:0.1.614,com.here.platform.data.client:data-engine_2.11:0.1.833,com.here.platform.data.client:blobstore-client_2.11:0.1.833,com.here.account:here-oauth-client:0.4.14,com.here.platform.analytics:spark-ds-connector-deps_2.11:0.6.15,com.here.platform.analytics:spark-ds-connector_2.11:0.6.15",
"spark.jars.excludes": "com.here.*:*_proto,org.json4s:*,org.apache.spark:spark-core_2.11,org.apache.spark:spark-sql_2.11,org.apache.spark:spark-streaming_2.11,org.apache.spark:spark-launcher_2.11,org.apache.spark:spark-network-shuffle_2.11,org.apache.spark:spark-unsafe_2.11,org.apache.spark:spark-network-common_2.11,org.apache.spark:spark-tags_2.11,org.scala-lang:scala-library,org.scala-lang:scala-compiler,org.scala-lang.modules:scala-parser-combinators_2.11,org.scala-lang.modules:scala-java8-compat_2.11,org.scala-lang:scala-reflect,org.scala-lang:scalap,com.fasterxml.jackson.core:jackson-*"
}
}
Cell 3
%spark add -s scala-spark -l scala -u <PUT YOUR LIVY ENDPOINT HERE> -k
%spark add -s pyspark -l python -u <PUT YOUR LIVY ENDPOINT HERE> -k
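For example, with the sample Livy endpoint from the provisioning output above (your endpoint will differ):
%spark add -s scala-spark -l scala -u http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8998 -k
%spark add -s pyspark -l python -u http://ec2-3-16-25-189.us-east-2.compute.amazonaws.com:8998 -k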
Note
On EMR, you must explicitly provide credentials in the notebook to read platform data. You will need your access key ID and access key secret to submit your job.
Use Your Credentials
Scala
%%spark
val accessKeyId = "<Your Access Key ID>"
val accessKeySecret = "<Your Access Key Secret>"
val layerHRN = "<Some Layer HRN>"
val df = spark.read.option("partitions", 900)
  .option("parallelism", 4)
  .option("accesskeyid", accessKeyId)
  .option("accesskeysecret", accessKeySecret)
  .ds(layerHRN)
PySpark
%%spark
accessKeyId = "<Your Access Key ID>"
accessKeySecret = "<Your Access Key Secret>"
layerHRN = "<Some Layer HRN>"
df = spark.read.format("com.here.platform.analytics.ds") \
    .option("partitions", 900) \
    .option("parallelism", 4) \
    .option("accesskeyid", accessKeyId) \
    .option("accesskeysecret", accessKeySecret) \
    .option("layerhrn", layerHRN) \
    .load()
Start coding your job!
After finishing your job, destroy the cluster to prevent getting charged for unused infrastructure.
emr-deprovision
Deep Debugging
By default, internet access is restricted to the Livy and YARN Resource Manager endpoints only. If you want to explore the cluster logs and access the internal node machines, you need to open an SSH tunnel and connect through it. When you deploy a new cluster, we create a script for you to open the SSH tunnel:
$ cd ~/.here/emr
$ ./emr-tunnel.sh
Next, install the FoxyProxy extension in your web browser.
Then, depending on your web browser, load the FoxyProxy configuration that we provide at these file paths:
- For Chrome:
~/anaconda3/envs/<your_env>/lib/olp-emr/util/foxy-proxy-chrome.xml
- For Firefox:
~/anaconda3/envs/<your_env>/lib/olp-emr/util/foxy-proxy-firefox.json
Finally, you can activate FoxyProxy for all URLs or based on URL patterns (see the FoxyProxy documentation for instructions). You will then be able to access internal machine endpoints via your web browser.
Tutorial Notebooks
The tutorial notebooks for EMR are located in the folder $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/emr.
Thank you for choosing the HERE Data SDK for Python. After the setup, kindly consider filling out this short 1-minute survey to help us improve the setup experience.