Local Spark (Linux/MacOS)

In the local Spark approach, all Spark jobs are executed on the user's machine.

The Sparkmagic extension uses Livy to execute all user code. Livy uses the local installation of Spark on the user's machine.

Software Requirements

You must install the following software tools in the specified versions:

  • Java 8+
  • Hadoop 2.7.3
  • Spark 2.4.0
  • Livy 0.5.0-incubating

Follow our example installation steps described in the Installation Steps for Spark Tools guide.
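
Once the tools are installed, you can quickly verify the versions from a terminal. The commands below are only a sketch and assume the installation folders used later in this guide (/usr/local/hadoop, /usr/local/spark, and $HOME/livy); adjust the paths if your installation differs:

# Verify the installed versions (paths assume the installation folders used in this guide)
java -version                                   # expect Java 8 or newer
/usr/local/hadoop/bin/hadoop version            # expect Hadoop 2.7.3
/usr/local/spark/bin/spark-submit --version     # expect Spark 2.4.0
ls $HOME/livy                                   # expect the Livy 0.5.0-incubating distribution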

Configuration

Hadoop, Spark, and Livy must be able to communicate with each other. Use the following configuration:

Caution

  • These configuration steps assume the following installation folders:

    • Hadoop: /usr/local/hadoop
    • Spark: /usr/local/spark
    • Livy: $HOME/livy

Hadoop

Specify the Hadoop installation folder by adding the following environment variables to the ~/.bashrc file if you are using Linux, or ~/.bash_profile for MacOS:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Load the environment variables. For Linux, execute:

source ~/.bashrc

For MacOS execute:

source ~/.bash_profile
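
To confirm that the variables are visible in your current shell, you can run a quick check such as the one below (the expected values assume the installation folders listed above):

echo $HADOOP_HOME        # expect /usr/local/hadoop
echo $HADOOP_CONF_DIR    # expect /usr/local/hadoop/etc/hadoop
$HADOOP_HOME/bin/hadoop version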

Spark

Download the following dependencies:

curl -o "/tmp/scala-java8-compat_2.11-0.8.0.jar" \
    "https://repo1.maven.org/maven2/org/scala-lang/modules/scala-java8-compat_2.11/0.8.0/scala-java8-compat_2.11-0.8.0.jar"

curl -o "/tmp/json4s-native_2.11-3.5.3.jar" \
    "https://repo1.maven.org/maven2/org/json4s/json4s-native_2.11/3.5.3/json4s-native_2.11-3.5.3.jar"

curl -o "/tmp/protobuf-java-3.10.0.jar" \
    "https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.10.0/protobuf-java-3.10.0.jar"

Copy them inside the Spark installation folder:

sudo cp "/tmp/scala-java8-compat_2.11-0.8.0.jar" "/usr/local/spark/jars/"
sudo cp "/tmp/json4s-native_2.11-3.5.3.jar" "/usr/local/spark/jars/"
sudo cp "/tmp/protobuf-java-3.10.0.jar" "/usr/local/spark/jars/"

Specify the Spark installation folder by adding the following environment variables to the ~/.bashrc file if you are using Linux, or ~/.bash_profile for MacOS:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

Load the environment variables. For Linux, execute:

source ~/.bashrc

For MacOS, execute:

source ~/.bash_profile
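
As a quick sanity check, the variables should now resolve in your current shell and spark-submit should be found on the PATH:

echo $SPARK_HOME          # expect /usr/local/spark
spark-submit --version    # expect Spark 2.4.0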

Locate the folder where the olp-sdk-for-python-1.12-env conda environment is installed by running the command conda env list:

$ conda env list
# conda environments:
#
base                               /home/user/miniconda3
olp-sdk-for-python-1.12-env    *  /home/user/miniconda3/envs/olp-sdk-for-python-1.12-env

In the example above, the environment folder is /home/user/miniconda3/envs/olp-sdk-for-python-1.12-env. Spark must be configured to point to the Python binary located inside this folder. To do this, create the Spark environment file:

cd /usr/local/spark/conf
sudo cp spark-env.sh.template spark-env.sh
sudo vi spark-env.sh

Add the following environment variables to the spark-env.sh file you just created, using the actual location of the olp-sdk-for-python-1.12-env conda environment plus the suffix /bin/python3:

export PYSPARK_PYTHON=/home/user/miniconda3/envs/olp-sdk-for-python-1.12-env/bin/python3
export PYSPARK3_PYTHON=/home/user/miniconda3/envs/olp-sdk-for-python-1.12-env/bin/python3

Note

  • In the example above, the environment path is /home/user/miniconda3/envs/olp-sdk-for-python-1.12-env. Verify your own environment path before adding the corresponding values; a sketch that resolves the path automatically follows this note.

  • Also note that the suffix /bin/python3 is appended to this path so that Spark uses the correct Python binary.
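
If you prefer not to copy the path by hand, the following sketch resolves the environment folder automatically and checks that the Python binary exists. The environment name olp-sdk-for-python-1.12-env is taken from this guide, so adjust it if yours differs:

# Resolve the conda environment path and confirm the Python binary exists
PY_ENV_PATH=$(conda env list | grep "olp-sdk-for-python-1.12-env" | awk '{print $NF}')
echo $PY_ENV_PATH/bin/python3     # use this value for PYSPARK_PYTHON and PYSPARK3_PYTHON
ls $PY_ENV_PATH/bin/python3       # should exist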

Livy

Configure the Livy connection timeout and copy the default logging configuration:

echo "livy.rsc.server.connect.timeout: 1800s" > ~/livy/conf/livy-client.conf
cp ~/livy/conf/log4j.properties.template ~/livy/conf/log4j.properties
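
You can verify the resulting configuration with a quick look at the files (the ~/livy path is the installation folder assumed in this guide):

cat ~/livy/conf/livy-client.conf     # should contain the timeout line added above
ls ~/livy/conf/log4j.properties      # should exist after the copy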

Caution

  • For these steps to work properly, you must not be connected to the HERE VPN, due to a known issue that misconfigures the Livy server with a wrong URL. Verify that you are not connected to the HERE VPN before starting the Livy server.

Everything you need to execute Spark jobs using the Sparkmagic extension is now configured.

You can start the Livy server using this command:

~/livy/bin/livy-server start

The Livy server runs by default on localhost:8998. You can stop it by running:

~/livy/bin/livy-server stop
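
To confirm that the server is actually listening, you can query its REST API on the default port; the response shown in the comment is only indicative and assumes no sessions have been created yet:

curl http://localhost:8998/sessions
# expect a JSON response similar to: {"from":0,"total":0,"sessions":[]}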

Tutorial Notebooks

The tutorial notebooks for Spark are located in the folder: $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/spark.

You can start with the Getting Started notebook located at $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/GettingStarted.ipynb to get an overview of all tutorial notebooks.

Setup Validation

The following steps check that your local Spark environment is properly configured.

Start Jupyter and Livy services

Start the Livy server using this command (assuming that your Livy installation is in ~/livy/):

~/livy/bin/livy-server start

Activate the SDK conda environment:

conda activate olp-sdk-for-python-1.12-env

Go to your home directory and start Jupyter:

cd ~/
jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0

Execute the Health Check notebook

Open the tutorial notebook $HOME/olp-sdk-for-python-1.12/tutorial-notebooks/spark/spark_ProcessDataLocally_scala.ipynb and execute all its paragraphs.

If all the paragraphs run successfully, your local Spark environment is properly configured.
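
As an optional, command-line alternative to the notebook, you can run a minimal smoke test directly against the Livy REST API. This is only a sketch: it assumes the default port 8998 and that the session created below receives id 0 (the first session on a fresh server):

# Create an interactive Spark session through Livy
curl -s -X POST -H "Content-Type: application/json" \
    -d '{"kind": "spark"}' http://localhost:8998/sessions

# Poll until the session state is "idle", then submit a trivial statement
curl -s http://localhost:8998/sessions/0
curl -s -X POST -H "Content-Type: application/json" \
    -d '{"code": "1 + 1"}' http://localhost:8998/sessions/0/statements

# Check the statement result and clean up the session
curl -s http://localhost:8998/sessions/0/statements/0
curl -s -X DELETE http://localhost:8998/sessions/0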


Thank you for choosing the HERE Data SDK for Python. After the setup, please consider filling out this short 1-minute survey to help us improve the setup experience.

