In the local Spark approach, all Spark jobs are executed on the user's machine.
The Sparkmagic extension uses Livy to execute all user code. Livy uses the local installation of Spark on the user's machine.
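By default, Sparkmagic targets a Livy server at http://localhost:8998. If you ever need to confirm or change that endpoint, it is read from Sparkmagic's configuration file; the following is a minimal check, assuming the common ~/.sparkmagic/config.json location (the file may be absent if you rely on the built-in defaults):
# Show the Livy endpoints configured for the Sparkmagic kernels
# (skip this check if the file does not exist and you use the defaults)
grep -A 4 "credentials" ~/.sparkmagic/config.json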
You must install the following software tools in the specified versions:
Follow our example installation steps described in the Installation Steps for Spark Tools guide.
Hadoop, Spark, and Livy must be able to communicate with each other. Use the following configuration:
These configuration steps assume the following installation folders:
Specify the Hadoop installation folder by adding the following environment variables to the
~/.bashrc file if you are using Linux, or
~/.bash_profile for MacOS:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Load the environment variables. For Linux, execute:
source ~/.bashrc
For MacOS, execute:
source ~/.bash_profile
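As an optional sanity check (a sketch; the exact output depends on your Hadoop version), confirm that the variables are loaded and that Hadoop resolves from the configured folder:
# Confirm the Hadoop environment variables are set
echo "$HADOOP_HOME"
echo "$HADOOP_CONF_DIR"
# Print the version of the Hadoop installation referenced above
"$HADOOP_HOME/bin/hadoop" version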
Download the following dependencies:
curl -o "/tmp/scala-java8-compat_2.11-0.8.0.jar" \ "https://repo1.maven.org/maven2/org/scala-lang/modules/scala-java8-compat_2.11/0.8.0/scala-java8-compat_2.11-0.8.0.jar" curl -o "/tmp/json4s-native_2.11-3.5.3.jar" \ "https://repo1.maven.org/maven2/org/json4s/json4s-native_2.11/3.5.3/json4s-native_2.11-3.5.3.jar" curl -o "/tmp/protobuf-java-3.10.0.jar" \ "https://repo1.maven.org/maven2/com/google/protobuf/protobuf-java/3.10.0/protobuf-java-3.10.0.jar"
Copy them into the Spark installation folder:
sudo cp "/tmp/scala-java8-compat_2.11-0.8.0.jar" "/usr/local/spark/jars/" sudo cp "/tmp/json4s-native_2.11-3.5.3.jar" "/usr/local/spark/jars/" sudo cp "/tmp/protobuf-java-3.10.0.jar" "/usr/local/spark/jars/"
Specify the Spark installation folder by adding the following environment variables to the
~/.bashrc file if you are using Linux, or
~/.bash_profile for MacOS:
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Load the environment variables. For Linux, execute:
source ~/.bashrc
For MacOS, execute:
source ~/.bash_profile
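To confirm that the Spark variables are loaded and the Spark binaries resolve through the updated PATH, you can run a quick check (output varies with your Spark version):
# Confirm SPARK_HOME is set and spark-submit is reachable
echo "$SPARK_HOME"
spark-submit --version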
Locate the folder where the olp-sdk-for-python-1.11-env conda environment is installed by running the command conda env list:
$ conda env list
# conda environments:
#
base                             /home/user/miniconda3
olp-sdk-for-python-1.11-env  *   /home/user/miniconda3/envs/olp-sdk-for-python-1.11-env
In the above example, the folder of the environment is
/home/user/miniconda3/envs/olp-sdk-for-python-1.11-env. You must configure Spark to point to the Python binary located inside this folder. To do so, create the Spark environment file:
cd /usr/local/spark/conf
sudo cp spark-env.sh.template spark-env.sh
sudo vi spark-env.sh
Add the following environment variables, consisting of the actual location of the olp-sdk-for-python-1.11-env conda environment plus the suffix
/bin/python3, to the previously created spark-env.sh file:
export PYSPARK_PYTHON=/home/user/miniconda3/envs/olp-sdk-for-python-1.11-env/bin/python3
export PYSPARK3_PYTHON=/home/user/miniconda3/envs/olp-sdk-for-python-1.11-env/bin/python3
In the above example, the environment path was
/home/user/miniconda3/envs/olp-sdk-for-python-1.11-env; verify your own environment path before adding the corresponding values.
Also note that we add the suffix /bin/python3 to the previous path to point to the right python binary.
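If you prefer not to edit spark-env.sh interactively, the same two lines can be appended from the shell. The following is a sketch that also looks up the environment path for you; the environment name and the /usr/local/spark path are the ones assumed in this guide, so adjust them if yours differ:
# Capture the conda environment path shown by `conda env list`
OLP_ENV_PATH=$(conda env list | awk '/olp-sdk-for-python-1.11-env/ {print $NF}')
# Append the two Python variables to the Spark environment file
sudo tee -a /usr/local/spark/conf/spark-env.sh > /dev/null <<EOF
export PYSPARK_PYTHON=${OLP_ENV_PATH}/bin/python3
export PYSPARK3_PYTHON=${OLP_ENV_PATH}/bin/python3
EOF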
Configure Livy connection timeout:
echo "livy.rsc.server.connect.timeout: 1800s" > ~/livy/conf/livy-client.conf cp ~/livy/conf/log4j.properties.template ~/livy/conf/log4j.properties
Everything you need to execute Spark jobs using the Sparkmagic extension is now configured.
You can start the Livy server using this command (assuming your Livy installation is in ~/livy):
~/livy/bin/livy-server start
The Livy server runs by default on
localhost:8998. You can stop it by running:
~/livy/bin/livy-server stop
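Once the server is up, a quick way to confirm that it is reachable is to query its REST API (a sketch; a fresh server should return an empty session list):
# Query the Livy REST API; expect JSON such as {"from":0,"total":0,"sessions":[]}
curl -s http://localhost:8998/sessions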
The tutorial notebooks for Spark are located in the folder:
$HOME/olp-sdk-for-python-1.11/tutorial-notebooks/spark
You can start with the Getting Started notebook located at
$HOME/olp-sdk-for-python-1.11/tutorial-notebooks/GettingStarted.ipynb to get an overview of all tutorial notebooks.
The following steps check that your local Spark environment is properly configured.
Start the Livy server using this command (assuming that your Livy installation is in ~/livy):
~/livy/bin/livy-server start
Activate the SDK conda environment:
conda activate olp-sdk-for-python-1.11-env
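Optionally confirm that the environment's Python is the one that will now be picked up:
# The reported path should point inside the olp-sdk-for-python-1.11-env environment
which python3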
Go to your home directory and start Jupyter:
cd ~/
jupyter notebook --NotebookApp.iopub_data_rate_limit=1000000000 --ip=0.0.0.0
Open the tutorial notebook
$HOME/olp-sdk-for-python-1.11/tutorial-notebooks/spark/spark_ProcessDataLocally_scala.ipynb and execute all of its cells.
If all the cells run successfully, your local Spark environment is properly configured.
Help us improve our setup experience: after you are finished setting up the SDK, please fill out this short 1-minute survey. Complete survey