spark-app
Follow the instructions for your environment.
We will need SBT to build our Spark application.
To see if SBT is installed, try this:
$ sbt -help
If SBT is not installed, no worries: it is easy to install with the following steps.
$ cd
# if this following URL doesn't work, get the most recent
# download link from : http://www.scala-sbt.org/
$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.11/sbt-0.13.11.tgz
$ tar xvf sbt-0.13.11.tgz
$ ~/sbt/bin/sbt -help
The first time you run SBT, it will download a bunch of dependencies. This will take a couple of minutes, so go get some coffee! :-)
The folder hadoop-spark/spark-app contains the Spark application, set up as an SBT project.
Inspect the file build.sbt at the top level of the project.
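For reference, here is a minimal sketch of what a build.sbt for a project like this could look like. This is an assumption, not the actual file contents: the artifact name, version, and Scala version are inferred from the jar path target/scala-2.10/testapp_2.10-1.0.jar used below, and the Spark version is a guess.
// Sketch of a minimal build.sbt (assumed contents; compare with the real file)
name := "testapp"

version := "1.0"

scalaVersion := "2.10.5"

// Spark is marked "provided" because spark-submit supplies it at runtime;
// the version 1.6.1 is an assumption
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)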
Inspect the code in src/main/scala/x/Clickstream.scala.
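If you want the shape of the application before opening the file, here is a hedged sketch of what a clickstream app like this could look like; it is an assumption, not the actual contents of Clickstream.scala. It takes the input location as its first argument, loads the JSON clickstream, and prints a simple aggregate (the column name "domain" is an assumption about the data).
package x

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only; the real Clickstream.scala may differ
object Clickstream {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Clickstream"))
    val sqlContext = new SQLContext(sc)

    // args(0) is the input location (a local file:// path or an HDFS path)
    val clicks = sqlContext.read.json(args(0))

    // Example aggregate: number of clicks per domain
    clicks.groupBy("domain").count().show()

    sc.stop()
  }
}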
From the spark-app folder:
# go to application dir
$ cd ~/hadoop-spark/spark-app
$ ~/sbt/bin/sbt package
# if running for the first time, go get some coffee :-)
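A successful build drops the jar where the later spark-submit commands expect it; a quick sanity check:
$ ls target/scala-2.10/testapp_2.10-1.0.jar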
Before submitting the application to the cluster, let's test it locally to make sure it works:
$ spark-submit --master local[*] \
--driver-memory 512m --executor-memory 512m \
--class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
'file:///root/hadoop-spark/data/clickstream/clickstream.json'
Arguments (must have):
- --master local[*] : run in local mode using all available CPU cores (*); see the example below
- --class 'x.Clickstream' : name of the class to execute
- 'target/scala-2.10/testapp_2.10-1.0.jar' : location of jar file
- input location : points to the data. Here we are using a local file, 'file:///root/hadoop-spark/data/clickstream/clickstream.json'
Arguments (optional):
- --driver-memory 512m : memory to be used by client application
- --executor-memory 512m : memory used by Spark executors
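For example, to cap the local test at two cores, use local[2] instead (local[N] is standard Spark master syntax):
$ spark-submit --master local[2] \
    --class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
    'file:///root/hadoop-spark/data/clickstream/clickstream.json'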
Are the logs too noisy, drowning out the program output?
Spark sends its logs to stderr, so add '2> logs' at the end of the command to redirect them. All the logs then go to a file called 'logs', and we can see our program output clearly.
$ spark-submit --master local[*] \
--driver-memory 512m --executor-memory 512m \
--num-executors 2 --executor-cores 1 \
--class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
'file:///root/hadoop-spark/data/clickstream/clickstream.json' 2> logs
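To confirm the redirect worked, peek at the end of the captured log file (tail is a standard Unix command):
$ tail logs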
Another way to tame logging is with log4j directives. We have a logging/log4j.properties file. Inspect this file; it has the following contents:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
Run again, this time with the logging/ directory on the driver classpath so Spark picks up our log4j.properties:
$ spark-submit --master local[*] \
--driver-class-path logging/ \
--driver-memory 512m --executor-memory 512m \
--num-executors 2 --executor-cores 1 \
--class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
'file:///root/hadoop-spark/data/clickstream/clickstream.json'
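To quiet things further, the same log4j syntax can raise the root level. For example, changing the first setting in logging/log4j.properties to show only errors:
# show errors only (sketch; adjust to taste)
log4j.rootCategory=ERROR, console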
Here is the command to submit the Spark application to the YARN cluster:
$ spark-submit --master yarn --deploy-mode client \
--driver-memory 512m --executor-memory 512m \
--num-executors 2 --executor-cores 1 \
--class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
/user/root/clickstream/in-json/clickstream.json
Must have arguments:
- --master yarn : we are submitting to YARN
- --deploy-mode client : in 'client' mode
- --class 'x.Clickstream' : name of the class to execute
- 'target/scala-2.10/testapp_2.10-1.0.jar' : location of jar file
- input location : points to the data. Here we are using a file in HDFS, '/user/root/clickstream/in-json/clickstream.json'
Optional arguments:
- --driver-memory 512m : memory to be used by client application
- --executor-memory 512m : memory used by Spark executors
- --num-executors 2 : how many executors to use
- --executor-cores 1 : use only 1 CPU core
Since we are running on a virtual machine, we keep resource usage low by specifying low memory (512m) and only one CPU core.
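If you later run the same job on a machine with more headroom, these flags scale up the same way, e.g. --executor-memory 2g --num-executors 4 (illustrative values, not tuned for any particular cluster).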
Check the following UIs:
- Resource Manager UI : to see how the application is running
- Spark Application UI : while the application is running
- Spark History Server UI : after the application has finished
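The input directory holds more than one JSON file. To list it (hdfs dfs -ls is the standard HDFS listing command):
$ hdfs dfs -ls /user/root/clickstream/in-json/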
Change the input to /user/root/clickstream/in-json to load all the JSON files in the directory, and time the run:
$ time \
spark-submit --master yarn --deploy-mode client \
--driver-memory 512m --executor-memory 512m \
--num-executors 2 --executor-cores 1 \
--class 'x.Clickstream' target/scala-2.10/testapp_2.10-1.0.jar \
/user/root/clickstream/in-json/ 2> logs