Jul 2, 2015

Running SparkSQL in standalone mode for development experiments

Spark has very decent documentation in general, and the Cloudera and Databricks blogs cover most of the finer points of configuration tuning. Still, it took me a few passes through the documentation to learn how to configure my Spark development cluster. It will probably take you longer to copy and unpack the Spark archive onto your boxes than to configure everything you need for experimenting in standalone mode. To save you a few minutes, I created this gist, which summarizes my experience.

To install Spark, on each server:
  • unpack the Spark archive
  • "cp conf/spark-env.sh.template conf/spark-env.sh"
  • "vi conf/spark-env.sh"

To run Spark:
  • on the master, "./sbin/start-master.sh"
  • on the worker(s), "./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:8091" (a quick check that the cluster is up follows below)
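
Once the master and at least one worker are up, a quick way to check that the cluster is wired together is to point a Spark shell at the same master URL; the running application should then show up in the master's web UI (port 8080 by default).

    # connect an interactive shell to the standalone master started above
    ./bin/spark-shell --master spark://127.0.0.1:8091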

The gist shows how to configure a few of the most useful things (an example invocation that sets all of them follows the list):
  • Spark Executor memory, in case you want to experiment with it quickly without changing spark-env.sh on all the workers
  • the recommended non-default serializer (Kryo), for more realistic performance
  • remote JMX access to the Spark Executor (NOT secure)
  • DataFrame cache batch size
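
As a rough sketch of what those settings look like when passed on the command line rather than via spark-env.sh: the property names below are Spark's standard ones, while the concrete values and the JMX port are illustrative assumptions. The JMX options disable authentication and SSL, which is exactly why this setup is NOT secure.

    # example flags for a development session; values are illustrative
    ./bin/spark-shell --master spark://127.0.0.1:8091 \
      --conf spark.executor.memory=2g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.sql.inMemoryColumnarStorage.batchSize=10000 \
      --conf "spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8990 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

Keep in mind that a fixed JMX port only works with a single executor per host; if a box runs several executors, their JMX agents will collide on the port.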