Skip to content

Session

create session

from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName('spark-local')
    .master('local[*]')
    .config('spark.memory.offHeap.enabled', 'true')
    .config('spark.memory.offHeap.size', '10g')
    .getOrCreate()
)
...
spark.stop()

builder.master

Sets the Spark master URL to connect to, such as

  • local[*]: run locally with all cores

  • local[1]: run locally with 1 core

  • much faster than local[*] for spark.createDataFrame but
  • for data processing we need multiple threads otherwise can be very slow

  • spark://<master-ip>:7077: run on a Spark standalone cluster

  • yarn: yse YARN as cluster manager

  • cli: spark-submit --master local[2] my_app.py

spark.default.parallelism

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

  • only applicable to RDD

  • default value is the number of all cores on all nodes in a cluster

  • on local, it is set to the number of cores on your system

For RDD, transformations like reduceByKey(), groupByKey(), join() triggers the data shuffling. Set the desired partitions for shuffle operations.

spark.conf.set('spark.default.parallelism', 100)

spark.sql.shuffle.partitions

  • only works with DataFrame

  • default value for this configuration is 200

For DataFrame, transformations like groupBy(), join() triggers the data shuffling. Set the value using:

spark.conf.set('spark.sql.shuffle.partitions', 100)

Set the values in cli

spark-submit --conf spark.sql.shuffle.partitions=100 --conf spark.default.parallelism=100

default partitions casued slow performance

https://stackoverflow.com/questions/34625410/why-does-my-spark-run-slower-than-pure-python-performance-comparison

Spark shuffle is a very expensive operation as it moves the data between executors or even between worker nodes in a cluster. Spark automatically triggers the shuffle when we perform aggregation and join operations on RDD and DataFrame.

By default when run spark in SQL Context or Hive Context it will use 200 partitions by default. This will significantly slow down the performance when running locally.

spark.conf.set('spark.sql.shuffle.partitions', 1)