# Session

## create session
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName('spark-local')
    .master('local[*]')
    .config('spark.memory.offHeap.enabled', 'true')
    .config('spark.memory.offHeap.size', '10g')
    .getOrCreate()
)
...
spark.stop()
```
## builder.master

Sets the Spark master URL to connect to, such as:

- `local[*]`: run locally with all cores
- `local[1]`: run locally with 1 core; much faster than `local[*]` for `spark.createDataFrame`, but for data processing we need multiple threads, otherwise it can be very slow
- `spark://<master-ip>:7077`: run on a Spark standalone cluster
- `yarn`: use YARN as the cluster manager

cli:

```shell
spark-submit --master local[2] my_app.py
```
## spark.default.parallelism

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

- only applicable to RDDs
- the default value is the total number of cores on all nodes in the cluster
- when running locally, it is set to the number of cores on your system
- for RDDs, transformations like `reduceByKey()`, `groupByKey()`, and `join()` trigger data shuffling; set the desired number of partitions for shuffle operations
## spark.sql.shuffle.partitions

- only works with DataFrames
- the default value is 200
- for DataFrames, transformations like `groupBy()` and `join()` trigger data shuffling
- set the value at runtime with `spark.conf.set()`, or on the cli with `spark-submit --conf`
### default partitions caused slow performance

https://stackoverflow.com/questions/34625410/why-does-my-spark-run-slower-than-pure-python-performance-comparison

A Spark shuffle is a very expensive operation, as it moves data between executors or even between worker nodes in a cluster. Spark automatically triggers a shuffle when we perform aggregation and join operations on RDDs and DataFrames.

By default, when Spark runs in a SQL Context or Hive Context, it uses 200 shuffle partitions. This significantly slows down performance when running locally.