Configuring Apache Spark to oversubscribe CPU cores
Apache Spark by default schedules one task per CPU core available on the worker, so a 16-core VM offers 16 task slots. This is a sane default: each task thread gets a full CPU core to itself. Azure Databricks follows this convention.
However, when your workload is not CPU-bound but I/O- or memory-bound, it can make sense to change this: a task that is blocked waiting on I/O occupies a slot without actually using its core.
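To see why this matters, here is a minimal PySpark sketch (my illustration, not from the original post) that simulates an I/O-bound job: each task sleeps for one second instead of burning CPU. With 16 task slots, 64 such tasks need roughly four scheduling waves (~4 s); with 64 slots they finish in roughly one (~1 s).

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def fake_io(_):
    # Simulate a blocking I/O call (e.g. a slow HTTP request):
    # the task holds a slot but leaves its CPU core idle.
    time.sleep(1)
    return 1

n_tasks = 64  # hypothetical task count, chosen for illustration
start = time.time()
sc.parallelize(range(n_tasks), n_tasks).map(fake_io).sum()
print(f"{n_tasks} I/O-bound tasks finished in {time.time() - start:.1f}s")
```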
You can override the number of cores a worker advertises by setting the SPARK_WORKER_CORES environment variable. With SPARK_WORKER_CORES=64, a Standard_F16 (16 vCPUs) can run 64 Spark tasks in parallel instead of just 16.
Setting this in Databricks is very simple:
- Open the cluster configuration page
- Select Advanced Options
- Under Environment Variables, add SPARK_WORKER_CORES=64 (you can verify the change with the snippet below)
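Once the cluster restarts, you can sanity-check the new setting from a notebook. On standalone and Databricks clusters, defaultParallelism normally equals the total number of task slots across all workers, so a single reconfigured Standard_F16 worker should now report 64 (assuming no other parallelism settings override it).

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Total task slots across the cluster's workers:
# expect 64 here instead of 16 after the change.
print(spark.sparkContext.defaultParallelism)
```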
We are not done yet. Picking the correct oversubscription ratio requires watching the CPU metrics carefully; in particular, compare CPU utilization before and after the change.
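The cluster metrics charts in the Databricks UI are the easiest place to watch this. On a plain Spark worker, a small sketch like the following (using the third-party psutil library, my addition, not from the post) can log utilization so you can compare runs before and after raising SPARK_WORKER_CORES.

```python
import psutil  # third-party: pip install psutil

# Sample aggregate CPU utilization once per second while a job runs.
# If utilization already sits near 100% before oversubscribing, extra
# task slots will only add contention; if it idles, there is headroom.
samples = [psutil.cpu_percent(interval=1) for _ in range(60)]
print(f"avg={sum(samples) / len(samples):.1f}%  max={max(samples):.1f}%")
```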