By default, Apache Spark schedules one task per CPU core available on the system, so a 16-core VM can run 16 tasks in parallel. This is a sane default: each task thread gets a full CPU core to itself. Azure Databricks follows this convention.
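You can check what the scheduler sees from a notebook. The snippet below is a minimal PySpark sketch, assuming a hypothetical single-worker 16-core cluster with default settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# defaultParallelism reflects the total task slots offered by the workers;
# with the defaults on a 16-core worker this prints 16.
print(sc.defaultParallelism)
```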
However, when your workload is IO-bound or memory-bound rather than CPU-bound, it can make sense to change this.
You can easily override the number of cores a worker advertises by setting the SPARK_WORKER_CORES environment variable. Set it to 64 and a Standard_F16 (16 vCPUs) can run 64 Spark tasks in parallel instead of just 16.
Setting this in Databricks is very simple:
- Open the cluster configuration page
- Expand Advanced Options and select the Spark tab
- Add SPARK_WORKER_CORES=64 to the Environment Variables field
- Restart the cluster so the workers pick up the new value
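Once the cluster is back up, a quick sanity check from a notebook confirms the override took effect. This is a sketch assuming the Standard_F16 example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Should now report 64 task slots instead of 16
assert sc.defaultParallelism == 64

# A trivially parallel job with one task per slot; while it runs,
# the Spark UI should show all 64 tasks in flight simultaneously.
sc.parallelize(range(64), 64).map(lambda x: x * x).collect()
```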
We are not done yet. Picking the right overcommit ratio requires carefully watching the CPU metrics; in particular, compare CPU utilization before and after the change. If the cores were already saturated, overcommitting only adds scheduling overhead, while low utilization means there is headroom for more tasks.
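One rough way to quantify this is to compare CPU time against wall-clock run time per stage through Spark's monitoring REST API. The sketch below assumes the API is reachable on the driver at port 4040 (on Databricks the Spark UI is proxied, so this is easiest to run against a plain Spark cluster) and that executorCpuTime is reported in nanoseconds while executorRunTime is in milliseconds:

```python
import requests

# Hypothetical driver address; adjust to wherever the Spark UI is served.
BASE = "http://localhost:4040/api/v1"

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]

for stage in requests.get(f"{BASE}/applications/{app_id}/stages").json():
    run_ms = stage["executorRunTime"]        # wall-clock task time, ms
    cpu_ms = stage["executorCpuTime"] / 1e6  # CPU time, ns -> ms
    if run_ms:
        # Well below 1.0 means tasks sit idle on IO and overcommitting can
        # help; close to 1.0 means the cores were already saturated.
        print(stage["stageId"], f"cpu/wall = {cpu_ms / run_ms:.2f}")
```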