Diagnosing Executor Failures in Spark

In many cases, Spark jobs fail for a pretty clear reason, like a misspelled field name. But what if, after an hour, all you have is a bunch of error messages saying “exit code 137”? Let’s look at how to diagnose such problems.

What is exit code 137

Spark executors do not magically start. Instead, some management software, typically Kubernetes or YARN, launches them and waits for them to finish. The magic “137” value is the exit code the supervisor observes when it waits for the executor process. By a long-standing Unix convention, an exit code above 128 means the process was terminated by a signal, and the signal number is the exit code minus 128. Here, 137 - 128 = 9, and signal 9 is SIGKILL: the process was forcibly killed.

So, 137 by itself does not mean anything specific; it just means that the Java process running your Spark executor was killed by something else. However, in 99% of cases, that something else is an out-of-memory killer.
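As a minimal sketch of that convention (the object and method names here are purely illustrative), a supervisor can decode the exit code like this:

  // Hypothetical helper illustrating the "128 + signal number" convention.
  object ExitCodes {
    def describe(code: Int): String =
      if (code > 128) s"killed by signal ${code - 128}"  // 137 - 128 = 9, i.e. SIGKILL
      else s"exited with status $code"

    def main(args: Array[String]): Unit =
      println(describe(137))  // prints: killed by signal 9
  }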

There are three kinds of out-of-memory killers

  • The system-wide OOM killer runs when the server, as a whole, has no more memory. This is pretty rare.
  • The limits OOM killer runs when your application consumes more memory than it is allowed. Both Kubernetes and YARN monitor memory usage and will kill the process in that case.
  • The heap OOM killer kills a Java process when it runs out of heap space.

Let’s look at the last two cases.

Native memory limits

There are several signs of an OOM kill due to native memory limits:

  • If you’re using Kubernetes and can examine the failed pods, they will show “OOMKilled” as the termination reason.
  • Log messages like “termination reason: OOMKilled” or “SIGKILL, possible container OOM” also point to the same problem.

Before we proceed, let’s clarify what “memory usage” is.

The native memory usage

In both Kubernetes and YARN, each application has a memory limit. While you technically can skip it, that’s a pretty bad idea, and in practice it’s always set. Before we go further, it’s worth understanding what exactly is being limited.

Each program uses memory for many purposes: variables, constants, code, caches, even buffered files. Suppose the Spark executor is a 1GB Docker image; it sounds like we’d need 1GB of memory just to start running it. It turns out that modern operating systems are smart and load only the pieces that are actually used into memory. Moreover, the memory used for code can be reclaimed if that code is no longer needed. After all, the code is still on disk and can easily be loaded back into memory.

But what if you have a 1GB array in memory and are busy running loops over it? That memory must be physically reserved for your program. It cannot be temporarily discarded. It could perhaps be saved to swap space, but that would seriously reduce performance. The memory that the program has actually modified is called the “working set”, and that’s what’s normally limited.

In Kubernetes, the metric to watch is container_memory_working_set_bytes. If it exceeds the preset limit, the process is killed, no questions asked.

Memory usage in Spark

When running a JVM process such as a Spark executor, the working set consists of three main pieces:

  • Heap memory, which starts small and grows up to the configured heap size limit.
  • Direct memory, which is outside the heap but still under the control of the JVM.
  • Native memory, which is allocated by the JVM itself and by the native libraries you use.
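As a rough illustration (the sizes and names below are arbitrary), the first two categories are visible directly from code running inside an executor, while the third is not:

  import java.nio.ByteBuffer

  object MemoryKinds {
    def main(args: Array[String]): Unit = {
      // Heap memory: ordinary JVM objects, limited by spark.executor.memory (-Xmx).
      val onHeap = new Array[Byte](256 * 1024 * 1024)

      // Direct memory: allocated outside the heap but still tracked by the JVM;
      // Spark's networking and shuffle code uses buffers like this for I/O.
      val offHeap = ByteBuffer.allocateDirect(256 * 1024 * 1024)

      // Native memory (thread stacks, JIT code cache, native libraries) is allocated
      // by the JVM itself and by JNI code; it never shows up as a Java object at all.
      println(onHeap.length + offHeap.capacity())
    }
  }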

The heap size is set by the spark.executor.memory option. Then, either the spark.executor.memoryOverhead option or the spark.executor.memoryOverheadFactor option determines the memory overhead (the first is an absolute value, the second is a fraction of the executor memory). Finally, the sum of the heap size and the overhead becomes the executor memory limit.

Most OOM cases are caused by direct and native memory exceeding the configured memory overhead.

Ergo, when you run into OOM, the first order of business is to increase the memory overhead. You might also need to decrease the heap size so that the total still fits on the machine. Let’s walk through a realistic example:

  • The AWS r7g.2xlarge instance type has 64GB of memory.
  • The operating system and services need some memory; in our experience, Spark can safely use 58GB out of those 64GB.
  • We can initially set spark.executor.memory to 52GB.
  • We can also set spark.executor.memoryOverhead to 6GB.
  • The total limit will then be 58GB.

If we run into OOM, we can reduce spark.executor.memory to 40GB, and set spark.executor.memoryOverhead to 18GB.
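Here is this sizing as a configuration sketch; a SparkSession builder is shown (with a placeholder application name), but the same values can be passed through spark-submit or spark-defaults.conf:

  import org.apache.spark.sql.SparkSession

  // Sizing for a 64GB r7g.2xlarge node, leaving about 6GB for the OS and services.
  val spark = SparkSession.builder()
    .appName("executor-sizing-example")              // placeholder name
    .config("spark.executor.memory", "52g")          // JVM heap
    .config("spark.executor.memoryOverhead", "6g")   // budget for direct + native memory
    // Total requested per executor: 52g + 6g = 58g. If executors still die with
    // exit code 137, shift the balance, e.g. 40g heap + 18g overhead, keeping 58g total.
    .getOrCreate()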

How high can the overhead go? If you use a particular third-party library, it’s best to consult its documentation. If all you use is standard Spark, up to 50% of the memory might go to the memory overhead.

JVM heap errors

Compared to native memory issues, JVM heap errors are easy. Remember that as your program runs, the JVM allocates memory blocks from the heap. If there are no free blocks, it tries to garbage-collect memory that is no longer used. This can fail in two ways:

  • After trying hard, no free memory can be found at all.
  • Memory can be found, but more time is spent looking for free blocks than doing useful work.

The first condition results in an error that says java.lang.OutOfMemoryError: Java heap space. The second results in java.lang.OutOfMemoryError: GC overhead limit exceeded.
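For reference, the first error is easy to reproduce outside of Spark with a toy program that keeps every allocation reachable (run it with a small -Xmx to see the failure quickly):

  import scala.collection.mutable.ArrayBuffer

  object HeapOomDemo {
    def main(args: Array[String]): Unit = {
      val retained = ArrayBuffer.empty[Array[Byte]]
      // Every block stays referenced, so garbage collection cannot reclaim anything,
      // and the JVM eventually throws java.lang.OutOfMemoryError: Java heap space.
      while (true) retained += new Array[Byte](64 * 1024 * 1024)
    }
  }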

If you see either of the errors above, the solution is usually to give more memory to each task, as sketched below:

  • You can use smaller partitions by increasing spark.sql.shuffle.partitions, so that each task processes less data.
  • You can use bigger machines and increase spark.executor.memory.
  • As a last resort, you can decrease spark.executor.cores, so that fewer tasks run concurrently and more memory is available to each one.
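A minimal sketch of those knobs; the specific values are placeholders to be tuned for your own cluster and data:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .config("spark.sql.shuffle.partitions", "2000")  // more, and therefore smaller, partitions
    .config("spark.executor.memory", "52g")          // bigger heap on bigger machines
    .config("spark.executor.cores", "4")             // fewer concurrent tasks sharing the heap
    .getOrCreate()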

Conclusion

Hopefully, this guide will help you diagnose Spark executor failures and the infamous exit code 137.