To set up Spark Performance Advisor, first make sure you have logged in to our console. Next, we need to make your Spark job submit performance data. This requires:
- Having the code for the Spark listener
- Activating the listener
- Providing the API token for our service
The simplest way for initial testing is to set three Spark configuration variables:

| Variable | Value |
| --- | --- |
| `spark.jars.packages` | `com.joom.spark:spark-platform_2.12:0.3.5` |
| `spark.extraListeners` | `com.joom.spark.monitoring.StatsReportingSparkListener` |
| `spark.joom.cloud.token` | `<your api token>` |
We recommend that you visit the Getting Started section of the service console and copy the names and values of these variables, including your unique API token.
You can pass these parameters to spark-submit, to whatever Airflow operator you use, or in any other way suitable for your environment.
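For instance, a spark-submit invocation might look like the following. This is a sketch: the application jar name is a placeholder, and the package version should match the one shown in the installation instructions.

```
spark-submit \
  --packages com.joom.spark:spark-platform_2.12:0.3.5 \
  --conf "spark.extraListeners=com.joom.spark.monitoring.StatsReportingSparkListener" \
  --conf "spark.joom.cloud.token=<your api token>" \
  your-job.jar
```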
Then, run your job. The second step of the getting-started wizard shows whether we have received any data, whether the job run data is complete, and when we have processed that data. After the last checkbox is checked, the initial setup is complete.
Adding parameters to a single job is a quick way to test that Spark Advisor is working. However, to minimize overhead and to make sure that all jobs use these options, a little more setup is required, as described in the sections below.
Giving the job a name
Tracking job performance is only useful if we know the job's name. We suggest using the team_name.job_name format, and setting the name when building your Spark session, like this:
```
val spark = SparkSession
  .builder()
  .appName("sales.daily_report")
  .getOrCreate()
```
Using our package
With the setup above, our code package is downloaded by Spark at runtime. While the package is fairly small, and Spark often caches it, this is still not ideal for production. It is better to make the package available locally to all jobs. The best way to do that depends on your environment.
We generally recommend using a shadow (fat) jar to include our package. If you already build one, simply adding our package to your build process is sufficient.
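For example, if you build your fat jar with sbt, the dependency would look like the line below (a sketch: it uses the same Maven coordinates as in the Databricks instructions, and you should adjust the version to the latest release):

```
// build.sbt: bundle the advisor package into your shadow/fat jar
libraryDependencies += "com.joom.spark" %% "spark-platform" % "0.3.5"
```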
Making the configuration permanent
After you've installed the package, please make sure that `spark.extraListeners` and `spark.joom.cloud.token` are configured for all jobs. There are two general ways to accomplish this: at job launch and in the job code.
For example, if you use Airflow and the SparkSubmitOperator, you can pass these two configuration variables in the conf parameter. To avoid doing this for every task, we recommend that you create a helper function that creates Spark tasks in Airflow with these parameters added, and use that function everywhere.
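As an illustration, the core of such a helper could be a small Python function that merges the advisor settings into each task's conf dictionary. This is a sketch: `with_advisor_conf` and `ADVISOR_CONF` are names of our own invention, not part of Airflow.

```python
# Advisor settings that every Spark task should carry.
ADVISOR_CONF = {
    "spark.extraListeners": "com.joom.spark.monitoring.StatsReportingSparkListener",
    "spark.joom.cloud.token": "<your api token>",
}

def with_advisor_conf(conf=None):
    """Merge the advisor settings into a task-specific Spark conf dict.

    Task-specific values win on conflict, so an individual task can
    still override any setting explicitly.
    """
    merged = dict(ADVISOR_CONF)
    merged.update(conf or {})
    return merged

# Usage with Airflow's SparkSubmitOperator (not imported here):
#   SparkSubmitOperator(task_id="daily_report",
#                       conf=with_advisor_conf({"spark.executor.memory": "4g"}))
```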
If you use Spark on Kubernetes with the Spark Operator, you need to add the configuration to the sparkConf map of the SparkApplication object that you create. Again, we recommend creating a utility function and using it.
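For example, a SparkApplication manifest fragment could look like this (a sketch: only the relevant fields are shown, and the metadata is illustrative):

```
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sales-daily-report
spec:
  sparkConf:
    spark.extraListeners: "com.joom.spark.monitoring.StatsReportingSparkListener"
    spark.joom.cloud.token: "<your api token>"
```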
Finally, the configuration can be specified in the Spark job itself when creating the session, with code like:

```
val spark = SparkSession
  .builder()
  ...
  .config("spark.extraListeners", "com.joom.spark.monitoring.StatsReportingSparkListener")
  .config("spark.joom.cloud.token", "... your token")
  .getOrCreate()
```
Here, too, we repeat the recommendation to create a function that builds the Spark session and sets these configuration options.
Databricks
Databricks is unique in that it allows you to add packages and settings to a cluster in the cluster settings UI.

On the Libraries tab, select Install New, then Maven, and then, in the Coordinates field, enter com.joom.spark:spark-platform_2.12:0.3.5. Then, press Install.

Next, open the Configuration tab of your cluster settings. Towards the end, choose the Spark sub-tab. Add the following in the Spark config input field:

```
spark.extraListeners: com.joom.spark.monitoring.StatsReportingSparkListener
spark.joom.cloud.token: ...your token...
```