To set up Spark Performance Advisor, first make sure you have logged in to our console. Next, we need to make your Spark job submit performance data. This requires:
- Having the code for the Spark listener
- Activating the listener
- Providing the API token for our service
The simplest way for initial testing is to set three Spark configuration variables:

| Variable | Value |
| --- | --- |
| `spark.jars.packages` | `com.joom.spark:spark-platform_2.12:0.3.5` |
| `spark.extraListeners` | `com.joom.spark.monitoring.StatsReportingSparkListener` |
| `spark.joom.cloud.token` | `<your api token>` |
We recommend that you visit the Getting Started section of the service console and copy the names and values of these variables, including your unique API token.
You can pass these parameters to spark-submit, to whatever Airflow operator you use, or in any other way suitable for your environment.
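For instance, a spark-submit invocation might look like the following. This is a sketch: the application jar name is a placeholder, and the package version should match the one shown in the installation instructions.

```
spark-submit \
  --packages com.joom.spark:spark-platform_2.12:0.3.5 \
  --conf "spark.extraListeners=com.joom.spark.monitoring.StatsReportingSparkListener" \
  --conf "spark.joom.cloud.token=<your api token>" \
  your-job.jar
```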
Then, run your job. The second step of the getting-started wizard shows whether we have received any data, whether the job run data is complete, and when we have processed that data. After the last checkbox is checked, the initial setup is complete.
Adding parameters to a single job is a quick way to test that Spark Advisor is working. However, to minimize overhead and to make sure that all jobs use these options, a little more setup is required, as described in the sections below.
Giving the job a name
Tracking job performance is only useful if we know the job's name. We suggest using the team_name.job_name format, and setting the name when building your Spark session, like this:
```
val spark = SparkSession
  .builder()
  .appName("sales.daily_report")
  .getOrCreate()
```
Using our package
With the setup above, our code package is downloaded by Spark at runtime. While the package is fairly small, and Spark often caches it, this is still not ideal for production. It is better to make the package available locally to all jobs. The best way to do that depends on your environment.
We generally recommend using a shadow (fat) jar to include our package. If you already build one, simply adding our package to your build process is sufficient.
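For example, if you build your fat jar with sbt, the dependency would look like the line below (a sketch: it uses the same Maven coordinates as in the Databricks instructions, and you should adjust the version to the latest release):

```
// build.sbt: bundle the advisor package into your shadow/fat jar
libraryDependencies += "com.joom.spark" %% "spark-platform" % "0.3.5"
```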
Making the configuration permanent
After you've installed the package, please make sure that `spark.extraListeners` and `spark.joom.cloud.token` are configured for all jobs. There are two general ways to accomplish this: at job launch and in the job code.
For example, if you use Airflow and the SparkSubmitOperator, you can pass these two configuration variables in the conf parameter. To avoid doing this for every task, we recommend that you create a helper function that creates Spark tasks in Airflow with these parameters added, and use that function everywhere.
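As an illustration, the core of such a helper could be a small Python function that merges the advisor settings into each task's conf dictionary. This is a sketch: `with_advisor_conf` and `ADVISOR_CONF` are names of our own invention, not part of Airflow.

```python
# Advisor settings that every Spark task should carry.
ADVISOR_CONF = {
    "spark.extraListeners": "com.joom.spark.monitoring.StatsReportingSparkListener",
    "spark.joom.cloud.token": "<your api token>",
}

def with_advisor_conf(conf=None):
    """Merge the advisor settings into a task-specific Spark conf dict.

    Task-specific values win on conflict, so an individual task can
    still override any setting explicitly.
    """
    merged = dict(ADVISOR_CONF)
    merged.update(conf or {})
    return merged

# Usage with Airflow's SparkSubmitOperator (not imported here):
#   SparkSubmitOperator(task_id="daily_report",
#                       conf=with_advisor_conf({"spark.executor.memory": "4g"}))
```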
If you use Spark on Kubernetes with the Spark Operator, you need to add the configuration to the sparkConf map of the SparkApplication object that you create. Again, we recommend creating a utility function and using it.
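For example, a SparkApplication manifest fragment could look like this (a sketch: only the relevant fields are shown, and the metadata is illustrative):

```
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: sales-daily-report
spec:
  sparkConf:
    spark.extraListeners: "com.joom.spark.monitoring.StatsReportingSparkListener"
    spark.joom.cloud.token: "<your api token>"
```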
Finally, the configuration can be specified in the Spark job itself when creating the session, with code like:

```
val spark = SparkSession
  .builder()
  ...
  .config("spark.extraListeners", "com.joom.spark.monitoring.StatsReportingSparkListener")
  .config("spark.joom.cloud.token", "... your token")
  .getOrCreate()
```
Here, too, we repeat the recommendation to create a function that builds the Spark session and sets these configuration options.
Databricks
Databricks is unique in that it allows you to add packages and settings to a cluster in the cluster settings UI.

On the Libraries tab, select Install New, then Maven, and then, in the Coordinates field, enter com.joom.spark:spark-platform_2.12:0.3.5. Then, press Install.

Next, open the Configuration tab of your cluster settings. Towards the end, choose the Spark sub-tab. Add the following in the Spark config input field:

```
spark.extraListeners: com.joom.spark.monitoring.StatsReportingSparkListener
spark.joom.cloud.token: ...your token...
```