Getting Started
Initial setup
To set up Spark Performance Advisor, first make sure you have logged in to our console. Next, we need to make your Spark job submit performance data. This requires:
- Having the code for the Spark listener available
- Activating the listener
- Providing the API token for our service
The simplest way for initial testing is to set three Spark configuration variables:
| Configuration | Value |
|---|---|
| spark.jars.packages | com.joom.spark:spark-platform_2.12:0.3.5 |
| spark.extraListeners | com.joom.spark.monitoring.StatsReportingSparkListener |
| spark.joom.cloud.token | <your api token> |
We recommend that you visit the Getting Started section of the service console and copy the names and values of these variables, including your unique API token.
You can pass these parameters to spark-submit, to the Airflow operator you use, or in any other way suitable for your environment.
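For example, a spark-submit invocation could set them with the --packages and --conf flags. This is only a sketch: the class and jar names are placeholders for your own job, and the token placeholder is yours to fill in.

```sh
spark-submit \
  --packages com.joom.spark:spark-platform_2.12:0.3.5 \
  --conf spark.extraListeners=com.joom.spark.monitoring.StatsReportingSparkListener \
  --conf "spark.joom.cloud.token=<your api token>" \
  --class com.example.DailyReport \
  daily-report.jar
```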
Then, run your job. The second step of the getting-started wizard shows whether we have received any data, whether we have received complete job run data, and whether we have finished processing that data. Once the last checkbox is checked, the initial setup is complete.
Production setup
Adding parameters to one job is a quick way to test that Spark Advisor is working. However, to minimize overhead and to make sure that all jobs use these options, a little more setup is required, as described in the sections below.
Giving a job a name
Tracking a job's performance is only useful if we know the job's name. We suggest you use the team_name.job_name format, and set it when building your Spark session, like this:
val spark = SparkSession
.builder()
.appName("sales.daily_report")
.getOrCreate()
Using our package
With the setup above, our code package will be downloaded by Spark at runtime. While the package is fairly small, and Spark often caches it, it is still not perfect for production. It’s better to make this package available locally to all jobs. The best way to do it depends on your environment.
We generally recommend using shadow (assembly) jars to include our package. You might already be using them; in that case, just adding our package to your build process is sufficient, as in the sketch below.
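As an illustration, assuming you build your assembly ("shadow") jar with sbt and sbt-assembly, adding the dependency to build.sbt is usually all that is needed; the explicit _2.12 artifact name matches the Maven coordinates above.

```scala
// build.sbt (sketch): bundle the listener into your assembly jar
libraryDependencies += "com.joom.spark" % "spark-platform_2.12" % "0.3.5"
```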
Making the configuration permanent
After you've installed the package, please make sure that spark.extraListeners and spark.joom.cloud.token are configured for all jobs. There are two general ways to accomplish this: in the job launch and in the job code.
For example, if you use Airflow and the SparkSubmitOperator, you can pass these two configuration variables in the conf parameter. To avoid doing this for every task, we recommend creating a helper function that creates Spark tasks in Airflow and adds these parameters, and using that function everywhere.
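As an illustration only (the helper name, the application path, and the defaults below are ours, not part of the service or of Airflow), such a helper in a DAG file might look like this:

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Settings every job should carry; the token placeholder is yours to fill in.
ADVISOR_CONF = {
    "spark.extraListeners": "com.joom.spark.monitoring.StatsReportingSparkListener",
    "spark.joom.cloud.token": "<your api token>",
}

def spark_task(task_id, application, conf=None, **kwargs):
    """Create a SparkSubmitOperator with the Advisor settings merged into its conf."""
    merged = {**ADVISOR_CONF, **(conf or {})}
    return SparkSubmitOperator(task_id=task_id, application=application, conf=merged, **kwargs)

# Usage inside a DAG definition (path and name are hypothetical):
# report = spark_task("daily_report", "s3://jobs/daily_report.jar")
```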
If you use Spark on Kubernetes with the Spark Operator, you need to add the configuration to the sparkConf map of the SparkApplication object that you create. Again, we recommend creating a utility function and using it.
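For illustration, assuming the Spark Operator's v1beta2 SparkApplication API, the relevant fragment of the manifest might look like this (the application name is hypothetical and the rest of the spec is elided):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-report          # hypothetical name
spec:
  sparkConf:
    spark.extraListeners: "com.joom.spark.monitoring.StatsReportingSparkListener"
    spark.joom.cloud.token: "<your api token>"
  # ...image, mainClass, driver/executor settings, etc.
```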
Finally, the configuration can be specified in the Spark job itself when creating the session, with code like this:
val spark = SparkSession
  .builder()
  ...
  .config("spark.extraListeners", "com.joom.spark.monitoring.StatsReportingSparkListener")
  .config("spark.joom.cloud.token", "... your token")
  .getOrCreate()
Here, too, we recommend creating a helper function that builds the Spark session and sets these configurations.
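A minimal sketch of such a helper (the object and function names are ours; how you obtain the token is up to your environment):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: one place where every job builds its session and picks up the Advisor settings.
object SparkSessions {
  def build(appName: String, token: String): SparkSession =
    SparkSession
      .builder()
      .appName(appName)
      .config("spark.extraListeners", "com.joom.spark.monitoring.StatsReportingSparkListener")
      .config("spark.joom.cloud.token", token)
      .getOrCreate()
}

// Usage in a job, e.g. with the token taken from an environment variable:
// val spark = SparkSessions.build("sales.daily_report", sys.env("SPARK_ADVISOR_TOKEN"))
```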
Environment-specific notes
Databricks
Databricks is unique in that it allows you to add packages and settings to a cluster through the cluster settings UI.
In the Libraries tab, select Install New, then Maven, and then, in the Coordinates field, enter com.joom.spark:spark-platform_2.12:0.3.5. Then press Install.
In the Configuration tab of your cluster settings, scroll towards the end and choose the Spark sub-tab. Add the following to the Spark config input field:
spark.extraListeners: com.joom.spark.monitoring.StatsReportingSparkListener
spark.joom.cloud.token: ...your token...