Joom Spark Platform

Joom Spark Platform is a ready-to-use Spark on Kubernetes setup. In a few minutes, you can deploy essential components such as Spark Operator and Hive Metastore, and start your first job.

Joom Spark Platform is presently available for AWS EKS as an AWS Marketplace product. Before using it, you need to subscribe (for free) to get access.

Getting Started

Prerequisites

Make sure you have an AWS EKS cluster, and you have the kubectl command configured to access it. You will also need the eksctl and helm tools installed. If you don’t have a cluster yet, it might be easier to use QuickLaunch, as described later.
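
A quick way to confirm that the tools are installed and the cluster is reachable (the exact output will vary with your environment):

kubectl get nodes
eksctl version
helm version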

Subscription

To obtain the Joom Spark Platform, you need to subscribe (for free) on AWS Marketplace.

Installation

We need to create a namespace and a service account. For initial testing, create a service account with read-only access to S3:

kubectl create namespace spark
eksctl create iamserviceaccount \
    --name spark \
    --namespace spark \
    --cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
    --approve \
    --override-existing-serviceaccounts
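
Under the hood, eksctl uses IAM Roles for Service Accounts (IRSA): it creates an IAM role with the given policy and annotates the Kubernetes service account with the role ARN. To verify, run

kubectl -n spark describe serviceaccount spark

and check that the Annotations field contains an eks.amazonaws.com/role-arn entry pointing at the newly created role.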

Use Helm 3.8.0 or later (the first version with OCI registry support), and log in to the Helm registry:

aws ecr get-login-password --region us-east-1 | helm registry login \
    --username AWS --password-stdin 709825985650.dkr.ecr.us-east-1.amazonaws.com

Then, install the Helm chart:

helm install --namespace spark \
    joom-spark-platform \
    oci://709825985650.dkr.ecr.us-east-1.amazonaws.com/joom/joom-spark-platform \
    --version 1.0.0
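
To confirm the release was installed, list the releases in the namespace; the STATUS column should read “deployed”:

helm list --namespace spark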

It might take a couple of minutes for all the components to start up. Run

kubectl -n spark get pods

and make sure all pods are ready before proceeding.
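
Instead of polling manually, you can also block until every pod reports ready (adjust the timeout to taste):

kubectl -n spark wait --for=condition=Ready pods --all --timeout=300s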

The first Spark job

Obtain the example Spark application manifest and apply it:

aws s3 cp s3://joom-analytics-cloud-public/examples/minimal/minimal.yaml minimal.yaml
kubectl apply -f minimal.yaml
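
For reference, the manifest describes a SparkApplication custom resource that the Spark Operator picks up and runs. The downloaded file is authoritative; the sketch below only illustrates the general shape, and the type, image, application file, Spark version, and resource values in it are illustrative placeholders:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: demo-minimal        # the driver pod becomes demo-minimal-driver
  namespace: spark
spec:
  type: Python              # placeholder; the example may use another type
  mode: cluster
  image: <spark-image>      # placeholder image reference
  mainApplicationFile: local:///opt/app/minimal.py   # placeholder path
  sparkVersion: "3.3.0"     # placeholder version
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark   # the IAM-backed account created earlier
  executor:
    instances: 1
    cores: 1
    memory: 1g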

Finally, watch the output of the Spark job

kubectl -n spark logs demo-minimal-driver -f

You should see a Spark session starting, and a test dataframe printed. If you get an error that the pod does not exist, try again in a few seconds.

Spark jobs that write data

Most likely, you want your Spark jobs to write some data. First, you need to decide which S3 bucket to use. For testing, a dedicated new bucket is the safest choice.
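
If you decide to create a fresh bucket, one command is enough (bucket names are globally unique, so pick your own):

aws s3 mb s3://<ENTER_YOUR_BUCKET_NAME_HERE>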

We need to give the necessary permissions to the service account. First, delete the existing one:

eksctl delete iamserviceaccount --name spark --namespace spark --cluster <ENTER_YOUR_CLUSTER_NAME_HERE> 
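
Deletion happens through CloudFormation and is not instant. You can confirm the service account is gone before recreating it; it should no longer appear in the output of

eksctl get iamserviceaccount --cluster <ENTER_YOUR_CLUSTER_NAME_HERE> --namespace spark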

Wait a couple of minutes and create it again with S3 write access:

eksctl create iamserviceaccount \
    --name spark \
    --namespace spark \
    --cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
    --approve \
    --override-existing-serviceaccounts

Obtain a Spark application manifest:

aws s3 cp s3://joom-analytics-cloud-public/examples/minimal/minimal-write.yaml minimal-write.yaml

In the file, set the DATA_BUCKET environment variable to the name of your bucket.
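
The relevant fragment of the manifest will look roughly like this (the exact field layout in the downloaded file may differ):

spec:
  driver:
    env:
      - name: DATA_BUCKET
        value: <ENTER_YOUR_BUCKET_NAME_HERE>   # your bucket name

Then, apply the manifest: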

kubectl apply -f minimal-write.yaml 

and view the logs

kubectl -n spark logs demo-minimal-write-driver -f

In the logs, you will see that test data is written to S3 and registered in the Hive metastore.
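
You can also verify from outside Spark that the files landed in the bucket (the exact prefix the example writes under may differ):

aws s3 ls s3://<ENTER_YOUR_BUCKET_NAME_HERE>/ --recursive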

If you made it this far, congratulations! At this point, you can write your own Spark jobs, put them in your own S3 buckets, and modify the manifests to use them.

Getting Started with QuickLaunch

If you don’t yet have an EKS cluster, or if you want to safely experiment in a separate environment, AWS Marketplace provides the QuickLaunch functionality that will create a cluster with Joom Spark Platform already installed.

To use it, after subscribing, select the “Helm Chart” fulfillment option, and then, on the “Launch” page, select “Launch on a new EKS cluster with QuickLaunch”. Follow the prompts to name the cluster and provide other information, and then wait until it is created.

Then, make sure you have the AWS CLI installed, as well as the kubectl and eksctl tools.

Connect to the cluster by running

aws eks update-kubeconfig --name <cluster-name> --region <region>
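
Since QuickLaunch installs Joom Spark Platform for you, the platform pods should already be up; check with

kubectl -n spark get pods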

Create a service account with S3 read permissions

eksctl create iamserviceaccount \
    --name spark \
    --namespace spark \
    --cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
    --approve \
    --override-existing-serviceaccounts

After that, you can proceed to running your first Spark job, as described in “The first Spark job” section above.

Talk to us

The Joom Spark Platform is free and has no formal support, but we’d be happy to discuss your experience, help out, and talk about data engineering and Spark topics in general. Feel free to schedule a meeting.