Send metrics from Apache Spark to Graphite

Writing down a quick note on how to enable Graphite metrics for Apache Spark, set up so that it continues working automatically after a restart.

It does not rely on spark-submit, which is what most examples seem to do.

Configuration file

This is the complete configuration you need in your metrics.properties. The file can be named anything, but following the convention Java programs use for their ini-style config files, it gets a .properties extension. The leading * applies the sink to every metrics instance (master, worker, driver and executor).

*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=metrics.internal.network
*.sink.Graphite.port=2003
*.sink.Graphite.prefix=services.spark.hostname
*.sink.Graphite.period=10
*.sink.Graphite.unit=seconds
*.sink.Graphite.protocol=tcp

Note: this configuration file is case sensitive; "Graphite" is not the same as "graphite".
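
Before restarting anything, it can be worth checking that the Graphite endpoint from the config is actually reachable. A quick way is to push a single test datapoint over Graphite's plaintext protocol; the metric name test.spark.connectivity below is just a throwaway example:

# Graphite's plaintext protocol: "<metric path> <value> <unix timestamp>\n" on TCP 2003
echo "test.spark.connectivity 1 $(date +%s)" | nc metrics.internal.network 2003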

Starting up

There is a file named spark-env.sh, in Spark's conf directory, that is sourced if present when Spark starts. Add this line to it:

SPARK_DAEMON_JAVA_OPTS='-Dspark.metrics.conf=/opt/etc/spark/metrics.properties'

My file has some more properties, so it looks like this:

SPARK_DAEMON_JAVA_OPTS='-Dspark.metrics.conf=/opt/etc/spark/metrics.properties -Dspark.deploy.zookeeper.url=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181 -Dspark.deploy.recoveryMode=ZOOKEEPER'

The extra options are there because my cluster talks to ZooKeeper for master leader election.
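
Since spark-env.sh is only sourced when a daemon starts, restart the master and workers for the setting to take effect; it will then be picked up again on every future restart. With the sbin scripts that ship with Spark it looks roughly like this (the master URL spark://master01:7077 is a placeholder, and on Spark 2.x the worker scripts are named stop-slave.sh and start-slave.sh instead):

# On the master host
$SPARK_HOME/sbin/stop-master.sh
$SPARK_HOME/sbin/start-master.sh

# On each worker host
$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/start-worker.sh spark://master01:7077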

Graphite prefix

I use a naming standard in my Graphite where host metrics look like:

hostname.metric.submetric

And for services I use:

services.service.metric

This separates hosts and services nicely.

Under the *.sink.Graphite.prefix you supply, there will be a subfolder named "master" or "worker", depending on the Spark role you are starting.
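
Putting that together with the prefix from the config above, the paths in Graphite end up looking something like this (the metric names are examples; exactly which metrics show up varies with the Spark version and which sources are enabled):

services.spark.hostname.master.workers
services.spark.hostname.master.aliveWorkers
services.spark.hostname.worker.coresFree
services.spark.hostname.worker.memFree_MB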