Spark job lineage in Azure Databricks with Spline and Azure Cosmos DB API for MongoDB

Tracking the lineage of data as it is manipulated within Apache Spark is a common ask from customers. As of this writing, there are two options. The first is the Hortonworks Spark Atlas Connector, which persists lineage information to Apache Atlas. However, some customers who use Azure Databricks do not necessarily need or use the “full” functionality of Atlas, and instead want a more purpose-built solution. This is where the second option, Spline, comes in. Spline can persist lineage information to Apache Atlas or to a MongoDB database. Given that Azure Cosmos DB exposes a MongoDB API, it presents an attractive PaaS option to serve as the persistence layer for Spline.

This blog post is the result of my attempts to use Spline from within Azure Databricks, persisting the lineage information to Azure Cosmos DB using the MongoDB API. Some open “to-do” items are at the end of this blog post.

Installing Spline within Azure Databricks

First and foremost, you need to install a number of JAR libraries so that Spark can talk to Spline, and from there to the Azure Cosmos DB MongoDB API. There is an open item wherein the Spline team is actively considering providing an “uber JAR” which would include all these dependencies. Until then, you will need to use the Maven coordinates shown below and install these into Azure Databricks as Maven libraries. The list (assuming you are using Spark 2.4) is below. If you are using another version of Spark within Azure Databricks, you will need to change the Maven coordinates for org.apache.spark:spark-sql-kafka-0-10 and za.co.absa.spline:spline-core-spark-adapter to match that Spark version.

[Updated on 26 March 2019] The original version of this post listed every single dependency (including “child” / transitive dependencies). Based on expert advice from my colleague Alexandre Gattiker, there is a cleaner way: install just these 3 libraries:

za.co.absa.spline:spline-core:0.3.6
za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6
za.co.absa.spline:spline-persistence-mongo:0.3.6

To add just these libraries, you need to specify “exclusions” when adding them in the Databricks UI, as shown in the screenshot below:

[Screenshot: adding a Maven library with exclusions in the Databricks UI]

The exclusions that we have to add are:

org.apache.spark:spark-sql-kafka-0-10_2.11:${spark.version},org.json4s:json4s-native_2.11:${json4s.version}
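If you prefer scripting to the UI, the same libraries and exclusions can also be expressed as a payload for the Databricks Libraries REST API (POST /api/2.0/libraries/install). I have not run this exact payload, and the cluster ID is a placeholder, but the shape follows the Libraries API documentation:

```json
{
  "cluster_id": "<your-cluster-id>",
  "libraries": [
    { "maven": { "coordinates": "za.co.absa.spline:spline-core:0.3.6" } },
    { "maven": { "coordinates": "za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6" } },
    {
      "maven": {
        "coordinates": "za.co.absa.spline:spline-persistence-mongo:0.3.6",
        "exclusions": [
          "org.apache.spark:spark-sql-kafka-0-10_2.11",
          "org.json4s:json4s-native_2.11"
        ]
      }
    }
  ]
}
```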

If you still do need the full list with transitive dependencies included, it is now included as an Appendix at the very end of this post.

Preparing Azure Cosmos DB for Spline

This was pretty easy: all I needed to do was create a new Azure Cosmos DB account with the MongoDB API enabled. I did need to enable the “pipeline aggregation” preview feature, without which the Spline UI does not work. For good measure I also enabled the 3.4 wire protocol, but in hindsight that is not required, as Spline uses the legacy MongoDB driver, which speaks a much older version of the wire protocol.
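For repeatability, the account can also be created from the Azure CLI rather than the portal. Here is a sketch I have not run as-is (the account and resource group names are placeholders; EnableAggregationPipeline is, to my understanding, the capability behind the “pipeline aggregation” preview toggle):

```shell
az cosmosdb create \
  --name my-spline-cosmos \
  --resource-group my-rg \
  --kind MongoDB \
  --capabilities EnableAggregationPipeline
```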

[Screenshot: enabling the “pipeline aggregation” preview feature on the Azure Cosmos DB account]

Spark code changes

Set up the Spark session configuration items in order to connect to Azure Cosmos DB’s MongoDB endpoint:

System.setProperty("spline.mode", "REQUIRED")
System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
System.setProperty("spline.mongodb.url", "<<the primary connection string from the Azure Cosmos DB account>>")
System.setProperty("spline.mongodb.name", "<<Cosmos DB database name>>")
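As a side note, the primary connection string embeds the account key, so be careful not to echo it into notebook output or logs. Here is a small, Spark-independent Python sketch (the connection string below is made up) that masks the key before logging:

```python
from urllib.parse import urlparse

def redact_mongo_url(url: str) -> str:
    """Mask the password (the Cosmos DB account key) in a MongoDB
    connection string so it can be logged safely."""
    parsed = urlparse(url)
    if parsed.password:
        return url.replace(parsed.password, "***")
    return url

example = "mongodb://myaccount:s3cretkey@myaccount.documents.azure.com:10255/?ssl=true"
print(redact_mongo_url(example))
# -> mongodb://myaccount:***@myaccount.documents.azure.com:10255/?ssl=true
```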

Enable lineage tracking for that Spark session:

import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()

Then we run a sample aggregation query (we do this as a Python cell just to show that the lineage tracking and persistence work even from PySpark):

%python
rawData = spark.read.option("inferSchema", "true").json("/databricks-datasets/structured-streaming/events/")
rawData.createOrReplaceTempView("rawData")
sql("select r1.action, count(*) as actionCount from rawData as r1 join rawData as r2 on r1.action = r2.action group by r1.action").write.mode('overwrite').csv("/tmp/pyaggaction.csv")
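Incidentally, the self-join in that query exists mainly to give Spline an interesting lineage graph to capture, but it does change the numbers: joining rawData to itself on action means each action’s grouped count is the square of its raw row count. A tiny pure-Python sketch of that arithmetic (the sample data is invented):

```python
from collections import Counter

# Invented stand-in for the "action" column of rawData
actions = ["Open", "Open", "Close", "Open"]

per_action = Counter(actions)  # raw counts: Open=3, Close=1
# The self-join pairs every row with every same-action row, so the
# grouped count(*) after the join is the square of each raw count.
action_count = {a: n * n for a, n in per_action.items()}
print(action_count)
# -> {'Open': 9, 'Close': 1}
```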

Running the Spline UI

To run the UI, I simply followed the instructions from the Spline documentation page. Since this process just needs Java to run, I can envision a PaaS-only option wherein this UI runs inside a container or an Azure App Service (doing that is left as an exercise for the reader – or maybe for a later blog post)! For now, here’s the syntax I used:

java -D"spline.mongodb.url=<<connection string from Cosmos DB page in Azure Portal>>" -D"spline.mongodb.name=<<Azure Cosmos DB name>>" -jar <<full path to spline-web-0.3.6-exec-war.jar>>
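On that container idea: the same java command could be baked into an image. Here is a minimal, untested Dockerfile sketch (the base image, jar location, and the JAVA_OPTS convention are all my assumptions):

```dockerfile
FROM openjdk:8-jre-slim
COPY spline-web-0.3.6-exec-war.jar /app/spline-web.jar
EXPOSE 8080
# Pass the Cosmos DB settings at run time, for example:
#   docker run -p 8080:8080 \
#     -e JAVA_OPTS='-Dspline.mongodb.url=... -Dspline.mongodb.name=...' spline-ui
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app/spline-web.jar"]
```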

Then, by browsing to port 8080 on the machine running the UI, you can see the Spline UI. First, a lineage view of the join + aggregate query we executed above:

[Screenshot: Spline UI lineage overview for the join + aggregate query]

and then, a more detailed view:

[Screenshot: Spline UI detailed lineage view]

This is of course a simple example; you can try more real-world examples on your own. Spline is under active development and is open-source; the authors are very active and responsive to queries and suggestions. I encourage you to try this out and judge for yourselves.

“To-do” list

As of this moment, I must call out some open items that I am aware of:

  • Unfortunately, the Search textbox did not seem to work correctly for me. I opened an issue with the Spline team to track down why it was breaking. [Update 1 Apr 2019] The issue with Search has been understood and there is a reasonable way out; again, my colleague Alexandre has been instrumental in finding the mitigation.

  • I have not tested this with any major Spark jobs. If you plan to use this for any kind of serious usage, you should thoroughly test it. Please remember that this is a third-party open-source project, provided on an “as-is” basis.
  • My Azure Databricks workspace is deployed into an existing VNET. [Update 26 March 2019] I have verified that the above setup works correctly with VNET service endpoints to Azure Cosmos DB, and with corresponding firewall rules set on the Cosmos DB side to only allow traffic from the VNET where the Databricks workspace is deployed.

  • Last but not least, as I already mentioned, if the Spline team releases an uber JAR that reduces the overhead of managing and installing all the dependencies, that would make life a bit easier on the Azure Databricks front.

I hope this was useful; do try it out and leave me any questions / feedback / suggestions you have!

Appendix

Here is the full list of dependencies including all children / transitive dependencies. It may be useful if your Databricks workspace is deployed in a “locked-down” VNET with very restrictive NSG rules in place for outbound traffic:

org.json4s:json4s-native_2.11:3.5.3
org.json4s:json4s-ast_2.11:3.5.3
org.json4s:json4s-scalap_2.11:3.5.3
org.json4s:json4s-core_2.11:3.5.3
org.json4s:json4s-ext_2.11:3.5.3
org.scala-lang:scalap:2.11.12
org.scalaz:scalaz-core_2.11:7.2.27
org.slf4s:slf4s-api_2.11:1.7.25
com.github.salat:salat-core_2.11:1.11.2
com.github.salat:salat-util_2.11:1.11.2
org.mongodb:bson:3.10.1
org.apache.atlas:atlas-common:1.0.0
org.apache.atlas:atlas-intg:1.0.0
org.apache.atlas:atlas-notification:1.0.0
org.mongodb:casbah-core_2.11:3.1.1
org.mongodb:casbah-commons_2.11:3.1.1
org.mongodb:casbah-query_2.11:3.1.1
org.apache.kafka:kafka-clients:2.1.1
org.mongodb:mongo-java-driver:3.10.1
org.mongodb:mongodb-driver-legacy:3.10.1
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
za.co.absa.spline:spline-core:0.3.6
za.co.absa.spline:spline-commons:0.3.6
za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6
za.co.absa.spline:spline-core-spark-adapter-api:0.3.6
za.co.absa.spline:spline-model:0.3.6
za.co.absa.spline:spline-persistence-api:0.3.6
za.co.absa.spline:spline-persistence-atlas:0.3.6
za.co.absa.spline:spline-persistence-hdfs:0.3.6
za.co.absa.spline:spline-persistence-mongo:0.3.6

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code. This posting is provided “AS IS” with no warranties, and confers no rights.

4 Comments

  1. Hi There,

    I seem to be having an issue running the UI. Everything in your article works up to the point where you launch the UI.

    Where do you execute the jar command? Do you install the JAR Locally along with the pom.xml and script? Then run the jar command on command prompt?

    Need more clarity on that.

    Thanks,

    Mark

