Avoiding error 403 ("request not authorized") when accessing ADLS Gen 2 from Azure Databricks while using a Service Principal

Azure Data Lake Storage Generation 2 (ADLS Gen 2) has been generally available since 7 Feb 2019. Azure Databricks is a first-party offering for Apache Spark. Many customers want to set ACLs on ADLS Gen 2 and then access those files from Azure Databricks, while ensuring that only the precise, minimal permissions are granted. In the process, we have seen some interesting patterns and errors (such as the infamous 403 / “request not authorized” error). While there is documentation on the topic, I wanted to provide an end-to-end walkthrough of the steps, and hopefully with this you can get everything working just perfectly! Let’s get started…

Setting up a Service Principal in Azure AD

To do this, we first need to create an application registration in Azure Active Directory (AAD). This is well documented in many places, but for reference, here is what I did.

I created a new app registration from the Portal and called it “adlsgen2app”:

image

From the next screenshot below, note the Application’s ID: 0eb2e28a-0e97-41cc-b765-4c1ec255a0bf. This is also sometimes referred to as the “Client ID”. We are going to ignore the Object ID of the application, because as you will see we will later need the Object ID of the Service Principal for that application within our AAD tenant. More on that soon.

image

We then proceed to create a secret key (“keyfordatabricks”) for this application (redacted in the screenshot below for privacy reasons):

image

Note down that key in a safe place so that we can later store it in AKV and then eventually reference that AKV-backed secret from Databricks.

Setting up an AKV-backed secret store for Azure Databricks

In order to reference the above secret from within Azure Databricks, we must first add it manually to Azure Key Vault (AKV) and then associate the AKV itself with the Databricks workspace. The instructions to do this are well documented on the official page. For reference, here is what I did.

As mentioned, I first copied the Service Principal secret into AKV (I had created an AKV instance called “mydbxakv”):

image

Then, I followed the steps in the Azure Databricks documentation to create an AKV-backed secret scope within Databricks, and reference the AKV from there:

image

Granting the Service Principal permissions in ADLS Gen 2

This is probably the most important piece, and we have had some confusion here on how exactly to set the permissions / Access Control Lists in ADLS Gen 2. To start with, we use Azure Storage Explorer to set / view these ACLs. Here’s a screenshot of the ADLS Gen 2 account that I am using. Under that account, there is a “container” (technically a “file system”) called “acltest”:

image

Before we can grant the Service Principal permissions in ADLS Gen 2, we need to identify its Object ID (OID). To do this, I used the Azure CLI to run the command below. The GUID passed to --id is the Application ID which we noted a few steps ago.

az ad sp show --id 0eb2e28a-0e97-41cc-b765-4c1ec255a0bf --query objectId

The value that is returned by the above command is the Object ID (OID) for the Service Principal:

79a448a0-11f6-415d-a451-c89f15f438f2

The OID for the Service Principal has to be used to define permissions in the ADLS Gen 2 ACLs. I repeat: do not use the Object ID of the Application; you must use the Object ID of the Service Principal in order to set ACLs / permissions at the ADLS Gen 2 level.

Set / Manage the ACLs at ADLS Gen 2 level

Let’s see this in action; in the above “acltest” container, I am going to add permission for this service principal on a specific file that I have to later access from Azure Databricks:

image

Sidebar: if we view the default permissions on this file, you will see that only $superuser has access. $superuser represents access to the ADLS Gen 2 file system via the storage key, and is only seen when the containers / file systems were created using Storage Key authentication.

To view / manage ACLs, right-click the container / folder / file in Azure Storage Explorer, and then use the “Manage Access” menu. In the Manage Access dialog, as shown in the screenshot below, I have copied (but not yet added) the OID for the Service Principal that we obtained previously. Again – I cannot emphasize this enough – please make sure you use the Object ID for the Service Principal and not the Object ID for the app registration.

image

Next I clicked the Add button and also set the level of access to web_site_1.dat (Read and Execute in this case, as I intend to only read this data into Databricks):

image

Then I clicked Save. You can also use “Default permissions” if you set ACLs on top-level folders, but do remember that those default permissions only apply to newly created children.

Permissions on existing files

Currently, for existing files / folders, you have to grant the desired permissions explicitly. Another very important point is that the Service Principal OID must also be granted Read and Execute at the root (the container level), as well as on any intermediate folder(s). In my case, the file web_site_1.dat is located under /mydata. Note that I therefore also have to add the permissions at the root level:

image

Then at /mydata level:

image

In other words, the whole chain must be granted permissions for the Service Principal: every folder in the path leading up to, and including, the (existing) file being accessed.
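As a quick illustration of the “whole chain” rule, here is a small Python sketch (a hypothetical helper, not part of the walkthrough itself) that enumerates every path needing a Read and Execute grant before a given file can be read:

```python
from pathlib import PurePosixPath

def acl_grant_chain(file_path):
    """List every path that needs a Read + Execute ("r-x") ACL for the
    Service Principal before the file can be read: the container root,
    each intermediate folder, and finally the file itself."""
    p = PurePosixPath(file_path)
    # p.parents runs from the nearest folder up to "/"; reverse it so we
    # start at the root, then append the file at the end of the chain.
    chain = [(str(parent), "r-x") for parent in reversed(p.parents)]
    chain.append((str(p), "r-x"))
    return chain

# For the file used in this walkthrough:
# acl_grant_chain("/mydata/web_site_1.dat")
# → [("/", "r-x"), ("/mydata", "r-x"), ("/mydata/web_site_1.dat", "r-x")]
```

If any entry in that list is missing its grant, the 403 described later in this post is the typical symptom.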

Using the Service Principal from Azure Databricks

Firstly, review the requirements from the official docs. We do recommend using Databricks runtime 5.2 or above.

Providing the ADLS Gen 2 credentials to Spark

In the walkthrough below, we choose to do this at the session level. In the sample code, note the use of the Databricks dbutils.secrets calls to obtain the secret for the app from AKV, and also note the use of the Application ID itself as the “client ID”:

spark.conf.set("fs.azure.account.auth.type", "OAuth") 
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "0eb2e28a-0e97-41cc-b765-4c1ec255a0bf")  # This GUID is just a sample for this walkthrough; it needs to be replaced with the actual Application ID in your case
spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "mysecretscope", key = "adlsgen2secret"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<<AAD tenant id>>/oauth2/token")

Note that the <<AAD tenant id>> placeholder above must also be substituted with the actual GUID for the AAD tenant. You can get that from the Azure Portal blade for Azure Active Directory.

Alternate way of configuring the ADLS Gen 2 credentials

In the previous code snippet, the service principal credentials are set up in such a way that they become the default for any ADLS Gen 2 account accessed from that Spark session. However, there is another (potentially more precise) way of specifying these credentials: suffix the Spark configuration item keys with the ADLS Gen 2 account name. For example, imagine that “myadlsgen2” is the name of the ADLS Gen 2 account that we are using. The suffix to be applied in this case would be myadlsgen2.dfs.core.windows.net. The Spark conf settings would then look like the below:

spark.conf.set("fs.azure.account.auth.type.myadlsgen2.dfs.core.windows.net", "OAuth") 
spark.conf.set("fs.azure.account.oauth.provider.type.myadlsgen2.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.myadlsgen2.dfs.core.windows.net", "<<application ID GUID>>") 
spark.conf.set("fs.azure.account.oauth2.client.secret.myadlsgen2.dfs.core.windows.net", dbutils.secrets.get(scope = "mysecretscope", key = "adlsgen2secret"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.myadlsgen2.dfs.core.windows.net", "https://login.microsoftonline.com/<<AAD tenant id>>/oauth2/token")

This method of suffixing the account name was enabled by this Hadoop fix and also referenced here.
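To avoid hand-typing the long suffix five times, the per-account settings can be generated with a small helper. This is just a sketch; the scope and key names in the usage comment are the example values used earlier in this post:

```python
def adls_oauth_conf(account, client_id, client_secret, tenant_id):
    """Build the per-account OAuth settings for the ABFS driver by
    suffixing each configuration key with <account>.dfs.core.windows.net."""
    sfx = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{sfx}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{sfx}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{sfx}": client_id,
        f"fs.azure.account.oauth2.client.secret.{sfx}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{sfx}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# Inside Databricks you would then apply the settings like this:
# for key, value in adls_oauth_conf(
#         "myadlsgen2",
#         "<<application ID GUID>>",
#         dbutils.secrets.get(scope="mysecretscope", key="adlsgen2secret"),
#         "<<AAD tenant id>>").items():
#     spark.conf.set(key, value)
```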

“Happy Path” testing

If all the steps were done correctly, then when you run code to read the file:

df = spark.read.csv("abfss://acltest@<<storage account name>>.dfs.core.windows.net/mydata/web_site_1.dat")
df.show()

…it works correctly!

image

But, what if you missed a step?

If you have missed any step in granting permissions at the various levels of the folder hierarchy, the read will fail with a 403 error like the one below:

StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.

The same error, from a Scala cell:

image

If you happen to run into the above error, double-check all your steps. Most likely you missed a folder-level or root-level permission (assuming you granted the permission on the file itself correctly). The other reason that I have seen is that the permissions were set in Azure Storage Explorer using the Object ID of the application, and not the Object ID of the service principal. This is clearly documented here.
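When triaging exception text or driver logs, a trivial check like the following can help confirm you are looking at this specific ADLS Gen 2 permission failure. This is a hypothetical helper that matches only the strings shown in the error above:

```python
def is_adls_permission_error(message):
    """Return True if an exception message matches the ADLS Gen 2
    403 pattern shown above (missing ACLs or wrong Object ID)."""
    return ("StatusCode=403" in message
            or "This request is not authorized to perform this operation" in message)
```

If this returns True, work back through the ACL chain (file, intermediate folders, root) and re-verify which Object ID you used.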

Another reason for the 403 “not authorized” error

Until recently, this error would occur even if the ACLs were granted perfectly, due to an issue with the ABFS driver. Because of that issue, customers had to grant the Service Principal the Storage Account Contributor role (IAM) on the ADLS Gen 2 account. Thankfully, this issue was fixed in HADOOP-15969 and the fix is now included in Databricks runtime 5.x. You no longer need to grant the Service Principal any IAM permissions on the ADLS Gen 2 account – if you get the ACLs right!

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code. This posting is provided “AS IS” with no warranties, and confers no rights.

Spark job lineage in Azure Databricks with Spline and Azure Cosmos DB API for MongoDB

Tracking lineage of data as it is manipulated within Apache Spark is a common ask from customers. As of today, there are two options, the first of which is the Hortonworks Spark Atlas Connector, which persists lineage information to Apache Atlas. However, some customers who use Azure Databricks do not necessarily need or use the “full” functionality of Atlas, and instead want a more purpose-built solution. This is where the second option, Spline, comes in. Spline can persist lineage information to Apache Atlas or to a MongoDB database. Now, given that Azure Cosmos DB exposes a MongoDB API, it presents an attractive PaaS option to serve as the persistence layer for Spline.

This blog post is the result of my attempts to use Spline from within Azure Databricks, persisting the lineage information to Azure Cosmos DB using the MongoDB API. Some open “to-do” items are at the end of this blog post.

Installing Spline within Azure Databricks

First and foremost, you need to install a number of JAR libraries to allow Spark to start talking to Spline and thereon to the Azure Cosmos DB MongoDB API. There is an open item wherein the Spline team is actively considering providing an “uber JAR” which would include all these dependencies. Until then, you will need to use the Maven coordinates shown below and install these into Azure Databricks as Maven libraries. The list (assuming you are using Spark 2.4) is below. If you are using another version of Spark within Azure Databricks, you will need to change the Maven coordinates for org.apache.spark:spark-sql-kafka and za.co.absa.spline:spline-core-spark-adapter to match the Spark version.

[Updated on 26 March 2019] The original version of this post had every single dependency (including “child” / transitive dependencies) listed. Based on expert advice from my colleague Alexandre Gattiker, there is a cleaner way of just installing the 3 libraries:

za.co.absa.spline:spline-core:0.3.6
za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6
za.co.absa.spline:spline-persistence-mongo:0.3.6

To add just these libraries, you need to specify “exclusions” when adding these libraries in the Databricks UI, such as what is shown in the screenshot below:

image

The exclusions that we have to add are:

org.apache.spark:spark-sql-kafka-0-10_2.11:${spark.version},org.json4s:json4s-native_2.11:${json4s.version}

If you still do need the full list with transitive dependencies included, it is now included as an Appendix at the very end of this post.

Preparing Azure Cosmos DB for Spline

This was pretty easy: all I needed to do was create a new Azure Cosmos DB account with the MongoDB API enabled. I did need to enable the “pipeline aggregation” preview feature, without which the Spline UI does not work. For good measure I also enabled the 3.4 wire protocol, but in hindsight it is not required, as Spline only uses the legacy MongoDB driver, which implements a much older version of the wire protocol.

image

Spark code changes

Set up the Spark session configuration items in order to connect to Azure Cosmos DB’s MongoDB endpoint:

System.setProperty("spline.mode", "REQUIRED")
System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
System.setProperty("spline.mongodb.url", "<<the primary connection string from the Azure Cosmos DB account>>")
System.setProperty("spline.mongodb.name", "<<Cosmos DB database name>>")

Enable lineage tracking for that Spark session:

import za.co.absa.spline.core.SparkLineageInitializer._
spark.enableLineageTracking()

Then we run a sample aggregation query (we do this as a Python cell just to show that the lineage tracking and persistence works even in PySpark):

%python
rawData = spark.read.option("inferSchema", "true").json("/databricks-datasets/structured-streaming/events/")
rawData.createOrReplaceTempView("rawData")
sql("select r1.action, count(*) as actionCount from rawData as r1 join rawData as r2 on r1.action = r2.action group by r1.action").write.mode('overwrite').csv("/tmp/pyaggaction.csv")

Running the Spline UI

To run the UI, I simply followed the instructions from the Spline documentation page. Since this process just needs Java to run, I can envision a PaaS-only option wherein this UI runs inside a container or an Azure App Service (doing that is left as an exercise to the reader – or maybe for a later blog post)! For now, here’s the syntax I used:

java -D"spline.mongodb.url=<<connection string from Cosmos DB page in Azure Portal>>" -D"spline.mongodb.name=<<Azure Cosmos DB name>>" -jar <<full path to spline-web-0.3.6-exec-war.jar>>

Then by browsing to the port 8080 on the machine running the UI, you can see the Spline UI. Firstly, a lineage view on the above join + aggregate query that we executed:

image

and then, a more detailed view:

image

This is of course a simple example; you can try more real-world examples on your own. Spline is under active development and is open-source; the authors are very active and responsive to queries and suggestions. I encourage you to try this out and judge for yourselves.

“To-do” list

As of this moment, I must call out some open items that I am aware of:

  • [Update 1 Apr 2019] The issue with Search has been understood and there is a reasonable way out. Again, my colleague Alexandre has been instrumental in finding the mitigation.

Unfortunately, the Search textbox does not seem to work correctly for me. I have opened an issue with the Spline team and hopefully we can track down why this is breaking.

  • I have not tested this with any major Spark jobs. If you plan to use this for any kind of serious usage, you should thoroughly test it. Please remember that this is a third-party open-source project, provided on an “as-is” basis.
  • [Update 26 March 2019]: I have verified that the above setup works correctly with VNET Service Endpoints to Azure Cosmos DB, and with corresponding firewall rules set on the Cosmos DB side to only allow traffic from the said VNET where the Databricks workspace is deployed.

My Azure Databricks workspace is deployed into an existing VNET. I still need to test service endpoints and firewall rules on the Cosmos DB side to ensure that traffic to the Azure Cosmos DB is restricted to only that from the VNET.

  • Last but not least, as I already mentioned, if the Spline team releases an uber JAR to reduce the overhead of managing and installing all the dependencies, that would make life a bit easier on the Azure Databricks front.

I hope this was useful; do try it out and leave me any questions / feedback / suggestions you have!

Appendix

Here is the full list of dependencies including all children / transitive dependencies. It may be useful if your Databricks workspace is deployed in a “locked-down” VNET with very restrictive NSG rules in place for outbound traffic:

org.json4s:json4s-native_2.11:3.5.3
org.json4s:json4s-ast_2.11:3.5.3
org.json4s:json4s-scalap_2.11:3.5.3
org.json4s:json4s-core_2.11:3.5.3
org.json4s:json4s-ext_2.11:3.5.3
org.scala-lang:scalap:2.11.12
org.scalaz:scalaz-core_2.11:7.2.27
org.slf4s:slf4s-api_2.11:1.7.25
com.github.salat:salat-core_2.11:1.11.2
com.github.salat:salat-util_2.11:1.11.2
org.mongodb:bson:3.10.1
org.apache.atlas:atlas-common:1.0.0
org.apache.atlas:atlas-intg:1.0.0
org.apache.atlas:atlas-notification:1.0.0
org.mongodb:casbah-core_2.11:3.1.1
org.mongodb:casbah-commons_2.11:3.1.1
org.mongodb:casbah-query_2.11:3.1.1
org.apache.kafka:kafka-clients:2.1.1
org.mongodb:mongo-java-driver:3.10.1
org.mongodb:mongodb-driver-legacy:3.10.1
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
za.co.absa.spline:spline-core:0.3.6
za.co.absa.spline:spline-commons:0.3.6
za.co.absa.spline:spline-core-spark-adapter-2.4:0.3.6
za.co.absa.spline:spline-core-spark-adapter-api:0.3.6
za.co.absa.spline:spline-model:0.3.6
za.co.absa.spline:spline-persistence-api:0.3.6
za.co.absa.spline:spline-persistence-atlas:0.3.6
za.co.absa.spline:spline-persistence-hdfs:0.3.6
za.co.absa.spline:spline-persistence-mongo:0.3.6
