July | 2019 | Arvind Shyamsundar's technical blog

Warning: this walkthrough is not to be considered as official guidance or recommendation from Microsoft. It is presented for educational purposes only, and comes “as-is” and confers no rights, no warranties or guarantees.

Customers who want to deploy Apache Spark based solutions on Microsoft Azure have several options, with Azure Databricks and Azure HDInsight being the most popular. There is also the open-source Azure Distributed Data Engineering Toolkit (AZTK) if you want a more IaaS-like experience. Of course, Spark only provides the analytical compute; you also need first-class cloud storage that offers HDFS-like capabilities: distributed data storage, redundancy and security. Azure Data Lake Storage Gen 2 (ADLS Gen 2) offers exactly that, with world-wide availability and competitive pricing.

To connect to ADLS Gen 2 from Apache Hadoop or Apache Spark, you need the ABFS driver, which shipped publicly with Apache Hadoop 3.2.0. The associated work item, HADOOP-15407, has more information about this implementation and, best of all, the ABFS driver is part of the Hadoop source.

Most distributions of Spark tend to come with Hadoop 2.x versions, so the ABFS driver is absent in those cases. That is a blocker for customers who want to “roll their own” Spark infrastructure but also want to use ADLS Gen 2. I was curious to find out whether there is a way to get (let’s say) Spark 2.3.3 to work with Hadoop 3.2.0 (which does include the ABFS driver) and thereby offer at least a path forward (albeit subject to the disclaimers around supportability and stability).

The good news is that Spark comes with a “Hadoop-free” binary distribution which does allow users to associate it with any release of Hadoop, thereby allowing them to “mix and match” Spark and Hadoop versions. Here’s a set of commands that I used to do exactly this on a dev setup, just to see if it works.

The first few steps download the binary tarballs for Spark 2.3.3 (without Hadoop) and, separately, for Hadoop 3.2.0, and then extract them:

cd ~
wget https://www-eu.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-without-hadoop.tgz
wget https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxvf spark-2.3.3-bin-without-hadoop.tgz
tar -zxvf hadoop-3.2.0.tar.gz

Then we proceed to set up environment variables. The commands below assume that you have OpenJDK 8 installed. The crucial step is to specify SPARK_DIST_CLASSPATH which, as described in the Spark documentation, tells Spark to look within the appropriate Hadoop lib folders for the JARs that Spark needs. You will also notice that we add hadoop/tools/lib/* to the classpath, because that is where the ABFS driver lives. Unfortunately, the Spark documentation does not mention this vital step.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop-3.2.0
export PATH=${HADOOP_HOME}/bin:${PATH}
export SPARK_DIST_CLASSPATH=$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*
export SPARK_HOME=~/spark-2.3.3-bin-without-hadoop
export PATH=${SPARK_HOME}/bin:${PATH}
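
Before launching Spark, it can be worth confirming that the classpath really does include the directory holding the ABFS driver. The following is a minimal sketch and was not part of the original setup. The function name `find_abfs_jar` is hypothetical, and the sketch assumes the driver ships as a JAR named `hadoop-azure-<version>.jar`, as it does in the Hadoop 3.2.0 tarball under share/hadoop/tools/lib.

```shell
# Hypothetical helper: scan a colon-separated classpath for wildcard entries
# (for example /path/to/lib/*) and print the first hadoop-azure JAR found.
# Returns non-zero if no wildcard entry contains such a JAR.
find_abfs_jar() {
  # Run in a subshell so the IFS and globbing changes do not leak out.
  (
    IFS=':'
    set -f   # keep entries such as lib/* unexpanded while splitting on ':'
    for entry in $1; do
      set +f # re-enable globbing for the JAR lookup below
      case "$entry" in
        */\*) dir="${entry%/\*}" ;;  # strip the trailing /* to get the directory
        *) continue ;;               # plain files and directories are skipped
      esac
      # [0-9] avoids matching hadoop-azure-datalake-*.jar (the ADLS Gen 1 connector).
      for jar in "$dir"/hadoop-azure-[0-9]*.jar; do
        if [ -f "$jar" ]; then
          printf '%s\n' "$jar"
          exit 0
        fi
      done
    done
    exit 1
  )
}
```

With the environment variables above exported, `find_abfs_jar "$SPARK_DIST_CLASSPATH"` should print the path of the driver JAR. A non-zero exit status suggests that the tools/lib entry is missing from the classpath.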

With that in place, running spark-shell and reading from ADLS Gen 2 works out of the box! I used the sample code below to test with the SharedKey authentication option. I have not tested OAuth 2.0 authentication with this custom deployment, though.

spark.conf.set("fs.azure.account.key.<<storageaccount>>.dfs.core.windows.net", "<<key>>")
spark.read.csv("abfss://<<container>>@<<storageaccount>>.dfs.core.windows.net/<<topfolder>>/<<subfolder>>/file").count
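
As an alternative to calling spark.conf.set inside the shell, Hadoop options can generally be supplied at launch, because Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration. The sketch below is a variation I am assuming would behave the same way; it has not been tested against this custom deployment. The function name `abfs_key_conf` and the variable names `STORAGE_ACCOUNT` and `STORAGE_KEY` are hypothetical.

```shell
# Hypothetical helper: print the --conf value that hands a storage account
# key to the ABFS driver through Spark's spark.hadoop.* passthrough.
abfs_key_conf() {
  # $1 = storage account name, $2 = account key
  printf 'spark.hadoop.fs.azure.account.key.%s.dfs.core.windows.net=%s' "$1" "$2"
}

# Example launch (commented out; STORAGE_ACCOUNT and STORAGE_KEY are assumed to be set):
# spark-shell --conf "$(abfs_key_conf "$STORAGE_ACCOUNT" "$STORAGE_KEY")"
```

Note that a key passed on the command line is visible in the process list, so for anything beyond a dev setup a properties file or another secret mechanism is likely the better choice.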

In closing, I want to re-emphasize that the above should strictly be considered an experiment and is by no means production-ready. For production workloads, I strongly recommend using services like Azure Databricks or Azure HDInsight, which are much more thoroughly tested and are fully supported by Microsoft CSS.

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment. THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code. This posting is provided “AS IS” with no warranties, and confers no rights.


Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.


DIY: Apache Spark and ADLS Gen 2 support
