Create your own database with all PASS Summit 2017 Sessions using SQL Server 2017 ML Services (Python)

Next week is the biggest event for the SQL Server community: PASS Summit 2017! I will be presenting the Real-world SQL Server R Services session with two of our customers – APT and Financial Fabric. I will also be part of the SQL Clinic, so I’m really excited and pumped about next week!

Being a Data Nut, it suddenly occurred to me today: what if we can import all the PASS Summit 2017 session titles, abstracts and other details into a SQL Server table? And once this thought was in my mind, there was no stopping me! The wonderful thing is that with Python in SQL Server 2017 you can leverage rich libraries such as Beautiful Soup 4, URLLib3 to parse HTML and then present it as a structured table (using Pandas) which SQL can then consume.

The code below will do exactly that for you. It leverages the above mentioned Python libraries, so prior to executing the script you must install two libraries (bs4 & urllib3) using PIP. (Note that pip.exe is present under the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\scripts folder; change drive letters as appropriate for your installation.) Also note that PIP must be executed from within an administrative CMD prompt in Windows.

The other step for the below script to run correctly is that outgoing Internet access must be allowed for the SQL instance’s R / Python scripts. By default, we block such access. To edit the rule temporarily (we strongly recommend to not disable this rule on a permanent basis) use wf.msc to open the Firewall with Advanced Security console, and then locate the rule “Block network access for R local user accounts in SQL Server instance ” in the Outbound Rules section. Right click and Disable the rule for now; and DO NOT FORGET to enable it later on!

Alright, here is the code!

CREATE DATABASE PASS2017
GO

USE PASS2017
GO

DROP TABLE IF EXISTS PASS2017Sessions

CREATE TABLE PASS2017Sessions
(Abstract nvarchar(max)
		, SessionLevel int
		, Speaker1 nvarchar(100)
		, Speaker2 nvarchar(100)
		, Speaker3 nvarchar(100)
		, Title nvarchar(4000)
		, Track nvarchar(50)
		, SessionId int
		);
GO

INSERT PASS2017Sessions
exec sp_execute_external_script @language = N'Python',
@script = N'
from bs4 import BeautifulSoup
import urllib3
import re
import pandas as pd

http = urllib3.PoolManager()
schedpage = http.request("GET", "http://www.pass.org/summit/2017/Sessions/Schedule.aspx")
schedpage.status
soup_schedpage = BeautifulSoup(schedpage.data, "lxml")
schedpage.close()

documents = []
processedsessions = []

sessioncells = soup_schedpage.find_all("div", class_ = "session-schedule-cell", recursive=True)
for currsess in sessioncells:
    hrefs = currsess.find_all("a")
    if (len(hrefs) >= 1):
        rowdict = {}

        # session title

        rowdict["Title"] = hrefs[0].text

        # session level
        sesslevel = currsess.find("p", id = re.compile("plcLevel"))
        if (sesslevel != None):
            rowdict["Level"] = sesslevel.text.replace("Level: ", "")
        else:
            rowdict["Level"] = None

        # session track
        allps = currsess.find_all("p")
        rowdict["Track"] = allps[len(allps) -2].text

        # get into session page itself
        if ("href" in hrefs[0].attrs):
            sessurl = hrefs[0].attrs["href"]

            # session ID
            mtch = re.search(r"sid=(d+)", sessurl)
            if (mtch is None):
                continue

            # check if this session ID was already processed
            sessionid = mtch.group(1)

            if (sessionid in processedsessions):
                continue

            processedsessions.append(sessionid)

            rowdict["sessionid"] = sessionid

            sesspage = http.request("GET", sessurl)
            soup_sesspage = BeautifulSoup(sesspage.data, "lxml")
            sesspage.close()

            # session abstract
            sessabstract = soup_sesspage.find("pre", class_ ="abstract")
            rowdict["Abstract"] = sessabstract.text
            if (len(rowdict["Abstract"]) == 0):
                continue

            # speakers
            allspeakers = soup_sesspage.find_all("a", id=re.compile("Detail.+lnkSpeaker"))

            rowdict["Speaker1"] = None
            rowdict["Speaker2"] = None
            rowdict["Speaker3"] = None

            if (len(allspeakers) >= 1):
                rowdict["Speaker1"] = allspeakers[0].text

            if (len(allspeakers) >= 2):
                rowdict["Speaker2"] = allspeakers[1].text

            if (len(allspeakers) == 3):
                rowdict["Speaker3"] = allspeakers[2].text
        else:
            continue

        documents.append(rowdict)

OutputDataSet = pd.DataFrame(documents)'
GO

select *
from PASS2017Sessions
GO

If all goes well you should see the results:

Isn’t that so cool! Play around with it and let me know what you think. And if you are at the Summit and interested in SQL Server ML Services, a friendly reminder to come to the Real-world SQL Server R Services session. See you later!

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment. THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code. This posting is provided “AS IS” with no warranties, and confers no rights.

Arvind Shyamsundar's technical blog

Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.

Create your own database with all PASS Summit 2017 Sessions using SQL Server 2017 ML Services (Python)

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply