Next week is the biggest event for the SQL Server community: PASS Summit 2017! I will be presenting the Real-world SQL Server R Services session with two of our customers – APT and Financial Fabric. I will also be part of the SQL Clinic, so I’m really excited and pumped about next week!
Being a Data Nut, it suddenly occurred to me today: what if we can import all the PASS Summit 2017 session titles, abstracts and other details into a SQL Server table? And once this thought was in my mind, there was no stopping me! The wonderful thing is that with Python in SQL Server 2017 you can leverage rich libraries such as Beautiful Soup 4, URLLib3 to parse HTML and then present it as a structured table (using Pandas) which SQL can then consume.
The code below will do exactly that for you. It leverages the above mentioned Python libraries, so prior to executing the script you must install two libraries (bs4 & urllib3) using PIP. (Note that pip.exe is present under the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\scripts folder; change drive letters as appropriate for your installation.) Also note that PIP must be executed from within an administrative CMD prompt in Windows.
The other step for the below script to run correctly is that outgoing Internet access must be allowed for the SQL instance’s R / Python scripts. By default, we block such access. To edit the rule temporarily (we strongly recommend to not disable this rule on a permanent basis) use wf.msc to open the Firewall with Advanced Security console, and then locate the rule “Block network access for R local user accounts in SQL Server instance ” in the Outbound Rules section. Right click and Disable the rule for now; and DO NOT FORGET to enable it later on!
Alright, here is the code!
CREATE DATABASE PASS2017 GO USE PASS2017 GO DROP TABLE IF EXISTS PASS2017Sessions CREATE TABLE PASS2017Sessions (Abstract nvarchar(max) , SessionLevel int , Speaker1 nvarchar(100) , Speaker2 nvarchar(100) , Speaker3 nvarchar(100) , Title nvarchar(4000) , Track nvarchar(50) , SessionId int ); GO INSERT PASS2017Sessions exec sp_execute_external_script @language = N'Python', @script = N' from bs4 import BeautifulSoup import urllib3 import re import pandas as pd http = urllib3.PoolManager() schedpage = http.request("GET", "http://www.pass.org/summit/2017/Sessions/Schedule.aspx") schedpage.status soup_schedpage = BeautifulSoup(schedpage.data, "lxml") schedpage.close() documents = [] processedsessions = [] sessioncells = soup_schedpage.find_all("div", class_ = "session-schedule-cell", recursive=True) for currsess in sessioncells: hrefs = currsess.find_all("a") if (len(hrefs) >= 1): rowdict = {} # session title rowdict["Title"] = hrefs[0].text # session level sesslevel = currsess.find("p", id = re.compile("plcLevel")) if (sesslevel != None): rowdict["Level"] = sesslevel.text.replace("Level: ", "") else: rowdict["Level"] = None # session track allps = currsess.find_all("p") rowdict["Track"] = allps[len(allps) -2].text # get into session page itself if ("href" in hrefs[0].attrs): sessurl = hrefs[0].attrs["href"] # session ID mtch = re.search(r"sid=(d+)", sessurl) if (mtch is None): continue # check if this session ID was already processed sessionid = mtch.group(1) if (sessionid in processedsessions): continue processedsessions.append(sessionid) rowdict["sessionid"] = sessionid sesspage = http.request("GET", sessurl) soup_sesspage = BeautifulSoup(sesspage.data, "lxml") sesspage.close() # session abstract sessabstract = soup_sesspage.find("pre", class_ ="abstract") rowdict["Abstract"] = sessabstract.text if (len(rowdict["Abstract"]) == 0): continue # speakers allspeakers = soup_sesspage.find_all("a", id=re.compile("Detail.+lnkSpeaker")) rowdict["Speaker1"] = None rowdict["Speaker2"] = None rowdict["Speaker3"] = None if (len(allspeakers) >= 1): rowdict["Speaker1"] = allspeakers[0].text if (len(allspeakers) >= 2): rowdict["Speaker2"] = allspeakers[1].text if (len(allspeakers) == 3): rowdict["Speaker3"] = allspeakers[2].text else: continue documents.append(rowdict) OutputDataSet = pd.DataFrame(documents)' GO select * from PASS2017Sessions GO
If all goes well you should see the results:
Isn’t that so cool! Play around with it and let me know what you think. And if you are at the Summit and interested in SQL Server ML Services, a friendly reminder to come to the Real-world SQL Server R Services session. See you later!
Disclaimer