On The Buses II: Fuzzy String Matching

Posted by 阿新 • Published: 2018-12-29

This is the second part of a series of posts about my pet data science project exploring the availability of transport across different areas of Manchester. For those playing catch-up, you might want to take a look at the first post in this series before continuing.

[Image: heatmap] The great Mancunian flying spaghetti monster: density of bus stops for TfGM bus routes across Manchester.

In the first post I looked at how to find out where all the bus routes in Manchester go. In this post I’m going to look at how often they go there.

This all feeds into my objective of determining the availability of buses across Manchester. Ultimately I want to define availability as the average number of buses per hour in a day, i.e. a bus stop with one bus every 20 minutes would have the same availability as a bus stop with three buses arriving together once an hour.
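As a rough sketch of that definition: the function name availability, the representation of departures as minutes after midnight, and the choice to average over a full 24-hour day are all assumptions made for illustration, and nothing here uses real TfGM data.

```python
def availability(departure_minutes, hours_in_day=24):
    """Average number of buses per hour over a whole day."""
    return len(departure_minutes) / float(hours_in_day)


# Stop A: one bus every 20 minutes, around the clock.
stop_a = list(range(0, 24 * 60, 20))

# Stop B: three buses arriving together on the hour, around the clock.
stop_b = [hour * 60 for hour in range(24) for _ in range(3)]

print(availability(stop_a))  # 3.0
print(availability(stop_b))  # 3.0
```

Both stops come out at three buses per hour, which is exactly the equivalence described above: the measure counts buses, not how evenly they are spaced.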

Code-wise, there are two key parts to this post:

How to navigate multi-level HTML using Selenium;
How to deal with inconsistent labelling using fuzzy string matching.

Web-crawling multi-level HTML with Selenium

All of the timetable information for bus routes in Greater Manchester is provided by Transport for Greater Manchester (TfGM) on their website. You can download a PDF timetable for each route from the home page, and I did think about trying to scrape those, but they don’t list all the stops, and the stops have different names from the ones labelled with longitudes and latitudes.

The alternative is to use the travel planning pages of the TfGM website, which render an HTML timetable for each route that includes all the stops. However, this approach is not without slight problems: (1) we need to submit a web-form to enter the route number we want; (2) there are multiple routes with the same number; (3) this whole thing uses a web-page which doesn’t render the whole document object model (DOM) at once.

It’s the last one which really had me stumped for a while.

OK. I chose to use the Selenium library to navigate these pages [my intro to using Selenium can be found here] and the first steps are pretty straightforward. To start with we can define functions to start and stop a Selenium web-browser driver:

 

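import time

from fuzzywuzzy import process  # fuzzy matching, introduced later in this post
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def init_driver():

	# start a Chrome session and attach a default 5-second explicit wait:
	driver = webdriver.Chrome()
	driver.wait = WebDriverWait(driver, 5)

	return driver

def close_driver(driver):

	# end the session and close the browser:
	driver.quit()

	return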

Once our driver is initiated we can navigate to the TfGM timetables webpage, which presents a search form.

It’s easy to identify the search box elements in the HTML using “inspect element” in the browser. We only need to fill in the top box, which has id='busServiceSearch'. We enter the bus route number we’re looking for and click the search button (class='btn').

 

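def enter_bus_number(driver, number):

	# load the timetable search page:
	driver.get("https://my.tfgm.com/#/timetables/")

	# type the route number into the top search box:
	search_field = driver.find_element(By.ID, "busServiceSearch")
	search_field.send_keys(number)

	# let element lookups retry for up to 1 second:
	driver.implicitly_wait(1)

	# click the search button:
	driver.find_element(By.CLASS_NAME, "btn").click()

	return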

So far, so good. Once we’ve clicked the search button, the web page we see in the browser will display a list of possible bus routes (search results) that have the number we entered. For example, if we had entered “1” we would see:

[Image: tfgm2, screenshot of the search results]

Even though it looks completely different, this is not a new web-page: it has exactly the same URL as the previous one with the search boxes, and if we downloaded the HTML (either using the browser or using driver.page_source in Selenium) we wouldn’t see any elements corresponding to the search results.

However, if we use “inspect element” in the browser to see the HTML for each search result, it will look like this:

	<li class="ng-scope" ng-repeat="bus in bus.routes" ng-keydown="kbClickHandler($event);" ng-click="selectTimetable(bus.uid)" aria-label="1. BLACKBURN - BOLTON. Transdev Lancashire United." tabindex="0">

    <span class="timetable-code ng-binding">
        1
    </span>
    BLACKBURN - BOLTON
    <span class="timetable-operator ng-binding" ng-show="bus.operatorLink === undefined">
          Transdev Lancashire United
    </span>
    <span class="timetable-operator ng-hide" ng-show="bus.operatorLink !== undefined">
        <a href="" target="_blank" class="ng-binding">
            Transdev Lancashire United
        </a>
    </span></li>

But if we ran:

driver.find_element(By.CLASS_NAME, "ng-scope")

in Python, it wouldn’t find anything.

Searching through layers with Selenium

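The reason is that this is a complex webpage that runs scripts and has multi-level HTML source. I’m probably going to get the terminology wrong here, but as I understand things, Selenium only natively sees the top layer of HTML. Timing probably plays a part too: results like these are typically added to the page by JavaScript after the search is submitted, so a lookup that runs straight away may find nothing simply because the elements do not exist yet.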

It is possible to see the whole thing, but you have to run a snippet of JavaScript through Selenium, using something like this:

fullhtml = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

To search for HTML element attributes across all levels of the page, you need to use the Selenium find_elements command with an XPath expression:

driver.find_elements(By.XPATH, '//*[@ng-repeat="bus in bus.routes"]')

(I’ve just used the ng-repeat attribute here because the class ng-scope is not unique to the search result items.)

Putting it into a function looks like this:

 

def select_route(driver, inroute):

	xpath = '//*[@ng-repeat="bus in bus.routes"]'

	# set a wait time for the driver [10 sec here]:
	wait = WebDriverWait(driver, 10)
	try:
		# keep checking to see if the search results have loaded yet:
		wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))

		# when they have loaded, extract the search result elements:
		routes = driver.find_elements(By.XPATH, xpath)

		# extract the description of the route from each search result
		# (the second line of the element's text):
		endpoints = [route.text.split('\n')[1] for route in routes]

		# do fuzzy string matching; extractOne returns (match, score):
		endpoint = process.extractOne(inroute, endpoints)[0]

		# click the first search result whose description matches:
		for route in routes:
			if route.text.split('\n')[1] == endpoint:
				route.click()
				time.sleep(5)  # give the timetable time to load
				break

	except TimeoutException:
		print('timeout')

	return

Fuzzy String Matching

You can see from the HTML extract above that there are three text components associated with each search result, e.g.

1
BLACKBURN - BOLTON
Transdev Lancashire United

I’ve picked out the one corresponding to the route endpoints using route.text.split('\n')[1]. I then need to match it to my own description of the route. That’s where the second part of this post comes in.
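As a small illustration of that split, here is the same operation applied to the text of the search result shown earlier. The variable names are made up for this example, and it assumes the element's text comes back as exactly three lines, as in the listing above.

```python
# Text of one search result, as three newline-separated components.
result_text = "1\nBLACKBURN - BOLTON\nTransdev Lancashire United"

# Split into route number, endpoints and operator, trimming stray whitespace.
number, endpoints, operator = [part.strip() for part in result_text.split('\n')]

print(endpoints)  # BLACKBURN - BOLTON
```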

The name of each bus route contains a number and an endpoint. For example, the first bus route listed on TfGM is ‘1-blackburn’, the second is ‘1-bolton’ and the third is ‘1-piccadilly’. ‘1-blackburn’ and ‘1-bolton’ are the same bus going in opposite directions; ‘1-piccadilly’ is a circular route.

In order to match the routes I took from the TfGM bus routes web-page to the info on the TfGM timetables web-page I need to match those names to the endpoints. They’re not exactly the same, so I require a fuzzy string matching algorithm. I was going to write my own but, thanks to a pointer from this blog post, I discovered that (of course) there’s already a Python library to do it: the unfortunately named FuzzyWuzzy, which is pip installable:

pip install fuzzywuzzy
pip install python-Levenshtein

Using it is incredibly easy. To test an input string (teststring) against a list of options (listofstrings) and return the single most likely match:

 

from fuzzywuzzy import process

endpoint = process.extractOne(teststring, listofstrings)[0]

…and that’s it. What the function actually does is score the teststring against each option in the listofstrings, using similarity ratios based on the Levenshtein distance. It returns the best-scoring option together with its score, hence the [0].

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) that would need to be made to change one of the strings being compared into the other.
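To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance, plus a helper that picks the closest option from a list. It is an illustration only and not FuzzyWuzzy’s implementation: the names levenshtein and closest_match are made up for this example, the comparison is lower-cased, and it ranks by raw edit distance rather than by FuzzyWuzzy’s scores, so on real data the two could pick different winners.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def closest_match(query, options):
    """Return the option with the smallest case-insensitive edit distance."""
    return min(options, key=lambda option: levenshtein(query.lower(), option.lower()))


print(levenshtein("kitten", "sitting"))  # 3
print(closest_match("bolton", ["BLACKBURN", "BOLTON", "PICCADILLY"]))  # BOLTON
```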

The click at the end of select_route should have brought us to the point where we can see the timetable information. To extract it we can just employ the same approach again:

def get_timetable_info(driver):

	xpath = '//*[@ng-repeat="stop in timetable.current.stops"]'

	allstops = []
	wait = WebDriverWait(driver, 10)
	try:
		# wait until the timetable rows are visible:
		wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))

		# extract stop timetable elements from page:
		stops = driver.find_elements(By.XPATH, xpath)

		for stop in stops:
			# get name of stop:
			stopname = stop.find_element(By.CLASS_NAME, "timetable-stop").get_attribute("title")

			# get actual times bus stops at stop (empty cells give empty strings):
			cells = stop.find_elements(By.CLASS_NAME, "timetable-time")
			stoptimes = [cell.text for cell in cells]

			# get number of times bus stops at stop in one day:
			ntime = len([t for t in stoptimes if t])

			# put info into a dict:
			stopinfo = {}
			stopinfo['stop name'] = stopname
			stopinfo['daily freq'] = ntime
			stopinfo['stop times'] = stoptimes

			# add the dict into a list:
			allstops.append(stopinfo)

	except TimeoutException:
		print('timeout')
		allstops = []

	return allstops

Putting it all together

With these functions defined we can simply call them one at a time to extract the timetable data.

if __name__ == "__main__":

	driver = init_driver()

	route = '1-blackburn'
	bus = route.split('-')[0]
	origin = route.split('-')[1]

	enter_bus_number(driver, bus)
	select_route(driver, origin)
	allstops = get_timetable_info(driver)

	close_driver(driver)

And that’s it for now.
