
On The Buses II: Fuzzy String Matching

This is the second part of a series of posts about my pet data science project exploring the availability of transport across different areas of Manchester. For those playing catch-up, you might want to take a look at the first post in this series before continuing.

[Figure: heatmap]
The great Mancunian flying spaghetti monster: density of bus stops for TfGM bus routes across Manchester.

In the first post I looked at how to find out where all the bus routes in Manchester go. In this post I’m going to look at how often they go there.

This all feeds into my objective of determining the availability of buses across Manchester. Ultimately I want to define availability as the average number of buses per hour in a day, i.e. a bus stop with one bus every 20 minutes would have the same availability as a bus stop with three buses once an hour.
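
As a rough illustration of that definition, here is a minimal sketch. The function name availability and both timetables are made up for the example, and departure times are given as minutes after midnight.

```python
def availability(departures, hours_in_day=24):
    """Average number of buses per hour over a whole day."""
    return len(departures) / float(hours_in_day)

# Stop A: one bus every 20 minutes, all day (made-up timetable).
stop_a = list(range(0, 24 * 60, 20))

# Stop B: three buses together on the hour, every hour (made-up timetable).
stop_b = [hour * 60 for hour in range(24) for _ in range(3)]

print(availability(stop_a))  # 3.0
print(availability(stop_b))  # 3.0
```

Both stops come out at three buses per hour, even though waiting at stop B would feel rather different.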

Code-wise, there are two key parts to this post:

  • How to navigate multi-level HTML using Selenium;
  • How to deal with inconsistent labelling using fuzzy string matching.

Web-crawling multi-level HTML with Selenium

All of the timetable information for bus routes in Greater Manchester is provided by Transport for Greater Manchester (TfGM) on their website. You can download a PDF timetable for each route from the home page and I did think about trying to scrape those, but… they don’t list all the stops, and the stops have different names from the ones labelled with longitudes and latitudes.

The alternative is to use the travel planning pages of the TfGM website, which render an HTML timetable for each route that includes all the stops. However, this approach is not without slight problems: (1) we need to submit a web-form to enter the route number we want; (2) there are multiple routes with the same number; (3) this whole thing uses a web-page which doesn’t render the whole document object model (DOM) at once.

It’s the last one which really had me stumped for a while.

OK. I chose to use the Selenium library to navigate these pages [my intro to using Selenium can be found here] and the first steps are pretty straightforward. To start with we can define functions to start and stop a selenium web-browser driver:

 

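# imports used by the functions in this post:
import time

from fuzzywuzzy import process
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def init_driver():

	# start a Chrome browser session:
	driver = webdriver.Chrome()
	driver.wait = WebDriverWait(driver, 5)

	return driver

def close_driver(driver):

	# quit() closes every window and ends the session:
	driver.quit()

	return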

Once our driver is initiated we can then navigate to the TfGM timetables webpage.

It’s easy to identify the search box elements in the HTML using “inspect element” in the browser. We only need to fill in the top box, which has id='busServiceSearch'. We enter the bus route number we’re looking for and click the search button (class='btn').

 

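def enter_bus_number(driver, number):

	driver.get("https://my.tfgm.com/#/timetables/")

	# type the route number into the top search box:
	search_field = driver.find_element(By.ID, "busServiceSearch")
	search_field.send_keys(number)

	# allow later element look-ups up to 1 sec to succeed:
	driver.implicitly_wait(1)

	# click the search button:
	driver.find_element(By.CLASS_NAME, "btn").click()

	return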

So far, so good. Once we’ve clicked the search button, the web page we see in the browser will display a list of possible bus routes (search results) that have the number we entered. For example, if we had entered “1” we would see:

[Screenshot: search results for bus route 1]

Even though it looks completely different, this is not a new web-page: it has exactly the same URL as the previous one with the search boxes, and if we downloaded the HTML (either using the browser or using driver.page_source in selenium) we wouldn’t see any elements corresponding to the search results.

However, if we use “inspect element” in the browser to see the HTML for each search result, it will look like this:

	<li class="ng-scope" ng-repeat="bus in bus.routes" ng-keydown="kbClickHandler($event);" ng-click="selectTimetable(bus.uid)" aria-label="1. BLACKBURN - BOLTON. Transdev Lancashire United." tabindex="0">

    <span class="timetable-code ng-binding">
        1
    </span>
    BLACKBURN - BOLTON
    <span class="timetable-operator ng-binding" ng-show="bus.operatorLink === undefined">
          Transdev Lancashire United
    </span>
    <span class="timetable-operator ng-hide" ng-show="bus.operatorLink !== undefined">
        <a href="" target="_blank" class="ng-binding">
            Transdev Lancashire United
        </a>
    </span></li>

But if we ran:

driver.find_element(By.CLASS_NAME, "ng-scope")

in Python, it wouldn’t find anything.

Searching through layers with Selenium

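The reason is that this is a complex webpage that runs scripts and builds its content dynamically. I’m probably going to get the terminology wrong here, but as I understand things, the search results are inserted into the page by JavaScript after the initial HTML has loaded (the ng- attributes suggest AngularJS), so a lookup made too early, or a copy of the original source, won’t contain them.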

It is possible to see the whole thing, but you have to run a snippet of JavaScript through selenium, using something like this:

fullhtml = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

To search for HTML attributes across the whole rendered page, you can use the selenium find_elements command with an XPath expression:

driver.find_elements(By.XPATH, '//*[@ng-repeat="bus in bus.routes"]')

(I’ve just used the ng-repeat attribute here because the class ng-scope is not unique to the search result items.)
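
To see why the attribute makes a better selector than the class, here is a small sketch that runs the same kind of XPath query without a browser. It uses Python's built-in ElementTree rather than selenium, and the page fragment is made up (a cut-down imitation of the search results), so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified fragment: several elements share the class
# "ng-scope", but only the search results carry the ng-repeat attribute.
page = ET.fromstring("""
<div class="ng-scope">
  <ul>
    <li class="ng-scope" ng-repeat="bus in bus.routes">1 BLACKBURN - BOLTON</li>
    <li class="ng-scope" ng-repeat="bus in bus.routes">1 BOLTON - BLACKBURN</li>
  </ul>
  <p class="ng-scope">Some other widget</p>
</div>
""")

by_class = page.findall(".//*[@class='ng-scope']")
by_repeat = page.findall(".//*[@ng-repeat='bus in bus.routes']")

print(len(by_class))   # 3: both results plus the unrelated paragraph
print(len(by_repeat))  # 2: just the search results
```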

Putting it into a function looks like this:

 

def select_route(driver, inroute):

	# set a wait time for the driver [10 sec here]:
	wait = WebDriverWait(driver, 10)
	xpath = '//*[@ng-repeat="bus in bus.routes"]'
	try:
		# keep checking to see if the page has loaded yet:
		wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))

		# when it's loaded, extract the search result elements:
		routes = driver.find_elements(By.XPATH, xpath)

		# extract the description of the route from each search result:
		endpoints = [route.text.split('\n')[1] for route in routes]

		# do fuzzy string matching:
		endpoint = process.extractOne(inroute, endpoints)[0]

		# extract matching element and click it:
		for route in routes:
			if route.text.split('\n')[1] == endpoint:
				route.click()
				time.sleep(5)  # wait for the timetable to load
				break

	except TimeoutException:
		print('timeout')

	return

Fuzzy String Matching

You can see from the HTML extract above that there are three text components associated with each search result, e.g.

1
BLACKBURN - BOLTON
Transdev Lancashire United

I’ve picked out the one corresponding to the route endpoints using route.text.split('\n')[1]. I then need to match it to my own description of the route. That’s where the second part of this post comes in.
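
For anyone who wants all three components rather than just the middle one, the split can be sketched as below. The helper name parse_result is made up, and it assumes the element text always arrives as three non-empty lines, as in the example above.

```python
def parse_result(text):
    """Split a search result's text into (number, endpoints, operator)."""
    # drop empty lines and stray whitespace around each component
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    number, endpoints, operator = lines[:3]
    return number, endpoints, operator

sample = "1\nBLACKBURN - BOLTON\nTransdev Lancashire United"
print(parse_result(sample))
# ('1', 'BLACKBURN - BOLTON', 'Transdev Lancashire United')
```

If a result has fewer than three lines the unpacking raises a ValueError, which is probably what you want to know about anyway.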

The name of each bus route contains a number and an endpoint. For example, the first bus route listed on TfGM is ‘1-blackburn’, the second is ‘1-bolton’ and the third is ‘1-piccadilly’. ‘1-blackburn’ and ‘1-bolton’ are the same bus going in opposite directions; ‘1-piccadilly’ is a circular route.
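
Pulling the number and the endpoint out of one of these names is a one-liner. This is a small sketch with a made-up helper name, split_route_name; it splits on the first hyphen only, so an endpoint that itself contains a hyphen (the second name below is invented) stays in one piece.

```python
def split_route_name(name):
    """Split a name such as '1-blackburn' into (number, endpoint)."""
    number, _, endpoint = name.partition('-')
    return number, endpoint

print(split_route_name('1-blackburn'))    # ('1', 'blackburn')
print(split_route_name('99-some-place'))  # ('99', 'some-place')
```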

In order to match the routes I took from the TfGM bus routes web-page to the info on the TfGM timetables web-page I need to match those names to the endpoints. They’re not exactly the same, so I require a fuzzy string matching algorithm. I was going to write my own but, thanks to a pointer from this blog post, I discovered that (of course) there’s already a Python library to do it: the unfortunately named FuzzyWuzzy, which is pip installable:

pip install fuzzywuzzy
pip install python-Levenshtein

Using it is incredibly easy. To test an input string (teststring) against a list of options (listofstrings) and return the single most likely match:

 

from fuzzywuzzy import process

endpoint = process.extractOne(teststring, listofstrings)[0]

…and that’s it. What the function actually does is to score the teststring against each option in the listofstrings, using a similarity measure based on the Levenshtein distance, and return the best match along with its score (hence the [0]).
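
To get a feel for what is going on without installing anything, here is a rough sketch of the same idea using only the standard library. It uses difflib's SequenceMatcher as a stand-in, so the scores will not be the ones FuzzyWuzzy reports; the function name best_match and the list of options are made up for the example.

```python
from difflib import SequenceMatcher

def best_match(query, options):
    """Return (option, score) for the option most similar to query."""
    def score(option):
        # compare case-insensitively; ratio() is 1.0 for identical strings
        return SequenceMatcher(None, query.lower(), option.lower()).ratio()
    best = max(options, key=score)
    return best, score(best)

# made-up search results:
options = ["BLACKBURN - BOLTON", "PICCADILLY - CIRCULAR", "ALTRINCHAM - STOCKPORT"]

match, score = best_match("blackburn", options)
print(match)  # BLACKBURN - BOLTON
```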

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) that would need to be made to change one of the strings being compared into the other.
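
For the curious, the distance itself can be calculated with a short dynamic-programming routine that counts insertions, deletions and substitutions. This is a minimal sketch of the textbook approach, not the code FuzzyWuzzy or python-Levenshtein actually run.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```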

That final click should have brought us to the point where we can see the timetable information. To extract it we can just employ the same approach again:

def get_timetable_info(driver):

	wait = WebDriverWait(driver, 10)
	xpath = '//*[@ng-repeat="stop in timetable.current.stops"]'
	allstops = []
	try:
		wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))

		# extract stop timetable elements from page:
		stops = driver.find_elements(By.XPATH, xpath)

		for stop in stops:
			# get name of stop:
			stopname = stop.find_element(By.CLASS_NAME, "timetable-stop").get_attribute("title")

			# get actual times bus stops at stop:
			times = stop.find_elements(By.CLASS_NAME, "timetable-time")
			stoptimes = [t.text for t in times]

			# get number of times bus stops at stop in one day (skip blanks):
			ntime = len([t for t in stoptimes if t])

			# put info into a dict:
			stopinfo = {}
			stopinfo['stop name'] = stopname
			stopinfo['daily freq'] = ntime
			stopinfo['stop times'] = stoptimes

			# add the dict into a list:
			allstops.append(stopinfo)

	except TimeoutException:
		print('timeout')

	return allstops
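
Once the list of dicts is back, turning it into per-stop numbers is straightforward. The sketch below uses made-up data in the same shape as the output of get_timetable_info() (the stop names and time strings are invented, and I am assuming blank strings are timetable cells where the bus doesn't call); the helper name summarise is also made up.

```python
def summarise(allstops, hours_in_day=24):
    """Return (stop name, buses per day, average buses per hour) tuples."""
    summary = []
    for stop in allstops:
        # count only the non-blank timetable cells
        ndaily = len([t for t in stop['stop times'] if t.strip()])
        summary.append((stop['stop name'], ndaily, ndaily / float(hours_in_day)))
    return summary

# made-up example data:
example = [
    {'stop name': 'Stop A', 'stop times': ['0700', '0720', '', '0800']},
    {'stop name': 'Stop B', 'stop times': ['', '', '', '']},
]

for row in summarise(example):
    print(row)
# ('Stop A', 3, 0.125)
# ('Stop B', 0, 0.0)
```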

Putting it all together

With these functions defined we can simply call them one at a time to extract the timetable data.

if __name__ == "__main__":

	driver = init_driver()

	route = '1-blackburn'
	bus = route.split('-')[0]
	origin = route.split('-')[1]

	enter_bus_number(driver, bus)
	select_route(driver, origin)
	allstops = get_timetable_info(driver)

	close_driver(driver)
 

And that’s it for now.
