WebCrawling: YouTube Pagination in Python

A while ago I wrote a blog post about how to scrape videos from YouTube. One question I’ve been asked since is how to navigate between different pages of search results. So here’s how.

YouTube

The preamble is exactly the same as in the earlier post:

from bs4 import BeautifulSoup as bs
import requests

base = "https://www.youtube.com/results?search_query="
qstring = "boddingtons+advert"

# Fetch the first page of search results
r = requests.get(base + qstring)

# Parse the returned HTML
page = r.text
soup = bs(page, 'html.parser')

Pagination

Then we need to find the piece of HTML that corresponds to the pagination buttons. If you print out the “soup”, the relevant section looks like this:

<a aria-label="Go to page 2" class="yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default" data-sessionlink="itct=CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" data-visibility-tracking="CAkQnKQBGAciEwjDhY_x4azXAhUUjBUKHXJHBsso9CQ" href="/results?sp=SBRQFOoDAA%253D%253D&amp;search_query=boddingtons+advert"><span class="yt-uix-button-content">Next »</span></a>

To find it using BeautifulSoup we can simply specify the ‘class’ as a filter:

buttons = soup.find_all('a', attrs={'class': "yt-uix-button vve-check yt-uix-sessionlink yt-uix-button-default yt-uix-button-size-default"})
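Matching on the full class string only works while the attribute is written exactly that way, since BeautifulSoup compares a multi-class string against the literal attribute value. A slightly more forgiving variant, sketched here against a trimmed stand-in for the results page, is to match a single class and the “Go to page N” aria-label. The `html` snippet and its `PAGE2`/`PAGE3` tokens are made up for illustration and are not real YouTube values.

```python
import re
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the search results HTML
html = '''
<a class="yt-uix-button vve-check yt-uix-sessionlink" aria-label="Go to page 2" href="/results?sp=PAGE2&amp;search_query=boddingtons+advert">2</a>
<a class="yt-uix-sessionlink yt-uix-button vve-check" aria-label="Go to page 3" href="/results?sp=PAGE3&amp;search_query=boddingtons+advert">3</a>
<a class="yt-uix-button" href="/upload">Upload</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# A single class name matches regardless of the order of the other classes;
# the aria-label pattern filters out buttons that are not pagination links.
buttons = soup.find_all(
    'a',
    attrs={'class': 'yt-uix-button',
           'aria-label': re.compile(r'^Go to page \d+')},
)

hrefs = [button['href'] for button in buttons]
print(hrefs)
```

Note that the parser decodes `&amp;` in the attribute, so each href comes back with a plain `&`.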

There are multiple pagination buttons on the page: one each for pages 2 – 7 and finally “Next »”. Each one has its own URL, which you can print out like this:

for button in buttons:
    print(button['href'])

The “Next »” button is normally the one you’re looking for, and it is helpfully the last one in the list:

nextbutton = buttons[-1]
print(nextbutton['href'])

We can navigate to it by calling requests.get() once again. The href is relative, so it needs "https://www.youtube.com" in front of it first.
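Because the href on the button starts with "/results", it has to be turned into a full URL before it can be requested. Here is a minimal sketch using the standard library's urljoin, with the href taken from the HTML shown earlier; the helper name `absolute_url` is just an illustration.

```python
from urllib.parse import urljoin

def absolute_url(page_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(page_url, href)

# The page we scraped, and the href of its "Next" button (entities already decoded)
page_url = "https://www.youtube.com/results?search_query=boddingtons+advert"
href = "/results?sp=SBRQFOoDAA%253D%253D&search_query=boddingtons+advert"

next_url = absolute_url(page_url, href)
print(next_url)

# The next page can then be fetched and parsed the same way as the first:
# r = requests.get(next_url)
# soup = bs(r.text, 'html.parser')
```

Repeating the find-button, resolve, and fetch steps on each new page walks through the results one page at a time.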

