Scraping Web Page Information with Python

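I. A Quick Look at HTML Pages

1. Recommended browser:

Use Chrome. Its Inspect Element panel shows both the HTML code and the CSS styles.

2. What a page is made of:

A web page has three main parts: JavaScript handles behavior, HTML handles structure, and CSS handles styling. On local disk a page usually consists of three pieces: html + images + css.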

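3. Common tags and structure

<div></div> divides the page into regions
<div class="aasdf"></div> the class attribute attaches a style
<p>wowiji</p> text content
<li></li> list item
<img> image
<h1></h1>....<h6></h6> six heading levels, each with a different font size
<a href=""></a> hyperlink


Tags can be nested inside one another.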
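
To make the nesting idea concrete, here is a minimal sketch that prints each opening tag of a small snippet indented by its depth. It uses only the standard library's `html.parser`; the class name `OutlinePrinter` and the `VOID_TAGS` set are illustrative names, not part of the original notes.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class OutlinePrinter(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


snippet = (
    '<div class="header"><ul class="nav"><li><a href="#">Home</a></li></ul>'
    '<img src="images/blah.png"></div>'
)

parser = OutlinePrinter()
parser.feed(snippet)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows `a` inside `li` inside `ul` inside `div`, with `img` as a direct child of `div`.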

4. Hands-on: building a web page

Tool: PyCharm

Files: sample.html, main.css

Main layout: header (title bar + navigation bar), content (main body), footer

5. Rendered page


6. HTML source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The blah</title>
    <link rel="stylesheet" type="text/css" href="main.css">
</head>
<body>
    <div class="header">
        <img src="images/blah.png">
        <ul class="nav">
            <li><a href="#">Home</a></li>
            <li><a href="#">Site</a></li>
            <li><a href="#">Other</a></li>
        </ul>
    </div>
    <div class="main-content">
        <h2>Article</h2>
        <ul class="article">
            <li>
                <img src="images/0001.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0002.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0003.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0004.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
        </ul>
    </div>
    <div class="footer">
        <p>@xumeng</p>
    </div>
</body>
</html>


7. CSS source

body {
    padding: 0 0 0 0;
    background-color: #ffffff;
    background-image: url(images/bg3-dark.jpg);
    background-position: top left;
    background-repeat: no-repeat;
    background-size: cover;
    font-family: Helvetica, Arial, sans-serif;
}
.main-content {
    width: 500px;
    padding: 20px 20px 20px 20px;
    border: 1px solid #dddddd;
    border-radius:25px;
    margin: 30px auto 0 auto;
    background: #f1f1f1;
    -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
    -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
    box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
}
.main-content p {
    line-height: 26px;
}
.main-content h2 {
    color: dimgray;
}
 
.nav {
    padding-left: 0;
    margin: 5px 0 20px 0;
    text-align: center;
}
.nav li {
    display: inline;
    padding-right: 10px;
}
.nav li:last-child {
    padding-right: 0;
}
.header {
    padding: 10px 10px 10px 10px;
 
}
 
.header a {
    color: #ffffff;
}
.header img {
    display: block;
    margin: 0 auto 0 auto;
}
.header h1 {
    text-align: center;
}
 
.article {
    list-style-type: none;
    padding: 0;
}
.article li {
    border: 1px solid #f6f8f8;
    background-color: #ffffff;
    height: 90px;
}
.article h3 {
    border-bottom: 0;
    margin-bottom: 5px;
}
.article a {
    color: #37a5f0;
    text-decoration: none;
}
.article img {
    float: left;
    padding-right: 11px;
}
 
.footer {
    margin-top: 20px;
}
.footer p {
    color: #aaaaaa;
    text-align: center;
    font-weight: bold;
    font-size: 12px;
    font-style: italic;
    text-transform: uppercase;
}
 
 
 
 
 
 
.post {
    padding-bottom: 2em;
}
.post-title {
    font-size: 2em;
    color: #222;
    margin-bottom: 0.2em;
}
.post-avatar {
    border-radius: 50px;
    float: right;
    margin-left: 1em;
}
.post-description {
    font-family: Georgia, "Cambria", serif;
    color: #444;
    line-height: 1.8em;
}
.post-meta {
    color: #999;
    font-size: 90%;
    margin: 0;
}
 
.post-category {
    margin: 0 0.1em;
    padding: 0.3em 1em;
    color: #fff;
    background: #999;
    font-size: 80%;
}
.post-category-design {
    background: #5aba59;
}
.post-category-pure {
    background: #4d85d1;
}
.post-category-yui {
    background: #8156a7;
}
.post-category-js {
    background: #df2d4f;
}
 
.post-images {
    margin: 1em 0;
}
.post-image-meta {
    margin-top: -3.5em;
    margin-left: 1em;
    color: #fff;
    text-shadow: 0 1px 1px #333;
}


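8. Note:

There are ten images in total. Mind the relative paths: the CSS file, the HTML file, and the images folder all sit in the same directory.

Note to self: this project lives at: F:\Python實戰:四周實現爬蟲系統\作業程式碼\第一週\上課_1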

 

II. Parsing Elements in a Local File

1. HTML source of the file to parse

<html>
<head>
    <link rel="stylesheet" type="text/css" href="new_blah.css">
</head>
<body>
    <div class="header">
        <img src="images/blah.png">
        <ul class="nav">
            <li><a href="#">Home</a></li>
            <li><a href="#">Site</a></li>
            <li><a href="#">Other</a></li>
        </ul>
    </div>
    <div class="main-content">
        <h2>Article</h2>
        <ul class="articles">
            <li>
                <img src="images/0001.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">fun</span>
                        <span class="meta-cate">Wow</span>
                    </p>
                    <p class="description">white sands and turquoise waters</p>
                </div>
                <div class="rate">
                    <span class="rate-score">4.5</span>
                </div>
            </li>
            <li>
                <img src="images/0002.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">How to get tanned</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
                    </p>
                    <p class="description">hot bikini girls on beach</p>
                </div>
                <div class="rate">
                    <img src="images/Fire.png" width="18" height="18">
                    <span class="rate-score">5.0</span>
                </div>
            </li>
            <li>
                <img src="images/0003.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">sea</span>
                    </p>
                    <p class="description">To make the most of your visit</p>
                </div>
                <div class="rate">
                    <span class="rate-score">3.5</span>
                </div>
            </li>
            <li>
                <img src="images/0004.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">bay</span>
                        <span class="meta-cate">boat</span>
                        <span class="meta-cate">beach</span>
                    </p>
                    <p class="description">choosing a beach in Cape Cod</p>
                </div>
                <div class="rate">
                    <span class="rate-score">3.0</span>
                </div>
            </li>
        </ul>
    </div>
    <div class="footer">
        <p>© Mugglecoding</p>
    </div>
</body>
</html>


 

 

2. CSS file of the page to parse

body {
    padding: 0 0 0 0;
    background-color: #ffffff;
    background-image: url(images/bg3-dark.jpg);
    background-position: top left;
    background-repeat: no-repeat;
    background-size: cover;
    font-family: Helvetica, Arial, sans-serif;
}
.main-content {
    width: 500px;
    padding: 20px 20px 20px 20px;
    border: 1px solid #dddddd;
    border-radius:15px;
    margin: 30px auto 0 auto;
    background: #fdffff;
    -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
    -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
    box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
}
.main-content p {
    line-height: 26px;
}
.main-content h2 {
    color: #585858;
}
.articles {
    list-style-type: none;
    padding: 0;
}
.articles img {
    float: left;
    padding-right: 11px;
}
.articles li {
    border-top: 1px solid #F1F1F1;
    background-color: #ffffff;
    height: 90px;
    clear: both;
}
.articles h3 {
    margin: 0;
}
.articles a {
    color:#585858;
    text-decoration: none;
}
.articles p {
    margin: 0;
}
 
.article-info {
    float: left;
    display: inline-block;
    margin: 8px 0 8px 0;
}
 
.rate {
    float: right;
    display: inline-block;
    margin:35px 20px 35px 20px;
}
 
.rate-score {
    font-size: 18px;
    font-weight: bold;
    color: #585858;
}
 
.rate-score-hot {
 
 
}
 
.meta-info {
}
 
.meta-cate {
    margin: 0 0.1em;
    padding: 0.1em 0.7em;
    color: #fff;
    background: #37a5f0;
    font-size: 20%;
    border-radius: 10px ;
}
 
.description {
    color: #cccccc;
}
 
.nav {
    padding-left: 0;
    margin: 5px 0 20px 0;
    text-align: center;
}
.nav li {
    display: inline;
    padding-right: 10px;
}
.nav li:last-child {
    padding-right: 0;
}
.header {
    padding: 10px 10px 10px 10px;
 
}
 
.header a {
    color: #ffffff;
}
.header img {
    display: block;
    margin: 0 auto 0 auto;
}
.header h1 {
    text-align: center;
}
 
 
 
.footer {
    margin-top: 20px;
}
.footer p {
    color: #aaaaaa;
    text-align: center;
    font-weight: bold;
    font-size: 12px;
    font-style: italic;
    text-transform: uppercase;
}


 

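3. Parsing steps

1) Parse the page with BeautifulSoup

2) Describe where the target elements are located

3) Pull the information out of the tags and load it into a container that is easy to query

4. Parsing the page with BeautifulSoup

1) Scraping code

Standard form: soup = BeautifulSoup(html, 'lxml'). The first argument is the page content (a string or an open file) and the second names the parser. Commonly used parser names are 'html.parser', 'lxml', 'lxml-xml' (also spelled 'xml'), and 'html5lib'.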

from bs4 import BeautifulSoup
with open('F:/Python實戰:四周實現爬蟲系統/作業程式碼/第一週/上課_2/web/new_index.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'lxml')
    print(Soup)

2) Error 1

can't import beautifulsoup

Cause: the BeautifulSoup library is not installed. Fix: in cmd, run:

pip install bs4

3) Error 2

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Cause: the parser is not installed. Fix: in cmd, run:

pip install lxml
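
When a script has to run on a machine where lxml may be missing, one option is to catch the error above and fall back to Python's built-in parser. The sketch below assumes that approach; `make_soup` is a hypothetical helper name, and the two parsers can differ slightly in how they repair malformed HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<ul class="nav"><li><a href="#">Home</a></li></ul>')
print(soup.a.get_text())
```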

4) Scraped output

<html>
<head>
<link href="new_blah.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="header">
<img src="images/blah.png"/>
<ul class="nav">
<li><a href="#">Home</a></li>
<li><a href="#">Site</a></li>
<li><a href="#">Other</a></li>
</ul>
</div>
<div class="main-content">
<h2>Article</h2>
<ul class="articles">
<li>
<img height="91" src="images/0001.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
<p class="meta-info">
<span class="meta-cate">fun</span>
<span class="meta-cate">Wow</span>
</p>
<p class="description">white sands and turquoise waters</p>
</div>
<div class="rate">
<span class="rate-score">4.5</span>
</div>
</li>
<li>
<img height="91" src="images/0002.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">How to get tanned</a></h3>
<p class="meta-info">
<span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
</p>
<p class="description">hot bikini girls on beach</p>
</div>
<div class="rate">
<img height="18" src="images/Fire.png" width="18"/>
<span class="rate-score">5.0</span>
</div>
</li>
<li>
<img height="91" src="images/0003.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
<p class="meta-info">
<span class="meta-cate">sea</span>
</p>
<p class="description">To make the most of your visit</p>
</div>
<div class="rate">
<span class="rate-score">3.5</span>
</div>
</li>
<li>
<img height="91" src="images/0004.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
<p class="meta-info">
<span class="meta-cate">bay</span>
<span class="meta-cate">boat</span>
<span class="meta-cate">beach</span>
</p>
<p class="description">choosing a beach in Cape Cod</p>
</div>
<div class="rate">
<span class="rate-score">3.0</span>
</div>
</li>
</ul>
</div>
<div class="footer">
<p>© Mugglecoding</p>
</div>
</body>
</html>


 

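5. Describing the scrape location

Locations are described with CSS selectors. To get one: select the element -> right-click Inspect -> right-click Copy -> Copy selector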

# Source code
from bs4 import BeautifulSoup
with open('F:/Python實戰:四周實現爬蟲系統/作業程式碼/第一週/上課_2/web/new_index.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'lxml')
    #print(Soup)
    print("Get the first image")
    #images=Soup.select('body > div.main-content > ul > li:nth-child(1) > img')
    # Note: the copied selector above raised an error here; following the error message, use nth-of-type instead
    image1 = Soup.select('body > div.main-content > ul > li:nth-of-type(1) > img')
    print(image1)
    print("Get all images")
    # To match every image, drop the positional part of the selector
    images = Soup.select('body > div.main-content > ul > li > img')
    # Select the other fields the same way
    title=Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    score=Soup.select('body > div.main-content > ul > li > div.rate > span')
    selector=Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    description=Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
    print(images,title,score,selector,description,sep='\n----------------------------------\n')
 


# Printed output
Get the first image
[<img height="91" src="images/0001.jpg" width="100"/>]
Get all images
[<img height="91" src="images/0001.jpg" width="100"/>, <img height="91" src="images/0002.jpg" width="100"/>, <img height="91" src="images/0003.jpg" width="100"/>, <img height="91" src="images/0004.jpg" width="100"/>]
----------------------------------
[<a href="www.sample.com">Sardinia's top 10 beaches</a>, <a href="www.sample.com">How to get tanned</a>, <a href="www.sample.com">How to be an Aussie beach bum</a>, <a href="www.sample.com">Summer's cheat sheet</a>]
----------------------------------
[<span class="rate-score">4.5</span>, <span class="rate-score">5.0</span>, <span class="rate-score">3.5</span>, <span class="rate-score">3.0</span>]
----------------------------------
[<span class="meta-cate">fun</span>, <span class="meta-cate">Wow</span>, <span class="meta-cate">butt</span>, <span class="meta-cate">NSFW</span>, <span class="meta-cate">sea</span>, <span class="meta-cate">bay</span>, <span class="meta-cate">boat</span>, <span class="meta-cate">beach</span>]
----------------------------------
[<p class="description">white sands and turquoise waters</p>, <p class="description">hot bikini girls on beach</p>, <p class="description">To make the most of your visit</p>, <p class="description">choosing a beach in Cape Cod</p>]
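
The difference between a positional selector and a general one can be tried on a tiny inline snippet, without the local file. This is a minimal sketch using the built-in `html.parser`; the two-item list is made-up sample markup that mirrors the structure of the page above.

```python
from bs4 import BeautifulSoup

html = """
<ul class="articles">
  <li><img src="images/0001.jpg"><h3><a href="#">First</a></h3></li>
  <li><img src="images/0002.jpg"><h3><a href="#">Second</a></h3></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# With the positional part, only the first <li> is matched.
first_only = soup.select("ul.articles > li:nth-of-type(1) > img")
# Without it, every <li> is matched.
every_item = soup.select("ul.articles > li > img")

print([img.get("src") for img in first_only])  # ['images/0001.jpg']
print([img.get("src") for img in every_item])  # ['images/0001.jpg', 'images/0002.jpg']
```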


6. Filtering the relevant information

# Print every field of each article
from bs4 import BeautifulSoup
with open('F:/Python實戰:四周實現爬蟲系統/作業程式碼/第一週/上課_2/web/new_index.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
    #selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info ')
    descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
 
for title,image,desc,selec,score in zip(titles,images,descrs,selecs,scores):
    data={
        #'selec': selec.get_text(),
        'selec':list(selec.stripped_strings),# collect the text of every descendant tag
        'title':title.get_text(),
        'image':image.get('src'),
        'desc':desc.get_text(),
        'score':score.get_text()
    }
    print(data)
 


# Printed output
{'selec': ['fun', 'Wow'], 'title': "Sardinia's top 10 beaches", 'image': 'images/0001.jpg', 'desc': 'white sands and turquoise waters', 'score': '4.5'}
{'selec': ['butt', 'NSFW'], 'title': 'How to get tanned', 'image': 'images/0002.jpg', 'desc': 'hot bikini girls on beach', 'score': '5.0'}
{'selec': ['sea'], 'title': 'How to be an Aussie beach bum', 'image': 'images/0003.jpg', 'desc': 'To make the most of your visit', 'score': '3.5'}
{'selec': ['bay', 'boat', 'beach'], 'title': "Summer's cheat sheet", 'image': 'images/0004.jpg', 'desc': 'choosing a beach in Cape Cod', 'score': '3.0'}
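
One caveat with `zip` over separately selected lists: if one article lacks a field, the lists have different lengths and the remaining fields silently pair up with the wrong article. A possible alternative, sketched below on made-up markup, is to select each `<li>` first and then look up its fields inside that item, so a missing field becomes `None` instead of shifting the data.

```python
from bs4 import BeautifulSoup

html = """
<ul class="articles">
  <li>
    <img src="images/0001.jpg">
    <h3><a href="#">With score</a></h3>
    <p class="meta-info"><span class="meta-cate">fun</span><span class="meta-cate">Wow</span></p>
    <span class="rate-score">4.5</span>
  </li>
  <li>
    <img src="images/0002.jpg">
    <h3><a href="#">No score</a></h3>
    <p class="meta-info"><span class="meta-cate">sea</span></p>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for item in soup.select("ul.articles > li"):
    score = item.select_one("span.rate-score")  # None when the item has no score
    records.append({
        "title": item.select_one("h3 > a").get_text(),
        "image": item.select_one("img").get("src"),
        "selec": list(item.select_one("p.meta-info").stripped_strings),
        "score": score.get_text() if score else None,
    })

for record in records:
    print(record)
```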
 


# Print the articles with a score above 3
from bs4 import BeautifulSoup
info=[]
with open('F:/Python實戰:四周實現爬蟲系統/作業程式碼/第一週/上課_2/web/new_index.html','r') as wb_data:
    Soup = BeautifulSoup(wb_data,'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
    #selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info ')
    descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
 
for title,image,desc,selec,score in zip(titles,images,descrs,selecs,scores):
    data={
        #'selec': selec.get_text(),
        'selec':list(selec.stripped_strings),# collect the text of every descendant tag
        'title':title.get_text(),
        'image':image.get('src'),
        'desc':desc.get_text(),
        'score':score.get_text()
 
    }
    info.append(data)
for i in info:
    if float(i['score'])>3:
        print(i['title'],i['score'])
 


# Printed output:
Sardinia's top 10 beaches 4.5
How to get tanned 5.0
How to be an Aussie beach bum 3.5
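
Once the records sit in a list of dictionaries, the same filter can be written as a list comprehension and combined with sorting. A small sketch on the four titles and scores from the output above; the helper name `above` is illustrative.

```python
info = [
    {"title": "Sardinia's top 10 beaches", "score": "4.5"},
    {"title": "How to get tanned", "score": "5.0"},
    {"title": "How to be an Aussie beach bum", "score": "3.5"},
    {"title": "Summer's cheat sheet", "score": "3.0"},
]


def above(records, threshold):
    """Keep records scoring above threshold, highest score first."""
    kept = [r for r in records if float(r["score"]) > threshold]
    return sorted(kept, key=lambda r: float(r["score"]), reverse=True)


top = above(info, 3)
for record in top:
    print(record["title"], record["score"])
```

The scores are scraped as strings, so they are converted with `float` both for the comparison and for the sort key.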


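III. Scraping a Real Web Page

Scraping TripAdvisor with Requests + BeautifulSoup

1. How the server and the local machine exchange data

1) The HTTP protocol

Clicking a page sends a request to the server.

# GET:

GET /page_one.html HTTP/1.1
Host: www.sample.com

Displaying the page: the server answers with a response, which carries a status_code.

To inspect both: right-click -> Inspect -> Network

 

HTTP/1.0: GET, POST, HEAD

HTTP/1.1: GET, POST, HEAD, PUT, OPTIONS, CONNECT, TRACE, DELETE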
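
To see what the request and the status line look like as plain text, the sketch below builds the GET request shown above and splits a sample status line into its parts. Both helper names are illustrative, and no network connection is made.

```python
def build_get_request(path, host):
    """Return the raw text of a minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 200 OK' into its three parts."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


request_text = build_get_request("/page_one.html", "www.sample.com")
print(request_text)

print(parse_status_line("HTTP/1.1 200 OK"))         # ('HTTP/1.1', 200, 'OK')
print(parse_status_line("HTTP/1.1 404 Not Found"))  # ('HTTP/1.1', 404, 'Not Found')
```

The integer in the middle is the value that the requests library exposes as `status_code` on a response.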

2) Code: install the requests library

pip install requests


2. Steps for parsing a real page

1) Send the request with requests

2) Scrape the whole page

from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout = 500)
soup=BeautifulSoup(wb_data.text,'lxml')
print(soup)


3) Describe the location of the target element

# Selector for a single title
from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
titles=soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(4) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
print(titles)


Result:

[<a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|272517" data-tpid="20" data-tpp="Attractions" href="/Attraction_Review-g60763-d272517-Reviews-Conservatory_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">溫室花園</a>]


4) Describe all the target elements: select every image of a characteristic size

# Scrape every image of the characteristic size
from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
imgs=soup.select('img[width="200"]')
print(imgs)


5) Iterate and build dictionaries

# Iterate and build one dictionary per item
from bs4 import BeautifulSoup
import requests
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
imgs=soup.select('img[width="200"]')
titles=soup.select('#taplc_attraction_coverpage_attraction_0 > div > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
for title,img in zip(titles,imgs):
    data={
        'title':title.get_text(),
        'img':img.get('src'),
    }
    print(data)


3. Skipping the login step by passing the login information in the request headers

from bs4 import BeautifulSoup
import requests
import time
 
url_saves = 'http://www.tripadvisor.com/Saves#37685322'
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
urls = ['https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#ATTRACTION_LIST'.format(str(i)) for i in range(30,930,30)]
 
headers = {
    'User-Agent':'',
    'Cookie':''
}
 
 
def get_attractions(url,data=None):
    wb_data = requests.get(url)
    time.sleep(4)
    soup = BeautifulSoup(wb_data.text,'lxml')
    titles    = soup.select('div.property_title > a[target="_blank"]')
    imgs      = soup.select('img[width="160"]')
    cates     = soup.select('div.p13n_reasoning_v2')
 
    if data is None:
        for title,img,cate in zip(titles,imgs,cates):
            data = {
                'title'  :title.get_text(),
                'img'    :img.get('src'),
                'cate'   :list(cate.stripped_strings),
                }
            # print inside the loop so every attraction is shown, not only the last one
            print(data)
 
 
def get_favs(url,data=None):
    wb_data = requests.get(url,headers=headers)
    soup      = BeautifulSoup(wb_data.text,'lxml')
    titles    = soup.select('a.location-name')
    imgs      = soup.select('div.photo > div.sizedThumb > img.photo_image')
    metas = soup.select('span.format_address')
 
    if data == None:
        for title,img,meta in zip(titles,imgs,metas):
            data = {
                'title'  :title.get_text(),
                'img'    :img.get('src'),
                'meta'   :list(meta.stripped_strings)
            }
            print(data)
 
for single_url in urls:
    get_attractions(single_url)
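
The `urls` list in the code above relies on the `oa{}` placeholder being filled with an offset that grows in steps of 30. The sketch below isolates that step so the generated offsets can be checked before any request is sent; `page_urls` is an illustrative helper name.

```python
BASE = ("https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-"
        "New_York_City_New_York.html#ATTRACTION_LIST")


def page_urls(first_offset=30, stop=930, step=30):
    """Build one listing URL per offset in range(first_offset, stop, step)."""
    return [BASE.format(offset) for offset in range(first_offset, stop, step)]


urls = page_urls()
print(len(urls))  # 30
print(urls[0])
print(urls[-1])
```

Because `range` excludes its stop value, the last offset is 900, not 930.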


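4. Anti-scraping

A simple workaround: Inspect -> switch to the mobile view -> parse that version of the page (its protective measures are not very strict).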
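
In code, asking for the mobile version usually comes down to sending a mobile browser's User-Agent header. The sketch below only builds the request with the standard library's `urllib.request.Request` and sends nothing; the User-Agent string is a made-up example, so copy a real one from the browser's device toolbar. The same dictionary could be passed as `headers=` to `requests.get`, as `get_favs` does above.

```python
from urllib.request import Request

# Hypothetical example of a mobile browser User-Agent string.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)


def mobile_request(url):
    """Build (but do not send) a GET request that identifies as a mobile browser."""
    return Request(url, headers={"User-Agent": MOBILE_UA})


req = mobile_request(
    "https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html"
)
# urllib stores header names capitalized as 'User-agent'.
print(req.get_method(), req.get_header("User-agent"))
```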

 

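IV. Fetching Dynamic Data: Asynchronous Loading

1. Asynchronous loading

New content keeps loading while the page itself never changes.

JavaScript keeps fetching more data: the data is not part of the initial page and arrives in batches.

2. Finding the asynchronous data

Inspect -> Network -> XHR

Name: each newly successful request appears with its page number -> this gives the dynamic request URL (page=x)

Response: a group of div tags comes back, links included

3. Code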

from bs4 import BeautifulSoup
import requests
import time
url = 'https://knewone.com/discover?page='
def get_page(url,data=None):
 
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text,'lxml')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')
 
    if data==None:
        for img,title,link in zip(imgs,titles,links):
            data = {
                'img':img.get('src'),
                'title':title.get('title'),
                'link':link.get('href')
            }
            print(data)
 
# Function that controls which page numbers are fetched
def get_more_pages(start,end):
    for one in range(start,end):
        get_page(url+str(one))
        time.sleep(2)
 
get_more_pages(1,10)
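
A fixed page range works when the number of pages is known. When it is not, one possible variation is to stop as soon as a page comes back empty. The sketch below assumes that design and takes the fetching function as a parameter; `fake_fetch` is a hypothetical stand-in for a real request so the loop can be run offline.

```python
import time


def crawl_until_empty(fetch_page, start=1, max_pages=10, delay=0.0):
    """Call fetch_page(page) for consecutive pages until one returns no items."""
    collected = []
    for page in range(start, start + max_pages):
        items = fetch_page(page)
        if not items:
            break  # an empty page is taken to mean there is no more data
        collected.extend(items)
        time.sleep(delay)  # pause between requests, like time.sleep(2) above
    return collected


def fake_fetch(page):
    # Hypothetical stand-in: pretend only pages 1 to 3 contain data.
    return [f"item-{page}-{n}" for n in range(2)] if page <= 3 else []


results = crawl_until_empty(fake_fetch)
print(results)
```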
 


V. Homework: Scraping Product Listings

from bs4 import BeautifulSoup
import requests
import time
 
url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'
 
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text,'lxml')
 
def get_links_from(who_sells):
    urls = []
    list_view = 'http://bj.58.com/pbdn/{}/pn2/'.format(str(who_sells))
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text,'lxml')
    for link in soup.select('td.t a.t'):
        urls.append(link.get('href').split('?')[0])
    return urls
 
 
def get_views_from(url):
    id = url.split('/')[-1].strip('x.shtml')
    api = 'http://jst1.58.com/counter?infoid={}'.format(id)
    # This is 58's view-count query endpoint; if such APIs are unfamiliar, the introduction to the Sina Weibo API is a useful reference
    js = requests.get(api)
    views = js.text.split('=')[-1]
    return views
    # print(views)
 
 
def get_item_info(who_sells=0):
 
    urls = get_links_from(who_sells)
    for url in urls:
 
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text,'lxml')
        data = {
            'title':soup.title.text,
            'price':soup.select('.price')[0].text,
            'area' :list(soup.select('.c_25d')[0].stripped_strings) if soup.find_all('span','c_25d') else None,
            'date' :soup.select('.time')[0].text,
            'cate' :'individual' if who_sells == 0 else 'merchant',
            # 'views':get_views_from(url)
        }
        print(data)
 
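# get_item_info(url)
 
# get_links_from(1)
 
# who_sells selects the listing type (0 = individual sellers, anything else = merchants); passing the item url here would build a broken list URL
get_item_info(0)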
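
A note on `strip('x.shtml')` in `get_views_from`: `str.strip` removes any of the listed characters from both ends, not the literal suffix. It happens to work for an all-digit id, but a regular expression states the intent more directly. A minimal sketch, with `extract_info_id` as an illustrative name:

```python
import re


def extract_info_id(url):
    """Return the numeric id in front of 'x.shtml', or None if there is none."""
    match = re.search(r"/(\d+)x\.shtml", url)
    return match.group(1) if match else None


print(extract_info_id("http://bj.58.com/pingbandiannao/24604629984324x.shtml"))
# 24604629984324

# strip() treats its argument as a set of characters, so it can eat too much:
print("lists.shtml".strip("x.shtml"))  # i
```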