Python requests and bs4 practice

阿新 • Published: 2018-11-03

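Douban review analysis:
# 1). Fetch the first 10 pages of reviews for every movie currently showing on Douban;
# 2). Clean the data;
# 3). Build a word cloud from each movie's reviews and save it as a PNG image named after the movie: <movie title>.png;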
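
Step 3 names each image after the movie title. Titles can contain characters such as `/` or `:` that are not valid in file names on some systems, so a small helper can make the name safe before saving. This is a minimal sketch; `safe_filename` and the sample title are hypothetical and not part of the original script.

```python
import re


def safe_filename(title, replacement='_'):
    """Replace characters that are invalid in file names on common systems."""
    # Backslash, slash, colon, asterisk, question mark, quote, angle brackets, pipe.
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or 'untitled'


# Hypothetical title used only for illustration.
print(safe_filename('Movie: Part 1/2') + '.png')
```

With such a helper, the file name would be built from `safe_filename(title)` rather than from the raw title.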

import re

import jieba
import requests
import wordcloud
from bs4 import BeautifulSoup

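def getpagecomments(movie_id, pageNum):
    """Return the short comments on one page of a movie's reviews as one string."""
    # The URL asks for 20 comments per page, so page n starts at offset (n-1)*20.
    start = (pageNum - 1) * 20
    url = "https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P" % (movie_id, start)
    content = requests.get(url).text
    soup = BeautifulSoup(content, 'html5lib')
    # Each short comment sits in a <span class="short"> element.
    commentsList = soup.find_all('span', class_='short')
    return "".join(commentTag.text for commentTag in commentsList)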

def getID():
    """Return a list of {'title': ..., 'id': ...} dicts for the movies now playing."""
    url = 'https://movie.douban.com/cinema/nowplaying/xian/'
    response = requests.get(url)
    content = response.text
    soup = BeautifulSoup(content, 'html5lib')
    # Each movie is an <li class="list-item"> that carries its id and title as attributes.
    nowplaying_movie_list = soup.find_all('li', class_='list-item')
    movies_info = []
    for item in nowplaying_movie_list:
        movies_info.append({'title': item['data-title'], 'id': item['id']})
    return movies_info

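movies_info = getID()
for movie in movies_info:
    movie_id = movie['id']
    comments = ""
    # Only the first 3 pages are fetched here; use range(1, 11) to cover the 10 pages in the task.
    for pageNum in range(1, 4):
        comments += getpagecomments(movie_id, pageNum)
    # Cleaning: keep only runs of Chinese characters or Latin letters.
    pattern = re.compile(r'([\u4e00-\u9fa5]+|[a-zA-Z]+)')
    deal_comments = re.findall(pattern, comments)
    newComments = ''.join(deal_comments)
    # Segment the cleaned text into words.
    result = jieba.lcut(newComments)
    print("Segmentation result:", result)
    wc = wordcloud.WordCloud(
        background_color='snow',
        font_path='./font/msyh.ttf',  # a font with Chinese glyphs is needed to render Chinese words
        min_font_size=5,
        max_font_size=55,
        width=200,
    )
    wc.generate(",".join(result))
    # Note: the image is written into the ./font directory, next to the font file.
    wc.to_file('./font/%s.png' % movie['title'])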
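
The cleaning step can be tried on its own, without any network access. The sketch below applies the same character ranges as the script to a made-up sample string. The `clean` helper and its `sep` parameter are assumptions added for illustration.

```python
import re

# Same character ranges as the script: CJK ideographs or Latin letters.
PATTERN = re.compile(r'[\u4e00-\u9fa5]+|[a-zA-Z]+')


def clean(text, sep=''):
    """Keep runs of Chinese characters or Latin letters and join them with `sep`."""
    return sep.join(PATTERN.findall(text))


# Made-up sample comment containing punctuation, digits and an emoji.
sample = "好看!!! 10/10, would watch again 😀"
print(clean(sample))           # joined with no separator, as the script does
print(clean(sample, sep=' '))  # joined with spaces, which keeps Latin words apart
```

Joining with an empty string, as the script does, runs adjacent English words together before segmentation. A space separator may give cleaner tokens when the reviews mix Chinese and English.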

Scrape the titles and descriptions of all Python courses on imooc, then analyse and display them as a word cloud;
- URL: https://www.imooc.com/search/course?words=python

import re

import jieba
import requests
import wordcloud
from bs4 import BeautifulSoup
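
def getclassinform(page):
    """Return the course titles and descriptions on one search-result page as one string."""
    url = 'https://www.imooc.com/search/course?words=python&page=%d' % page
    content = requests.get(url).text
    soup = BeautifulSoup(content, 'html5lib')
    # Course titles are links with class "course-detail-title".
    commentList = soup.find_all('a', attrs={'class': "course-detail-title"})
    # Each description is the first <p> inside a "course-item" block.
    commentListIntroduce = soup.find_all('div', attrs={'class': "course-item"})
    comments = ""
    for comment in commentList:
        comments += comment.text
    for comment in commentListIntroduce:
        a = comment.find('p')
        if a is not None:  # skip items that have no description paragraph
            comments += a.text
    return comments
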
comments = ''
# Only the first 2 result pages are fetched here; widen the range to cover more courses.
for page in range(1, 3):
    comments += getclassinform(page)
# Cleaning: keep only runs of Chinese characters or Latin letters.
pattern = re.compile(r'([\u4e00-\u9fa5]+|[a-zA-Z]+)')
deal_comments = re.findall(pattern, comments)
newComments = ''.join(deal_comments)
# Segment the cleaned text into words.
result = jieba.lcut(newComments)
print("Segmentation result:", result)
wc = wordcloud.WordCloud(
    background_color='snow',
    font_path='./font/msyh.ttf',  # a font with Chinese glyphs is needed to render Chinese words
    min_font_size=5,
    max_font_size=55,
    width=300,
)
wc.generate(",".join(result))
wc.to_file('python.png')
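
Before drawing the cloud, it can help to inspect which tokens dominate the segmentation result. The sketch below counts tokens with `collections.Counter`. The stop-word set, the `top_words` helper and the sample token list are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical stop-word set for illustration; not part of the original script.
STOP_WORDS = {'的', '了', '和', 'and', 'the'}


def top_words(tokens, n=5, stop_words=STOP_WORDS):
    """Count tokens, skipping stop words and single-character tokens."""
    kept = [t for t in tokens if len(t) > 1 and t not in stop_words]
    return Counter(kept).most_common(n)


# Made-up token list standing in for the output of jieba.lcut.
tokens = ['Python', '入門', '的', 'Python', '爬蟲', '和', '入門', 'Python']
print(top_words(tokens, n=2))
```

A word-to-count mapping built this way could also be passed to `WordCloud.generate_from_frequencies` instead of a joined string, which gives direct control over what ends up in the image.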


Scrape today's top 10 trending news items on Baidu with Python;

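import requests
from bs4 import BeautifulSoup

def getnews():
    url = 'http://top.baidu.com/buzz?b=1'
    # Pass the raw bytes so BeautifulSoup can detect the page encoding itself.
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    news = soup.find_all('a', class_='list-title')
    # Keep only the first 10 entries, since the task asks for the top 10.
    for new in news[:10]:
        print(new.text)

getnews()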
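
Link text taken from a page often carries stray whitespace or repeated entries. Here is a minimal sketch of tidying a list of headline strings before printing the top 10. The `clean_titles` helper and the sample data are hypothetical.

```python
def clean_titles(raw_titles, limit=10):
    """Strip whitespace, drop blanks and duplicates, and keep the first `limit` titles."""
    seen = set()
    titles = []
    for raw in raw_titles:
        title = raw.strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles[:limit]


# Made-up headline strings standing in for the text of the scraped links.
raw = ['  Headline A\n', 'Headline B', '', 'Headline A', ' Headline C ']
for rank, title in enumerate(clean_titles(raw), start=1):
    print(rank, title)
```

The order of first appearance is preserved, so the ranking on the page carries over to the cleaned list.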

