Python: fetching website information
阿新 · Published: 2018-05-12
Python web-scraping study notes. The first script below pulls the song list from the Kugou homepage with urllib2 and BeautifulSoup; the shoufu() function that follows collects, classifies, and saves the links on the Sohu homepage with requests.
#coding:utf-8
import urllib2
import os
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from bs4 import BeautifulSoup

heads = {}
heads['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
request = urllib2.Request("http://www.kugou.com", headers=heads)  # build a GET request for the Kugou homepage
result = urllib2.urlopen(request)  # send the request
soup = BeautifulSoup(result.read(), 'html.parser')  # parse the response into a searchable object
for i in soup.find_all("div"):  # iterate over every div tag
    if i.get("id") == "SongtabContent":  # look for the div whose id is SongtabContent
        s = i.find_all("li")  # assign all of its li tags to s
with open(u"C:/downloads/lw/a.txt", "w") as f:  # open the output file for writing
    for i in s:  # iterate over the li tags
        f.write(u"Song name: %s " % i.a.select(".songName")[0].text)  # text of the element with class songName
        f.write(u"Song link: %s " % i.a.get("href"))  # value of the href attribute
        f.write(u"Song duration: %s" % i.a.select(".songTime")[0].text)  # text of the element with class songTime
        f.write(os.linesep)
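
The script above runs only on Python 2 (urllib2, print statements, the reload(sys) encoding hack). As a minimal sketch of the same scrape on Python 3, assuming the Kugou homepage still has the div#SongtabContent structure used above:

#coding:utf-8
# Hypothetical Python 3 port of the scrape above; assumes the page still
# contains a div with id "SongtabContent" whose li/a elements carry
# .songName and .songTime children.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

heads = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
request = Request("http://www.kugou.com", headers=heads)
soup = BeautifulSoup(urlopen(request).read(), 'html.parser')
tab = soup.find("div", id="SongtabContent")  # find() replaces the manual div loop
with open("a.txt", "w", encoding="utf-8") as f:  # open() handles the encoding directly
    for li in tab.find_all("li"):
        f.write("Song name: %s " % li.a.select(".songName")[0].text)
        f.write("Song link: %s " % li.a.get("href"))
        f.write("Song duration: %s\n" % li.a.select(".songTime")[0].text)

On Python 3 the reload(sys)/setdefaultencoding hack is unnecessary (and unavailable); passing encoding= to open() covers the same need.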
def shoufu():
    import requests
    import re
    resq = requests.get("http://www.sohu.com")  # request the Sohu homepage
    print resq.text[:100]  # print the first 100 characters of the response
    links = re.findall(r'href="(.*?)"', resq.text)  # extract every href value
    print len(links)
    valid_link = []  # usable links
    invalid_link = []  # discarded links
    for link in links:
        if re.search(r"(\.jpg)|(\.jpeg)|(\.gif)|(\.ico)|(\.png)|(\.js)|(\.css)$", link.strip()):  # filter out static-resource links
            print 6, link
            invalid_link.append(link.strip())
            continue  # move straight on to the next link
        elif link.strip() == "" or link.strip() == "#" or link.strip() == "/":  # drop empty and placeholder values
            # print 1, link
            invalid_link.append(link)
            continue
        elif link.strip().startswith("//"):  # protocol-relative links: prepend the scheme and keep them
            # print 2, link
            valid_link.append("http:" + link.strip())
            continue
        elif link.strip().count("javascript") >= 1 or link.strip().count("mailto:") >= 1:  # drop javascript: and mailto: links
            # print 3, link
            invalid_link.append(link.strip())
            continue
        elif re.match(r"/\w+", link):  # root-relative paths: join them to the site base
            # print 5, link
            if re.match(r"http://.*?/", resq.url.strip()):  # base URL has a slash after the host
                valid_link.append(re.match(r"http://.*?/", resq.url.strip()).group() + link.strip())  # keep the base up to the first slash
            else:
                valid_link.append(re.match(r"http://.*", resq.url.strip()).group() + link.strip())  # keep the whole base URL
            continue
        else:
            # print 7, link
            valid_link.append(link.strip())  # everything left is kept as a valid absolute link
    # for link in valid_link[:100]:
    #     print link
    print len(valid_link)
    # for link in invalid_link:
    #     print link
    print len(invalid_link)
    file_num = 1  # counter used to name the saved files
    for link in list(set(valid_link)):
        # print link
        resq = requests.get(link, verify=True)  # fetch each saved link with certificate verification enabled
        if u"籃球" in resq.text:  # keep only pages that mention "籃球" (basketball)
            print link
            if u'meta charset="utf-8"' in resq.text:  # page declares utf-8 encoding
                with open("c:\\downloads\\lw\\" + str(file_num) + ".html", "w") as fp:
                    fp.write(resq.text.strip().encode("utf-8"))  # save the page as utf-8
            else:
                with open("c:\\downloads\\lw\\" + str(file_num) + ".html", "w") as fp:
                    fp.write(resq.text.strip().encode("gbk"))  # otherwise save it as gbk
            file_num += 1
    print "Done!"
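
The branch chain in shoufu() resolves relative links by hand. As a sketch (not from the original post), the standard library's urljoin collapses the protocol-relative, root-relative, and absolute cases into one call; classify, base_url, and links are hypothetical names, and the import is the Python 3 path (urlparse.urljoin on Python 2):

import re
from urllib.parse import urljoin

def classify(base_url, links):
    # Hypothetical helper: same filtering as the loop above, with urljoin
    # doing the URL resolution.
    valid, invalid = [], []
    for link in (l.strip() for l in links):
        if link in ("", "#", "/") or "javascript" in link or "mailto:" in link:
            invalid.append(link)  # placeholders, javascript: and mailto: links
        elif re.search(r"\.(jpg|jpeg|gif|ico|png|js|css)$", link):
            invalid.append(link)  # static resources
        else:
            valid.append(urljoin(base_url, link))  # resolves //host/..., /path and absolute links alike
    return valid, invalid

Called as classify(resq.url, links), it yields the same two lists the loop above builds, with the base URL taken from the response instead of being re-matched with regexes.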