Methods for Scraping Twitter Data (Part 1)
EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your results. Rather than using scroll_cursor, it uses max_position. I’ve written about it in a bit more detail here.
In fairly recent news, Twitter has started indexing its entire history of Tweets, going all the way back to 2006. Hurrah for data scientists! However, even with this news (at time of writing), their search API is still restricted to the past seven days of Tweets. While I doubt this will remain the case permanently, as a useful exercise this post presents how we can search for Tweets on Twitter without using their API at all. Besides access to the full index, this has the advantage that Twitter is a little more liberal with rate limits, and you don’t require any authentication keys.
The post is split into two parts: this first part looks at what we can extract from Twitter and how we might go about it, and the second is a tutorial on how we can implement this in Java.
Right, to begin, let’s say we want to search Twitter for all tweets related to the query “Babylon 5”. You can access Twitter’s advanced search without being logged in.
If we take a look at the URL that’s constructed when we perform the search, we get:
https://twitter.com/search?q=Babylon%205&src=typd
As we can see, there are two query parameters: q (our query, encoded) and src (presumably the source of the query, i.e. typed). However, by default Twitter returns top results rather than all of them, so if you click on “All” on the displayed page, the URL changes to:
https://twitter.com/search?f=realtime&q=Babylon%205&src=typd
The difference here is the f=realtime parameter, which appears to specify that we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we’re only getting the first 25 Tweets back. If we scroll down, though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that every time we reach the bottom of the page, Twitter makes an AJAX call to a URL similar to:
https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features=1&include_entities=1&last_note_ts=85&scroll_cursor=TWEET-553069642609344512-553159310448918528-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
On further inspection, we see that the response is JSON, which is very useful! Before we look at the response though, let’s have a look at that URL and some of its parameters.
First off, it’s slightly different to the default search URL: the path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src from before, there are several additional ones. The most important new one is scroll_cursor. This is what Twitter uses to paginate the results; if you remove scroll_cursor from that URL, you end up with your first page of results again.
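To make the URL construction concrete, here is a minimal sketch in Java (the language part two will use). The class and method names are my own; only the endpoint and parameters are taken from the URL above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {

    private static final String BASE = "https://twitter.com/i/search/timeline";

    /**
     * Builds the timeline search URL seen above, optionally appending a
     * scroll_cursor to request the next page. Pass null for the first page.
     */
    public static String build(String query, String scrollCursor)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(BASE)
                .append("?f=realtime&src=typd")
                .append("&include_available_features=1&include_entities=1")
                // URLEncoder encodes the space as "+", which the search
                // endpoint should accept just like %20.
                .append("&q=").append(URLEncoder.encode(query, "UTF-8"));
        if (scrollCursor != null) {
            url.append("&scroll_cursor=").append(scrollCursor);
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // First page of results: no scroll_cursor parameter.
        System.out.println(build("Babylon 5", null));
    }
}
```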
Now let’s take a look at the JSON response that Twitter provides:
{
    has_more_items: boolean,
    items_html: "...",
    is_scrolling_request: boolean,
    is_refresh_request: boolean,
    scroll_cursor: "...",
    refresh_cursor: "...",
    focused_refresh_interval: int
}
Again, not all of the fields are important for this post, but the ones worth noting are: has_more_items, items_html, and scroll_cursor.
has_more_items – A boolean that lets you know whether or not there are any more results after this query.
items_html – The block of HTML containing the tweets, which Twitter appends to the bottom of its timeline. It requires parsing, but there is a good amount of information in there to extract, which we will look at in a minute.
scroll_cursor – A pagination value that allows us to extract the next page of results.
Remember the scroll_cursor parameter from earlier on? Well, for each search request you make to Twitter, the value of this key in the response provides you with the next set of tweets, allowing you to call Twitter repeatedly until either has_more_items is false or the scroll_cursor returned equals the one you sent.
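To make that loop concrete, here is a rough sketch of the pagination logic, reusing the hypothetical SearchUrlBuilder from earlier. The fetchJson helper is deliberately left abstract, and the org.json library is just one possible choice for parsing the response; neither is prescribed by Twitter:

```java
import org.json.JSONObject;

public class TimelinePager {

    /** Fetches the raw JSON body for a URL; use any HTTP client you like. */
    static String fetchJson(String url) {
        throw new UnsupportedOperationException("HTTP client left to the reader");
    }

    public static void paginate(String query) throws Exception {
        String cursor = null;
        while (true) {
            JSONObject response =
                    new JSONObject(fetchJson(SearchUrlBuilder.build(query, cursor)));

            String itemsHtml = response.getString("items_html");
            // ... parse the tweets out of itemsHtml here (see below) ...

            String nextCursor = response.getString("scroll_cursor");

            // Stop when Twitter reports no more items, or when the cursor
            // stops advancing (the two termination conditions noted above).
            if (!response.getBoolean("has_more_items")
                    || nextCursor.equals(cursor)) {
                break;
            }
            cursor = nextCursor;
        }
    }
}
```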
Now that we know how to access Twitter’s own search functionality, let’s turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets live. However, it comes as one block of HTML, since Twitter injects that block at the bottom of the page each time the call is made. The HTML inside is a list of li elements, each element a Tweet. I won’t post the HTML for one here, as even a single tweet contains a lot of markup, but if you want to look at it, copy the items_html value (omitting the quotes around the HTML content) and paste it into something like JSBeautifier to see the formatted results for yourself.
If we look over the HTML, aside from the tweet’s text there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet ID itself, which, if you check, is actually on the root li element. Now, we could stop here, as with that ID you can query Twitter’s official API, and if it’s a public Tweet, you can get all kinds of information. However, that’d defeat the purpose of not using the API, so let’s see what we can extract from what we already have.
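As a quick sketch of splitting out those li elements and reading the ID, here is how it might look with jsoup, an HTML parser that supports CSS selectors. Both the library choice and the data-item-id attribute name are my own reading of the markup at the time, not anything Twitter documents:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TweetSplitter {

    /** Splits an items_html fragment into tweets and prints each Tweet ID. */
    public static void printTweetIds(String itemsHtml) {
        // items_html is a fragment rather than a full page, so parse it as one.
        Document doc = Jsoup.parseBodyFragment(itemsHtml);

        // Each li element in the fragment is one Tweet; the ID sits on the
        // root li element as a data attribute (data-item-id, at the time).
        for (Element li : doc.select("li[data-item-id]")) {
            System.out.println(li.attr("data-item-id"));
        }
    }
}
```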
The table below shows various CSS selector queries that you can use to extract the information.

| Selector | Value |
| --- | --- |
| div.original-tweet[data-tweet-id] | The ID of the Tweet |
| div.original-tweet[data-screen-name] | The author’s Twitter handle |
| div.original-tweet[data-name] | The name of the author |
| div.original-tweet[data-user-id] | The user ID of the author |
| span._timestamp[data-time] | Timestamp of the post |
| span._timestamp[data-time-ms] | Timestamp of the post in ms |
| p.tweet-text | Text of the Tweet |
| span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Retweets |
| span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of Favourites |
That’s quite a sizeable amount of information in that HTML. From looking through, we can extract a bunch of details about the author, the timestamp of the tweet, the text, and the number of retweets and favourites.
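As a sketch of applying those selectors (again with jsoup, and assuming the tweet’s li element from earlier), the code below pulls each field out of a single tweet. Note that data-screen-name for the handle is my assumption about the markup; the remaining attribute names come straight from the table:

```java
import org.jsoup.nodes.Element;

public class TweetExtractor {

    /** Pulls the fields from the table above out of one tweet's li element. */
    public static void extract(Element li) {
        Element tweet = li.select("div.original-tweet").first();

        String tweetId = tweet.attr("data-tweet-id");
        String handle = tweet.attr("data-screen-name"); // assumed attribute
        String name = tweet.attr("data-name");
        String userId = tweet.attr("data-user-id");

        String text = li.select("p.tweet-text").first().text();
        String timestamp = li.select("span._timestamp").first().attr("data-time");

        // The action classes use a doubled hyphen (BEM-style naming).
        String retweets = li.select(
                "span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
                .first().attr("data-tweet-stat-count");
        String favourites = li.select(
                "span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
                .first().attr("data-tweet-stat-count");

        System.out.printf("%s @%s (%s, id %s) at %s: %s [%s RT / %s fav]%n",
                tweetId, handle, name, userId, timestamp, text,
                retweets, favourites);
    }
}
```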
What have we learned here? Well, to summarize, we know how to construct a Twitter search URL, the shape of the response we get back from it, and the information we can extract from that response. The second part of this tutorial (to follow shortly) will introduce some code showing how we can implement the above.