
How to Scrape Twitter Data (Part 1)

EDIT – Since I wrote this post, Twitter has updated how you get the next list of tweets for your results. Rather than using scroll_cursor, it now uses max_position. I’ve written about this in a bit more detail here.

In fairly recent news, Twitter has started indexing its entire history of Tweets going all the way back to 2006. Hurrah for data scientists! However, even with this news (at the time of writing), their search API is still restricted to the past seven days of Tweets. While I doubt this will be the case permanently, as a useful exercise this post presents how we can search for Tweets from Twitter without necessarily using their API. Besides the indexing, there is also the advantage that Twitter is a little more liberal with rate limits, and you don’t require any authentication keys.

The post is split into two parts: this first part looks at what we can extract from Twitter and how we might go about it, and the second is a tutorial on how we can implement this in Java.

Right, to begin, let’s say we want to search Twitter for all tweets related to the query “Babylon 5”. You can access Twitter’s advanced search without being logged in:

https://twitter.com/search-advanced

If we take a look at the URL that’s constructed when we perform the search we get:

https://twitter.com/search?q=Babylon%205&src=typd

As we can see, there are two query parameters: q (our query, URL-encoded) and src (assumed to be the source of the query, i.e. typed). However, by default Twitter returns top results rather than all, so on the displayed page, if you click on All, the URL changes to:

https://twitter.com/search?f=realtime&q=Babylon%205&src=typd

The difference here is the f=realtime parameter, which appears to specify that we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we’re only getting the first 25 Tweets back. If we scroll down though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that every time we reach the bottom of the page, Twitter makes an AJAX call to a URL similar to:

https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features=1&include_entities=1&last_note_ts=85&scroll_cursor=TWEET-553069642609344512-553159310448918528-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

On further inspection, we see that the response is JSON, which is very useful! Before we look at the response though, let’s have a look at that URL and some of its parameters.

First off, it’s slightly different to the default search URL. The path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src from before, there are several additional ones. The most important new one, though, is scroll_cursor. This is what Twitter uses to paginate the results. If you remove scroll_cursor from that URL, you end up with your first page of results again.
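To make this concrete, here is a minimal sketch in Java (the language part two will use) of how we might assemble that timeline URL ourselves. The buildTimelineUrl helper is purely our own illustration, not anything Twitter provides:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class TwitterSearchUrl {

    // Builds the AJAX timeline URL observed above. Pass null for
    // scrollCursor to request the first page of results.
    public static String buildTimelineUrl(String query, String scrollCursor)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder("https://twitter.com/i/search/timeline");
        url.append("?f=realtime&src=typd");
        // URLEncoder uses '+' for spaces, the standard query-string encoding
        url.append("&q=").append(URLEncoder.encode(query, "UTF-8"));
        if (scrollCursor != null) {
            url.append("&scroll_cursor=").append(URLEncoder.encode(scrollCursor, "UTF-8"));
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // First page: no scroll_cursor parameter at all
        System.out.println(buildTimelineUrl("Babylon 5", null));
    }
}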

Now let’s take a look at the JSON response that Twitter provides:

{
    has_more_items: boolean,
    items_html: "...",
    is_scrolling_request: boolean,
    is_refresh_request: boolean,
    scroll_cursor: "...",
    refresh_cursor: "...",
    focused_refresh_interval: int
}

Again, not all of these fields are important for this post; the ones that are include has_more_items, items_html, and scroll_cursor.

has_more_items – A boolean that lets you know whether or not there are any more results after this page.

items_html – This is where all the tweets are: the block of HTML that Twitter appends to the bottom of its timeline. It requires parsing, but there is a good amount of information in there to be extracted, which we will look at in a minute.

scroll_cursor – A pagination value that allows us to extract the next page of results.

Remember our scroll_cursor parameter from earlier on? Well, for each search request you make to Twitter, the value of this key in the response provides you with the next set of tweets, allowing you to call Twitter repeatedly until either has_more_items is false, or the scroll_cursor you get back equals the one you sent.
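A rough sketch of that loop in Java might look like the following. It reuses the buildTimelineUrl helper from earlier and assumes the org.json library for parsing; the fetch helper is a minimal stand-in, and any HTTP client and JSON parser would do:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;
import org.json.JSONObject;

public class TweetPager {

    // Minimal HTTP GET that returns the response body as a String.
    static String fetch(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // A browser-like User-Agent is safer than Java's default
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            return in.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws Exception {
        String scrollCursor = null;
        boolean hasMoreItems = true;

        while (hasMoreItems) {
            String body = fetch(TwitterSearchUrl.buildTimelineUrl("Babylon 5", scrollCursor));
            JSONObject response = new JSONObject(body);

            String itemsHtml = response.getString("items_html");
            // ... hand itemsHtml to an HTML parser here (see below) ...

            String nextCursor = response.getString("scroll_cursor");
            // Stop when Twitter reports no more items, or the cursor stops advancing
            hasMoreItems = response.getBoolean("has_more_items")
                    && !nextCursor.equals(scrollCursor);
            scrollCursor = nextCursor;
        }
    }
}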

Now that we know how to access Twitter’s own search functionality, let’s turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets live. However, it comes as a block of HTML, since Twitter injects that block at the bottom of the page each time the call is made. The HTML inside is a list of li elements, each element a Tweet. I won’t post the HTML for one here, as even a single tweet has a lot of HTML in it, but if you want to look at it, copy the items_html value (omitting the quotes around the HTML content) and paste it into something like JSBeautifier to see the formatted results for yourself.
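As a small illustration of that first parsing step, here is a sketch using the Jsoup library to split items_html into its individual li elements (Jsoup is just one convenient choice of HTML parser):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TweetHtmlSplitter {

    public static void main(String[] args) {
        String itemsHtml = "<li>...</li><li>...</li>"; // the items_html value from the JSON

        // parseBodyFragment treats the string as an HTML fragment rather than a full page
        Document doc = Jsoup.parseBodyFragment(itemsHtml);

        // Each top-level li in the fragment is one tweet
        for (Element tweet : doc.select("body > li")) {
            System.out.println(tweet.outerHtml());
        }
    }
}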

If we look over the HTML, aside from the tweet’s text, there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet ID itself, which you’ll find in the root li element. Now, we could stop here, as with that ID you can query Twitter’s official API and, if it’s a public Tweet, get all kinds of information. However, that would defeat the purpose of not using the API, so let’s see what we can extract from what we already have.

The table below shows various CSS selector queries that you can use to extract this information.

Embedded Tweet Data

Selector – Value
div.original-tweet[data-tweet-id] – The ID of the Tweet
div.original-tweet[data-name] – The name of the author
div.original-tweet[data-user-id] – The user ID of the author
span._timestamp[data-time] – Timestamp of the post (seconds)
span._timestamp[data-time-ms] – Timestamp of the post (milliseconds)
p.tweet-text – Text of the Tweet
span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] – Number of Retweets
span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] – Number of Favourites

That’s quite a sizeable amount of information in that HTML. From looking through it, we can extract a bunch of details about the author, the timestamp of the tweet, the text, and the number of retweets and favourites.
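Putting the table to work, a sketch of the extraction with Jsoup might look like this. The class and attribute names come from Twitter’s markup at the time of writing, so treat them as liable to change:

import org.jsoup.nodes.Element;

public class TweetExtractor {

    // tweet is one li element from the items_html fragment, as split out above
    static void printTweet(Element tweet) {
        Element container = tweet.select("div.original-tweet").first();
        System.out.println("Tweet id:   " + container.attr("data-tweet-id"));
        System.out.println("Name:       " + container.attr("data-name"));
        System.out.println("User id:    " + container.attr("data-user-id"));

        System.out.println("Timestamp:  " + tweet.select("span._timestamp").attr("data-time"));
        System.out.println("Text:       " + tweet.select("p.tweet-text").text());

        System.out.println("Retweets:   " + tweet
                .select("span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount")
                .attr("data-tweet-stat-count"));
        System.out.println("Favourites: " + tweet
                .select("span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount")
                .attr("data-tweet-stat-count"));
    }
}

Calling printTweet on each li element produced by the splitter above would print every field the table lists.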

What have we learned here? To summarize, we know how to construct a Twitter search URL, what response we get back from that query, and what information we can extract from that response. The second part of this tutorial (to follow shortly) will introduce some code showing how we can implement the above.
