
Methods for Scraping Twitter Data (Part 3)

Sorry for my delayed response to this as I’ve seen several comments on this topic, but I’ve been pretty busy with some other stuff recently, and this is the first chance I’ve had to address this!

As with most web scraping, at some point a provider will change their source code and scrapers will break. This is exactly what Twitter has done with their recent site redesign. Having gone over the changes, there are two that affect this scraping script.

The first change is tiny. Originally, to get all tweets rather than just the “top tweets”, we passed the type_param “f” with the value “realtime”. That value has now changed to simply “tweets”.
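As a quick illustration, a search request with the updated parameter looks something like this (the query value here is just a placeholder):

https://twitter.com/i/search/timeline?q=your+search+terms&f=tweets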

The second change is a bit trickier to handle, as the scroll_cursor parameter no longer exists. Instead, if we look at the AJAX call that Twitter makes on its infinite scroll, we see a different parameter:

max_position:TWEET-399159003478908931-606844263347945472-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The highlighted parameter there, “max_position”, looks very similar to the original scroll_cursor parameter. However, unlike the scroll_cursor, which existed in the response ready to be extracted, we have to create this one ourselves.

As can be seen from the example, we have “TWEET” followed by two sets of numbers, and what appears to be “BD1UO2FFu9” screaming and falling off a cliff. The good news is, we actually only need the first three components.

“TWEET” will always stay the same, but the two sets of numbers are actually tweet IDs, representing the oldest and the most recently created tweets you’ve extracted.

For our newest tweet (the second set of numbers), we only need to extract this once, as we can keep it the same for all subsequent calls, just as Twitter does.

For the oldest tweet (the first set of numbers), we need to take the ID of the last tweet in our results on each call and use it to update our max_position value.

So, let’s take a look at some of the code I’ve changed:

String minTweet = null;
while((response = executeSearch(url))!=null && continueSearch && !response.getTweets().isEmpty()) {
    // Capture the newest tweet ID once, on the first page of results
    if(minTweet==null) {
        minTweet = response.getTweets().get(0).getId();
    }
    continueSearch = saveTweets(response.getTweets());
    // The oldest tweet in this batch becomes the new "max" position
    String maxTweet = response.getTweets().get(response.getTweets().size()-1).getId();
    if(!minTweet.equals(maxTweet)) {
        try {
            Thread.sleep(rateDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Build the cursor in the form TWEET-{oldestId}-{newestId}
        String maxPosition = "TWEET-" + maxTweet + "-" + minTweet;
        url = constructURL(query, maxPosition);
    }
}
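The executeSearch helper isn’t shown in the changed code above, so here is a minimal sketch of what it could look like, assuming the JSON response is deserialized with Gson into the TwitterResponse class described further down, and that getTweets() parses the individual tweets out of items_html. Swap in whatever HTTP client and JSON library you already use.

// Sketch only: requires java.io.*, java.net.HttpURLConnection, java.nio.charset.StandardCharsets
// and com.google.gson.Gson on the classpath
public static TwitterResponse executeSearch(final URL url) {
    try {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        StringBuilder json = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                json.append(line);
            }
        }
        // Gson maps has_more_items, items_html, min_position, etc. straight onto the class fields
        return new Gson().fromJson(json.toString(), TwitterResponse.class);
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}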
 
...
 
public final static String TYPE_PARAM = "f";
public final static String QUERY_PARAM = "q";
public final static String SCROLL_CURSOR_PARAM = "max_position";
public final static String TWITTER_SEARCH_URL = "https://twitter.com/i/search/timeline";
 
public static URL constructURL(final String query, final String maxPosition) throws InvalidQueryException {
    if(query==null || query.isEmpty()) {
        throw new InvalidQueryException(query);
    }
    try {
        URIBuilder uriBuilder;
        uriBuilder = new URIBuilder(TWITTER_SEARCH_URL);
        uriBuilder.addParameter(QUERY_PARAM, query);
        uriBuilder.addParameter(TYPE_PARAM, "tweets");
        if (maxPosition != null) {
            uriBuilder.addParameter(SCROLL_CURSOR_PARAM, maxPosition);
        }
        return uriBuilder.build().toURL();
    } catch(MalformedURLException | URISyntaxException e) {
        e.printStackTrace();
        throw new InvalidQueryException(query);
    }
}

Rather than our original scroll_cursor value, we now have “minTweet”. Initially this is set to null, as we don’t have a value to begin with. On our first call, we take the first tweet in our response and, if minTweet is still null, set minTweet to that tweet’s ID.

Next, we need to get the maxTweet. As mentioned above, we get this by taking the last tweet in our results and using its ID. So that we don’t repeat results, we check that minTweet does not equal maxTweet, and if they differ, we construct our “max_position” parameter in the format “TWEET-{maxTweetId}-{minTweetId}”.
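Using the tweet IDs from the example cursor earlier in the post, the constructed value would look like this:

max_position=TWEET-399159003478908931-606844263347945472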

You’ll also notice I changed the SCROLL_CURSOR_PARAM to “max_position” from “scroll_cursor”. Normally I’d change the variable name as well, but for visual reference, I’ve kept it the same for now, so you know where to change it.

Also, in constructURL, the TYPE_PARAM value has been set to “tweets”.

Finally, make sure you modify your TwitterResponse class so that it mirrors the fields returned in the JSON response.

All you need to do is replace the original class variables with these, and update the constructor and getter/setter fields:

private boolean has_more_items;
private String items_html;
private String min_position;
private String refresh_cursor;
private long focused_refresh_interval;
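
Putting that together, a minimal version of the updated TwitterResponse could look like the sketch below. The getTweets() method used in the scraping loop is assumed to parse the individual tweets out of items_html, so it is only indicated as a comment here.

public class TwitterResponse {

    private boolean has_more_items;
    private String items_html;
    private String min_position;
    private String refresh_cursor;
    private long focused_refresh_interval;

    public TwitterResponse() {
        // no-arg constructor so the JSON library can instantiate the class
    }

    public boolean hasMoreItems() { return has_more_items; }
    public void setHasMoreItems(boolean has_more_items) { this.has_more_items = has_more_items; }

    public String getItemsHtml() { return items_html; }
    public void setItemsHtml(String items_html) { this.items_html = items_html; }

    public String getMinPosition() { return min_position; }
    public void setMinPosition(String min_position) { this.min_position = min_position; }

    public String getRefreshCursor() { return refresh_cursor; }
    public void setRefreshCursor(String refresh_cursor) { this.refresh_cursor = refresh_cursor; }

    public long getFocusedRefreshInterval() { return focused_refresh_interval; }
    public void setFocusedRefreshInterval(long focused_refresh_interval) { this.focused_refresh_interval = focused_refresh_interval; }

    // getTweets() would parse the tweet elements out of items_html with an HTML parser,
    // as in the earlier parts of this series
}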
