kgretzky/dcrawl: Simple, but smart, multi-threaded web crawler for randomly gathering huge lists of unique domain names.

dcrawl

dcrawl is a simple, but smart, multi-threaded web crawler for randomly gathering huge lists of unique domain names.

How does it work?

dcrawl takes one site URL as input and detects all <a href=...> links in the site's body. Every link it finds is added to a queue. Each queued link is then crawled the same way, branching out to further URLs found in the links of each site's body.
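The link-detection and queueing step described above can be sketched as follows. This is a minimal illustration, not dcrawl's actual code: the regular expression, the `extractLinks` helper and the sample page are all assumptions made for the example, and a real crawler may prefer a proper HTML parser.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefRe is a deliberately simple pattern for <a href="..."> attributes.
// Assumption: a regexp is good enough for a sketch.
var hrefRe = regexp.MustCompile(`(?i)<a\s[^>]*?href\s*=\s*["']([^"']+)["']`)

// extractLinks (hypothetical helper) returns the absolute http(s) URLs
// found in body, resolved against the page URL base.
func extractLinks(base, body string) []string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil
	}
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip malformed href values
		}
		abs := baseURL.ResolveReference(ref)
		if abs.Scheme == "http" || abs.Scheme == "https" {
			links = append(links, abs.String())
		}
	}
	return links
}

func main() {
	body := `<a href="/about">About</a> <a class="x" href="http://other.example/">Other</a>`
	// The queue starts with the links of the first page; a real crawler
	// would fetch each entry and append the links it finds there.
	queue := extractLinks("http://example.com/index.html", body)
	for len(queue) > 0 {
		next := queue[0]
		queue = queue[1:]
		fmt.Println("would crawl:", next)
	}
}
```

Relative links are resolved against the page URL, and non-HTTP schemes such as mailto: are dropped before anything reaches the queue.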

How smart crawling works:

  • Branches out only to a predefined number of links per hostname.
  • Limits the number of different hostnames allowed per domain (avoids subdomain crawling hell, e.g. blogspot.com).
  • Can be restarted with the same list of domains: the last saved domains are added to the URL queue.
  • Crawls only sites that return a text/html Content-Type in the HEAD response.
  • Retrieves at most 1 MB of each site's body.
  • Does not save inaccessible domains.
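The Content-Type check and the body size cap from the list above could look roughly like this in Go. It is a sketch under assumptions, not the tool's implementation: `fetchHTML`, `isHTML`, `maxBodyBytes` and the 10-second timeout are names and values chosen for illustration, and 1 MB is taken here as 1 << 20 bytes.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"os"
	"time"
)

// maxBodyBytes caps how much of a page is read (assumed: 1 << 20 bytes).
const maxBodyBytes = 1 << 20

// isHTML reports whether a Content-Type header value names text/html.
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "text/html"
}

// fetchHTML (hypothetical helper) sends a HEAD request first and downloads
// the body only when the server announces text/html.
func fetchHTML(client *http.Client, rawURL string) ([]byte, error) {
	head, err := client.Head(rawURL)
	if err != nil {
		return nil, err // inaccessible site: the caller can skip it
	}
	head.Body.Close()
	if !isHTML(head.Header.Get("Content-Type")) {
		return nil, fmt.Errorf("%s: not text/html", rawURL)
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// LimitReader stops after maxBodyBytes, so larger pages are truncated.
	return io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: fetch URL")
		return
	}
	client := &http.Client{Timeout: 10 * time.Second}
	body, err := fetchHTML(client, os.Args[1])
	if err != nil {
		fmt.Println("skipped:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(body))
}
```

Sending HEAD before GET avoids downloading images, archives and other non-HTML resources, at the cost of one extra round trip per URL.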

How to run?

go build dcrawl.go
./dcrawl -url http://wired.com -out ~/domain_lists/domains1.txt -t 8

Usage

     ___                          __
  __| _/________________ __  _  _|  |
 / __ |/ ___\_  __ \__  \\ \/ \/ /  |
/ /_/ \  \___|  | \// __ \\     /|  |__
\____ |\___  >__|  (____  /\/\_/ |____/
     \/    \/           \/       v.1.0

usage: dcrawl -url URL -out OUTPUT_FILE -t THREADS

  -ms int
        maximum different subdomains for one domain (def. 10) (default 10)
  -mu int
        maximum number of links to spider per hostname (def. 5) (default 5)
  -out string
        output file to save hostnames to
  -t int
        number of concurrent threads (def. 8) (default 8)
  -url string
        URL to start scraping from
  -v bool
        verbose (default false)
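One way to picture the -mu and -ms limits is as two counters: links taken per hostname, and distinct hostnames seen per domain. The sketch below is an assumption about how such limits might be enforced, not dcrawl's code. The `limiter` type and `baseDomain` helper are hypothetical, `baseDomain` naively keeps the last two labels of a hostname (so it mishandles suffixes such as co.uk), and the type is not safe for concurrent use without a mutex.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// limiter (hypothetical) tracks links taken per hostname and the distinct
// hostnames seen per domain. Not goroutine-safe as written.
type limiter struct {
	maxLinksPerHost   int // corresponds to -mu
	maxHostsPerDomain int // corresponds to -ms
	linksPerHost      map[string]int
	hostsPerDomain    map[string]map[string]bool
}

func newLimiter(maxLinks, maxHosts int) *limiter {
	return &limiter{
		maxLinksPerHost:   maxLinks,
		maxHostsPerDomain: maxHosts,
		linksPerHost:      make(map[string]int),
		hostsPerDomain:    make(map[string]map[string]bool),
	}
}

// baseDomain naively keeps the last two labels of a hostname.
func baseDomain(host string) string {
	parts := strings.Split(host, ".")
	if len(parts) <= 2 {
		return host
	}
	return strings.Join(parts[len(parts)-2:], ".")
}

// allow reports whether rawURL may be queued and records it if so.
func (l *limiter) allow(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return false
	}
	host := strings.ToLower(u.Hostname())
	domain := baseDomain(host)
	hosts := l.hostsPerDomain[domain]
	if hosts == nil {
		hosts = make(map[string]bool)
		l.hostsPerDomain[domain] = hosts
	}
	if !hosts[host] {
		if len(hosts) >= l.maxHostsPerDomain {
			return false // too many hostnames for this domain
		}
		hosts[host] = true
	}
	if l.linksPerHost[host] >= l.maxLinksPerHost {
		return false // link budget for this hostname is used up
	}
	l.linksPerHost[host]++
	return true
}

func main() {
	l := newLimiter(5, 10) // same values as the -mu and -ms defaults
	for _, u := range []string{"http://a.blogspot.com/1", "http://b.blogspot.com/", "http://example.com/"} {
		fmt.Println(u, l.allow(u))
	}
}
```

With these two counters, a site that exposes thousands of subdomains can contribute at most -ms hostnames and -mu links per hostname to the queue.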

License

dcrawl was made by Kuba Gretzky from breakdev.org and released under the MIT license.
