
How to modify Nutch's regex-urlfilter.txt to crawl only the links that match your conditions



For example, while crawling the Student Online site (学生在线) I found that certain notices could not be crawled, such as 《中糧福臨門助學基金申請公告》 (a student aid fund application announcement). Analysis showed that the links to these notices were being filtered out. Below I walk through regex-urlfilter.txt, the configuration file that filters URLs; if you need to change the filtering later, you can adapt this file to your own situation.

Notes on the file format: lines starting with "#" are comments; a line starting with "-" means that any URL matching its regular expression is filtered out; a line starting with "+" means that any matching URL is kept. In the regular expressions, "^" anchors the start of the string, "$" anchors the end, and "[]" denotes a character class. In the listing below, the additional comments under the original English ones are annotations I added.
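To make the rule semantics concrete before looking at the full file: Nutch walks the rules from top to bottom, and the first pattern that matches decides whether the URL is kept or dropped; if nothing matches, the URL is ignored. The following is only a minimal sketch of that first-match-wins behaviour using plain java.util.regex (it is not Nutch's actual RegexURLFilter plugin, and the class and method names are just illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch of the first-match-wins semantics of regex-urlfilter.txt.
    // NOT Nutch's real RegexURLFilter implementation, just an illustration.
    public class SimpleRegexUrlFilter {

        // One parsed rule: sign ('+' keep / '-' drop) plus its compiled regex.
        private static class Rule {
            final boolean accept;
            final Pattern pattern;
            Rule(boolean accept, String regex) {
                this.accept = accept;
                this.pattern = Pattern.compile(regex);
            }
        }

        private final List<Rule> rules = new ArrayList<>();

        // Each non-comment, non-blank line is "+regex" or "-regex".
        public void addLine(String line) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                return; // skip comments and blank lines
            }
            boolean accept = line.charAt(0) == '+';
            rules.add(new Rule(accept, line.substring(1)));
        }

        // Returns true if the URL should be kept, false if it is filtered out.
        // The first matching rule decides; if no rule matches, the URL is ignored.
        public boolean accepts(String url) {
            for (Rule rule : rules) {
                // find() looks for the pattern anywhere in the URL
                if (rule.pattern.matcher(url).find()) {
                    return rule.accept;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            SimpleRegexUrlFilter filter = new SimpleRegexUrlFilter();
            filter.addLine("-^(file|ftp|mailto):");
            filter.addLine("-[*!@]");
            filter.addLine("+.");
            System.out.println(filter.accepts("ftp://example.com/a.txt"));  // false: first rule matches
            System.out.println(filter.accepts(
                "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"));  // true: falls through to +.
        }
    }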

    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements. See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License. You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    # The default url filter.
    # Better for whole-internet crawling.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # skip file: ftp: and mailto: urls
    # filter out file:, ftp: and other links that are not HTTP
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    # filter out links to images and other such formats
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]  filters out links containing special characters; because I want to crawl more links, I relaxed this rule so that links containing ? and = are no longer filtered out
    -[*!@]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    # filter out links whose path segments repeat (to avoid crawler traps)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    # accept all remaining links; you can modify this rule so that only links of the type you specify are accepted
    +.

Why this works: the announcement links look like http://www.online.sdu.edu.cn/news/article.php?pid=636514943, which contains the characters ? and =, so they were rejected by the rule that filters out URLs with special characters. After modifying regex-urlfilter.txt as shown above, links to this kind of announcement can finally be crawled.
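To verify the effect of the change in isolation, here is a small self-contained check using plain java.util.regex (a sketch, not Nutch code; find() mirrors the "match anywhere in the URL" behaviour): the original rule -[?*!@=] matches the announcement URL because of its ? and =, while the relaxed rule -[*!@] does not, so the URL falls through to the final +. rule and is accepted.

    import java.util.regex.Pattern;

    // Quick check of why the announcement URL was dropped by the default rule
    // -[?*!@=] but passes the relaxed rule -[*!@].
    public class UrlFilterRuleCheck {
        public static void main(String[] args) {
            String url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943";

            boolean droppedByDefaultRule = Pattern.compile("[?*!@=]").matcher(url).find();
            boolean droppedByRelaxedRule = Pattern.compile("[*!@]").matcher(url).find();

            System.out.println("matched by -[?*!@=] (filtered out): " + droppedByDefaultRule); // true
            System.out.println("matched by -[*!@]   (filtered out): " + droppedByRelaxedRule); // false
        }
    }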
