How to modify regex-urlfilter.txt in Nutch to crawl links that match your criteria
阿新 • Published: 2017-11-27
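While crawling the Student Online site (學生在線), I found that certain notices could not be fetched, for example the "COFCO Fulinmen Student Aid Fund Application Announcement" (《中糧福臨門助學基金申請公告》). Analysis showed that the links to these notices were being filtered out. The URL filter configuration file regex-urlfilter.txt is walked through below, so that you can adapt it to your own situation when needed: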
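Note: in the configuration file, lines beginning with "#" are comments. A line beginning with "-" excludes any URL that matches its regular expression, and a line beginning with "+" keeps any URL that matches. Within the regular expressions, "^" matches the start of the string, "$" matches the end of the string, and "[]" denotes a character set. The comment that follows each original English comment is an annotation I added (originally written in Chinese).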
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
# (added note) drop links with non-HTTP schemes such as file: and ftp:
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# (added note) drop links to images and similar file formats
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
# (added note) the original rule was -[?*!@=], which drops links containing
# these special characters. To crawl more links, the rule is relaxed below
# so that links containing ? and = are no longer filtered out.
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# (added note) drop links whose path keeps repeating the same segment
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
# (added note) accept all remaining links. You can change this rule so that
# only the kinds of links you specify are accepted.
+.
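
The first-match rule described in the file's header comments can be sketched in a few lines of Python. This is only an illustration of the semantics, not Nutch's actual implementation: the function names `parse_rules` and `accepts` are hypothetical, the example.com URLs are made up, and the sketch assumes a pattern may match anywhere in the URL (Python's `re.search`), which is why "^" and "$" are needed to anchor a rule.

```python
import re

# Rules from the modified file above, ending with the stock catch-all "+."
# rule that accepts everything not rejected earlier.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[*!@]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
"""


def parse_rules(text):
    """Turn filter lines into (keep, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule that matches; ignore if none do."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False


rules = parse_rules(RULES_TEXT)
for url in (
    "http://example.com/list.php?page=2",    # query string, now allowed
    "http://example.com/logo.png",           # image suffix
    "mailto:someone@example.com",            # non-HTTP scheme
    "http://example.com/a/b/a/b/a/",         # repeating path segment
):
    print(url, "->", "keep" if accepts(url, rules) else "skip")
```

Because the first matching rule wins, the order of the lines matters: replacing the final `+.` with a narrower `+` pattern (for instance a hypothetical `+^https?://www\.online\.sdu\.edu\.cn/`) would restrict the crawl to that site, since everything else would then match no rule and be ignored.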
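Explanation: the announcement's link is http://www.online.sdu.edu.cn/news/article.php?pid=636514943, which contains the characters ? and =, so it was rejected by the rule that filters out special characters. After modifying regex-urlfilter.txt as shown above, links to announcements of this kind can be crawled.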
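
The effect of the change on this particular link can be checked directly. The sketch below is a quick sanity check in Python rather than anything Nutch runs: it tests the announcement URL against the character class of the original rule and of the relaxed rule, assuming a rule fires when its pattern is found anywhere in the URL. The helper name `rejected_by` is hypothetical.

```python
import re

url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"

original_rule = re.compile(r"[?*!@=]")  # character class from the stock "-" rule
relaxed_rule = re.compile(r"[*!@]")     # character class after the modification


def rejected_by(pattern, candidate):
    """A '-' rule rejects a URL when its pattern occurs anywhere in it."""
    return pattern.search(candidate) is not None


print(rejected_by(original_rule, url))          # True: the URL contains ? and =
print(sorted(set(original_rule.findall(url))))  # ['=', '?']
print(rejected_by(relaxed_rule, url))           # False: no *, ! or @ in the URL
```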