
How to modify Nutch's regex-urlfilter.txt to crawl only the links that match your conditions



For example, while crawling the Student Online site (学生在线) I found that certain notices could not be crawled, such as 《中糧福臨門助學基金申請公告》 (a student aid fund application announcement). Analysis showed that the links to these notices were being filtered out. Below I walk through regex-urlfilter.txt, the configuration file that filters URLs; if you need to change the filtering later, you can adapt this file to your own situation.

Notes on the file format: lines starting with "#" are comments; a line starting with "-" means that any URL matching its regular expression is filtered out; a line starting with "+" means that any matching URL is kept. In the regular expressions, "^" anchors the start of the string, "$" anchors the end, and "[]" denotes a character class. In the listing below, the additional comments under the original English ones are annotations I added.
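To make the rule semantics concrete before looking at the full file: Nutch walks the rules from top to bottom, and the first pattern that matches decides whether the URL is kept or dropped; if nothing matches, the URL is ignored. The following is only a minimal sketch of that first-match-wins behaviour using plain java.util.regex (it is not Nutch's actual RegexURLFilter plugin, and the class and method names are just illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch of the first-match-wins semantics of regex-urlfilter.txt.
    // NOT Nutch's real RegexURLFilter implementation, just an illustration.
    public class SimpleRegexUrlFilter {

        // One parsed rule: sign ('+' keep / '-' drop) plus its compiled regex.
        private static class Rule {
            final boolean accept;
            final Pattern pattern;
            Rule(boolean accept, String regex) {
                this.accept = accept;
                this.pattern = Pattern.compile(regex);
            }
        }

        private final List<Rule> rules = new ArrayList<>();

        // Each non-comment, non-blank line is "+regex" or "-regex".
        public void addLine(String line) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                return; // skip comments and blank lines
            }
            boolean accept = line.charAt(0) == '+';
            rules.add(new Rule(accept, line.substring(1)));
        }

        // Returns true if the URL should be kept, false if it is filtered out.
        // The first matching rule decides; if no rule matches, the URL is ignored.
        public boolean accepts(String url) {
            for (Rule rule : rules) {
                // find() looks for the pattern anywhere in the URL
                if (rule.pattern.matcher(url).find()) {
                    return rule.accept;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            SimpleRegexUrlFilter filter = new SimpleRegexUrlFilter();
            filter.addLine("-^(file|ftp|mailto):");
            filter.addLine("-[*!@]");
            filter.addLine("+.");
            System.out.println(filter.accepts("ftp://example.com/a.txt"));  // false: first rule matches
            System.out.println(filter.accepts(
                "http://www.online.sdu.edu.cn/news/article.php?pid=636514943"));  // true: falls through to +.
        }
    }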

    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements. See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License. You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    # The default url filter.
    # Better for whole-internet crawling.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # skip file: ftp: and mailto: urls
    # filter out file:, ftp: and other links that are not HTTP
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    # filter out links to images and other such formats
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]  filters out links containing special characters; because I want to crawl more links, I relaxed this rule so that links containing ? and = are no longer filtered out
    -[*!@]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    # filter out links whose path segments repeat (to avoid crawler traps)
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept anything else
    # accept all remaining links; you can modify this rule so that only links of the type you specify are accepted
    +.

Why this works: the announcement links look like http://www.online.sdu.edu.cn/news/article.php?pid=636514943, which contains the characters ? and =, so they were rejected by the rule that filters out URLs with special characters. After modifying regex-urlfilter.txt as shown above, links to this kind of announcement can finally be crawled.
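To verify the effect of the change in isolation, here is a small self-contained check using plain java.util.regex (a sketch, not Nutch code; find() mirrors the "match anywhere in the URL" behaviour): the original rule -[?*!@=] matches the announcement URL because of its ? and =, while the relaxed rule -[*!@] does not, so the URL falls through to the final +. rule and is accepted.

    import java.util.regex.Pattern;

    // Quick check of why the announcement URL was dropped by the default rule
    // -[?*!@=] but passes the relaxed rule -[*!@].
    public class UrlFilterRuleCheck {
        public static void main(String[] args) {
            String url = "http://www.online.sdu.edu.cn/news/article.php?pid=636514943";

            boolean droppedByDefaultRule = Pattern.compile("[?*!@=]").matcher(url).find();
            boolean droppedByRelaxedRule = Pattern.compile("[*!@]").matcher(url).find();

            System.out.println("matched by -[?*!@=] (filtered out): " + droppedByDefaultRule); // true
            System.out.println("matched by -[*!@]   (filtered out): " + droppedByRelaxedRule); // false
        }
    }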
