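Mac Nutch + MySQL Integration Notes
<div property="schema:text" class="field field--name-body field--type-text-with-summary field--label-hidden field__item"><p> Purpose: <em>automatically store the data fetched by the Nutch crawler in MySQL</em> </p>
Part of: the Nutch + Hadoop + HBase (MySQL) + Elasticsearch + PHP hands-on series
Installing MySQL on Mac
No special configuration is needed: click Next through the installer and, at the end, note down the password shown in the pop-up window.
Download: http://dev.mysql.com/downloads/mysql/
Installing, configuring, and using Nutch
1. Download Nutch 2.3.1 from http://nutch.apache.org/downloads.html and extract it to a local install directory. The extracted root directory is referred to below as ${NUTCH_HOME}.
2. Enable MySQL support in Nutch by editing ${NUTCH_HOME}/ivy/ivy.xml as follows:
1) Find the following line and uncomment it: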
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
2) Change the following line.
The default is:
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
Change it to:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
3) Uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Note: if steps 2) and 3) are skipped, the following exception is thrown:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
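Since two of these dependencies start out commented, it is easy to leave one of them inactive. The following is a minimal shell sketch, not part of Nutch, that strips XML comments from ivy.xml and reports which of the three dependencies are active. The function names strip_xml_comments and check_ivy_deps are made up for this illustration, and the check only looks at the name attribute, not at the rev value.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not shipped with Nutch).

# Print an XML file with all <!-- ... --> comments removed,
# including comments that span several lines.
strip_xml_comments() {
  awk '
    {
      line = $0
      out = ""
      while (length(line) > 0) {
        if (in_comment) {
          i = index(line, "-->")
          if (i == 0) { line = "" }
          else { line = substr(line, i + 3); in_comment = 0 }
        } else {
          i = index(line, "<!--")
          if (i == 0) { out = out line; line = "" }
          else { out = out substr(line, 1, i - 1); line = substr(line, i + 4); in_comment = 1 }
        }
      }
      print out
    }
  ' "$1"
}

# Report, for each dependency needed here, whether it is active in ivy.xml.
# Returns non-zero if any of them is missing or still commented out.
check_ivy_deps() {
  ivy_file="$1"
  rc=0
  for dep in mysql-connector-java gora-core gora-sql; do
    if strip_xml_comments "$ivy_file" | grep -q "name=\"$dep\""; then
      echo "ok: $dep"
    else
      echo "missing: $dep"
      rc=1
    fi
  done
  return "$rc"
}
```

It could be called as check_ivy_deps "${NUTCH_HOME}/ivy/ivy.xml" before starting the build, so that a forgotten comment shows up immediately rather than as a ClassNotFoundException at crawl time.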
3. Database connection configuration
Edit ${NUTCH_HOME}/conf/gora.properties: comment out the default data store settings and add the following:
###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=
Fill in the address, user name, and password of the database you want to connect to.
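To double-check which values will actually be read from the file, the active (uncommented) entries can be printed back. This is a small sketch under the assumption that the file uses plain key=value lines with no leading whitespace; get_prop is a hypothetical helper name, not a Nutch or Gora command.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an active key=value entry in a
# properties file. Lines starting with "#" never match, and if the key is
# defined more than once, the last definition is printed.
# Returns non-zero when the key has no active entry.
get_prop() {
  prop_file="$1"
  key="$2"
  awk -F= -v k="$key" '
    $1 == k { sub(/^[^=]*=/, ""); v = $0; found = 1 }
    END { if (found) print v; else exit 1 }
  ' "$prop_file"
}
```

For example, get_prop "${NUTCH_HOME}/conf/gora.properties" gora.sqlstore.jdbc.url should print the JDBC URL entered above. Everything after the first "=" is kept, so the createDatabaseIfNotExist=true part of the URL survives intact.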
4. Edit the nutch-site configuration file
Add the following inside the configuration element of ${NUTCH_HOME}/conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>LiuXun Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ….
  </description>
</property>
<!-- Specially added -->
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
5. Build Nutch 2.3.1
- Go to ${NUTCH_HOME} and run the ant command: ant runtime
- After a successful build, a runtime directory appears under ${NUTCH_HOME}
Building Nutch
apache-nutch-2.3.1 (master) $ ant
Buildfile: /Users/hackgyj/apache-nutch-2.3.1/build.xml
Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/classes
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/release
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test
    [mkdir] Created dir: /Users/hackgyj/apache-nutch-2.3.1/build/test/classes

clean-lib:

resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /Users/hackgyj/apache-nutch-2.3.1/ivy/ivysettings.xml
The output above reports that org/sonar/ant/antlib.xml could not be loaded. To fix this, download the Sonar jar (sonar-ant-task-2.2.jar) and put it in the lib folder inside the extracted apache-nutch-2.3.1 directory. The build also downloads its dependencies over the network, so it takes a while depending on your connection; mine took about an hour!
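Then run on the command line:
ant clean
and then:
ant runtime
OK, no more errors and the build succeeds. Two new folders appear in the directory, build and runtime; runtime is the compiled output.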
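Before moving on, it can help to confirm that the build really produced the scripts used in the next step. The sketch below is only an illustration: verify_runtime is a hypothetical helper name, and the two paths it checks are the bin/nutch and bin/crawl scripts under runtime/local that this post goes on to use.

```shell
#!/bin/sh
# Hypothetical helper: check that a Nutch build produced the local runtime
# scripts that the following steps rely on. Prints the first missing file
# and returns non-zero if the runtime is incomplete.
verify_runtime() {
  nutch_home="$1"
  for f in runtime/local/bin/nutch runtime/local/bin/crawl; do
    if [ ! -f "$nutch_home/$f" ]; then
      echo "missing: $f"
      return 1
    fi
  done
  echo "runtime looks complete"
}
```

A call such as verify_runtime "${NUTCH_HOME}" right after ant runtime gives a quick yes/no answer without having to browse the directory tree.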
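6. Crawling pages and related configuration
- Go to ${NUTCH_HOME}/runtime/local
- Set the site to crawl
Run:
mkdir -p urls    # create the directory that holds the seed URLs
echo 'http://www.oschina.net/' > urls/seed.txt    # write the URL to crawl
bin/nutch crawl urls -depth 3 -topN 5    # start the crawl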
Error: JAVA_HOME is not set.
This message means JAVA_HOME is not set.
On Mac OS X El Capitan 10.11.6, find and set $JAVA_HOME with the following commands:
~ (master) $ which java
/usr/bin/java
~ (master) $ ls -l /usr/bin/java
lrwxr-xr-x 1 root wheel 74 Oct 20 2015 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java
~ (master) $ ls -l /System/Library/Frameworks/JavaVM.framework/Versions
total 64
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.4 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.4.2 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.5 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.5.0 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.6 -> CurrentJDK
lrwxr-xr-x 1 root wheel 10 Oct 20 2015 1.6.0 -> CurrentJDK
drwxr-xr-x 10 root wheel 340 Oct 7 13:55 A
lrwxr-xr-x 1 root wheel 1 Oct 20 2015 Current -> A
lrwxr-xr-x 1 root wheel 52 Oct 20 2015 CurrentJDK -> /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents
~ (master) $ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
~ (master) $ /usr/libexec/java_home -V
Matching Java Virtual Machines (3):
    1.8.0_91, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
    1.6.0_65-b14-468, x86_64: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    1.6.0_65-b14-468, i386: "Java SE 6" /Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home

/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
# Open the user profile and add this line:
# export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
~ (master) $ open ~/.profile
# After saving, reload the user profile
~ (master) $ source ~/.profile
~ (master) $ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
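Instead of hard-coding the JDK path, the profile could ask /usr/libexec/java_home for it, so the setting keeps working after a JDK update. The following is a sketch of that idea, not something the steps above require. resolve_java_home is a hypothetical function name; its first argument (the path to the java_home tool) exists only so the function can be tried on machines without macOS, and the default version 1.8 simply matches the JDK listed above.

```shell
#!/bin/sh
# Hypothetical helper: ask the macOS java_home tool for a JDK of the given
# version and print its home directory. Fails if the tool is absent,
# reports an error, or prints nothing.
resolve_java_home() {
  helper="${1:-/usr/libexec/java_home}"
  version="${2:-1.8}"
  if [ -x "$helper" ] && jh="$("$helper" -v "$version" 2>/dev/null)" && [ -n "$jh" ]; then
    printf '%s\n' "$jh"
  else
    return 1
  fi
}

# Export JAVA_HOME only when resolution succeeds; otherwise leave it untouched.
if java_home_dir="$(resolve_java_home)"; then
  JAVA_HOME="$java_home_dir"
  export JAVA_HOME
fi
echo "JAVA_HOME=${JAVA_HOME:-not set}"
```

Placed in ~/.profile, this replaces the fixed export line. On a system without /usr/libexec/java_home it leaves any existing JAVA_HOME as it was.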
Command crawl is deprecated, please use bin/crawl instead
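This message appears when running bin/nutch crawl urls -depth 3 -topN 5. After some research, it turns out that Nutch 2.3.1 no longer supports this form.
From versions 1.7 and 2.2.1 onward, bin/crawl replaces bin/nutch crawl. The correct form is:
bin/crawl urls/ test 5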
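The seed setup and the new-style crawl call can be wrapped together so that the script fails with a clear message when run from the wrong directory. This is only a sketch: prepare_seed and run_crawl are hypothetical names, and the reading of the three bin/crawl arguments as seed directory, crawl ID, and number of rounds is an assumption based on the command shown above.

```shell
#!/bin/sh
# Hypothetical helper: create the seed directory and write one URL into seed.txt.
prepare_seed() {
  seed_dir="$1"
  seed_url="$2"
  mkdir -p "$seed_dir"
  printf '%s\n' "$seed_url" > "$seed_dir/seed.txt"
}

# Hypothetical wrapper: run bin/crawl with the three arguments used in this
# post (assumed to be seed directory, crawl ID, number of rounds), but only
# if the script exists in the current directory.
run_crawl() {
  if [ -x bin/crawl ]; then
    bin/crawl "$1" "$2" "$3"
  else
    echo "bin/crawl not found; run this from \${NUTCH_HOME}/runtime/local" >&2
    return 1
  fi
}
```

From ${NUTCH_HOME}/runtime/local, the equivalent of the commands in this step would be prepare_seed urls 'http://www.oschina.net/' followed by run_crawl urls/ test 5.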
Good, it runs now, but another problem appears:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/ipc/ByteBufferOutputStream
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:191)
    at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:93)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:77)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.ipc.ByteBufferOutputStream
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more
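That was demoralizing. More searching turned up the answer: Nutch 2.3.1 does not support MySQL. What...
The options are:
- either use a 2.2.x release or fall back to Nutch 1.x
- or replace MySQL with HBase for storage
My choice, obviously, was to drop Nutch 2.3.1 and use Nutch 2.2.1. That cost me a lot of time!
If the error below appears, add the generate.batch.id property from the nutch-site.xml step above (the one marked as specially added).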
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local200289520_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
Success with Nutch 2.2.1
local (master) $ bin/nutch crawl urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
Installing and configuring Nutch 2.2.1 works the same way as above. Details such as mismatched version numbers can be corrected by following the error messages, and in the end it worked.
The nutch command was covered in the earlier sections. Once it finishes, the crawled content can be viewed in MySQL (the original post shows a screenshot here).
Original article: http://www.bigdataway.net/node/502