
Setting Up a Hadoop Development Environment on Windows and Submitting Jobs Remotely

This article walks through setting up a Hadoop development environment on Windows and shows in detail how to develop a Hadoop MapReduce program in IntelliJ IDEA and submit it to a remote cluster.
Prerequisites:

  • Download Hadoop on your local machine. No installation or configuration changes are needed, but set HADOOP_HOME, JAVA_HOME, and so on (see the sketch after this list for a code-only alternative).
  • Download winutils and extract it into $HADOOP_HOME/bin.
  • If the cluster configuration refers to machines by host name, add those host names to your local hosts file (not strictly required, just more convenient).
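If you would rather not touch the system environment variables, the Hadoop home directory can also be set from code. This is only a sketch: hadoop.home.dir is the system property that Hadoop's Shell class checks before falling back to HADOOP_HOME, the path is just an example, and the property must be set early in main(), before any Hadoop class is loaded.

// Equivalent to setting the HADOOP_HOME environment variable (example path)
System.setProperty("hadoop.home.dir", "E:\\hadoop-2.8.0");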

Method 1: Maven project

1. Creating a Maven project in IntelliJ IDEA is not covered in detail here; start by creating one.
2. Edit pom.xml. Once the dependencies below are filled in, IDEA downloads the jars and adds them to the project automatically (the Hadoop artifact versions should match your cluster; 2.8.0 is used here).

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>2.8.0</version>
    </dependency>
</dependencies>

Method 2: Plain Java project

1. Create a Java project in IntelliJ IDEA.

2. Add the Hadoop jars as dependencies.

(screenshots)

After the import succeeds:

(screenshot)

3. Write the code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit <word, 1> for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit <word, total count>
        }
    }

    // Deletes the given directory if it exists (useful for clearing the output path).
    private static void deleteDir(Configuration conf, String dirPath) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path targetPath = new Path(dirPath);
        if (fs.exists(targetPath)) {
            boolean delResult = fs.delete(targetPath, true);
            if (delResult) {
                System.out.println(targetPath + " has been deleted successfully.");
            } else {
                System.out.println(targetPath + " deletion failed.");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        /* String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        // delete the output directory first
        deleteDir(conf, otherArgs[otherArgs.length - 1]); */
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The program counts the occurrences of every word in all files under the directory given by the first argument (a local path when testing, or an HDFS path such as hdfs://192.168.89.135:9000/input when running against the cluster).
The output directory given by the second argument is created automatically; make sure it does not exist before running (or uncomment the deleteDir block at the top of main() to remove it first).

4. Edit the run configuration and set the program arguments.
(screenshot)

5. Run it; the job completes successfully.

(screenshot)

Remote configuration

Create a resources directory and mark it as the project's Resources root.

(screenshot)

Add a core-site.xml file to the resources directory:

(screenshot)

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.89.135:9000</value>
    </property>
</configuration>

You can copy this directly from the Hadoop configuration files on the cluster.
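To verify that the resources directory is being picked up and the remote HDFS is reachable, you can list the file system root from the IDE. This is just a sketch; it assumes the core-site.xml above is on the classpath, that the cluster is running, and that your user may read the root directory (FileStatus comes from org.apache.hadoop.fs, alongside the imports already used above).

// Sketch: list the HDFS root using the core-site.xml from the resources directory
Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);       // connects to hdfs://192.168.89.135:9000
for (FileStatus status : fs.listStatus(new Path("/"))) {
    System.out.println(status.getPath());
}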

Edit the run configuration again and change the input and output paths to remote HDFS paths (for example hdfs://192.168.89.135:9000/input and hdfs://192.168.89.135:9000/output).
(screenshot)

Local submission

If your Hadoop installation and IDEA are on the same machine, you can submit in local mode.
1. Copy core-site.xml and log4j.properties into the project's source root, so that after compilation both files can be found in the class output directory. Why? When you submit the job directly from IDEA, the configuration files under the class directory are what get loaded: without log4j.properties, log4j reports that it is not initialized and no job information is printed; likewise, without core-site.xml you run into HDFS permission errors and the like.
2. Run the class's main method directly in IDEA and the job is submitted to the local pseudo-distributed Hadoop installation, where the code can be debugged.
3. Note: even though mapred-site.xml in the Hadoop configuration specifies YARN scheduling, debugging shows that the job is actually submitted through the local runner (LocalJobRunner), not YARN (a quick way to check is shown in the sketch after this list). There are two reasons:
[Reason 1] mapred-site.xml and yarn-site.xml also need to be placed in the resources directory.
[Reason 2] the program has to be packaged into a jar before the job can be submitted remotely; see the next section, Remote submission.
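A minimal way to confirm which runner the client will actually use is to print the effective framework setting before building the job; "local" is Hadoop's default when no mapred-site.xml is found on the classpath.

// Prints "yarn" when the YARN settings were picked up, "local" otherwise
System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name", "local"));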

Remote submission

If Hadoop runs on a cluster or on a different server from IDEA, you can submit remotely; with hadoop-2.8.0 the jobs are scheduled through YARN.
1. Put core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, log4j.properties, and so on into the resources directory. If you do not add these files, the corresponding settings must be made in code:

conf.set("mapreduce.job.jar", "E:\\hadoop\\myhadoop\\out\\artifacts\\wordcount\\wordcount.jar");//指定Jar包,也可以在job中設定
conf.set("mapreduce.framework.name", "yarn");//以yarn形式提交
conf.set("yarn.resourcemanager.hostname", "master");
conf.set("mapreduce.app-submission.cross-platform", "true");//跨平臺提交

If the cluster restricts HDFS access, for example only a specific user xxx is allowed, you can set that user in the program:

System.setProperty("HADOOP_USER_NAME", "xxx");

2. Package the project first, using Maven or IDEA's artifact build.

  • Maven
mvn package
  • IDEA artifact build

    Because the cluster already has the Hadoop environment, there is no need to bundle the dependencies into the jar; choose Empty, which keeps builds fast while debugging.

    Project Structure => Artifacts => click + in the top-left corner => Empty => Output Layout + => Module Output => select the project folder => click the jar and set the Main Class.

3. Set the jar explicitly in the program code with job.setJar:

job.setJar("E:\\hadoop\\myhadoop\\out\\artifacts\\wordcount\\wordcount.jar");

4. In the program, port 10020 is the Hadoop job history service, which must be started on the server:

mr-jobhistory-daemon.sh start historyserver & # start the history server
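If the mapred-site.xml in the resources directory does not already carry the history server address, the client can also be pointed at it in code; the host name master is an assumption carried over from the earlier examples, and 10020 is the default history server port.

conf.set("mapreduce.jobhistory.address", "master:10020"); // job history server (default port 10020)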

5. Run the program in IDEA to submit the job; with this submission method you can also debug the source code from within IDEA.

6. Automatically upload the jar to the cluster (optional)
Tools -> Deployment -> Configuration, click + in the top-left corner, choose SFTP as the Type, then configure the server IP, deployment path, user name, password, and so on, and enable automatic deployment. Every change is then deployed to the server automatically; you can also right-click and choose Deployment => Upload to ….

Common problems:

Problem 1:

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: Could not locate Hadoop executable: E:\hadoop-2.8.0\bin\winutils.exe -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:716)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:250)
    at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:267)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:771)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:515)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:555)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:533)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:313)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:133)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1359)
    at WordCount.main(WordCount.java:92)
Caused by: java.io.FileNotFoundException: Could not locate Hadoop executable: E:\hadoop-2.8.0\bin\winutils.exe -see https://wiki.apache.org/hadoop/WindowsProblems
    at org.apache.hadoop.util.Shell.getQualifiedBinInner(Shell.java:598)
    at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:572)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:669)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:441)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:487)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at WordCount.main(WordCount.java:71)
Process finished with exit code 1

Solution: put winutils.exe in $HADOOP_HOME/bin.

Problem 2:

2017-08-04 12:31:00,668 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-08-04 12:31:01,230 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1181)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-08-04 12:31:01,230 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(79)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-08-04 12:31:01,495 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-08-04 12:31:01,542 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(289)) - Total input files to process : 1
2017-08-04 12:31:01,870 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(200)) - number of splits:1
2017-08-04 12:31:02,104 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(289)) - Submitting tokens for job: job_local1047774324_0001
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:606)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:958)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:203)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:190)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:124)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:314)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:377)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:151)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:116)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:171)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:758)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:242)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
2017-08-04 12:31:02,167 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(251)) - Cleaning up the staging area file:/tmp/hadoop/mapred/staging/alex1047774324/.staging/job_local1047774324_0001
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1359)
    at WordCount.main(WordCount.java:92)

Solution: hadoop.dll is missing; put hadoop.dll in $HADOOP_HOME/bin.

Problem 3:

2017-08-04 12:47:49,125 INFO  [main] ipc.Client (Client.java:handleConnectionTimeout(897)) - Retrying connect to server: master/192.168.89.135:9000. Already tried 0 time(s); maxRetries=45

Solution: Hadoop is not running on the remote host. If it is running, check that firewalld.service and iptables.service have been stopped (e.g. systemctl stop firewalld).

Problem 4:

2017-11-29 21:10:22,214 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(123)) - Connecting to ResourceManager at master/192.168.89.136:8032
2017-11-29 21:10:23,259 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(289)) - Total input files to process : 1
2017-11-29 21:10:24,216 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(200)) - number of splits:1
2017-11-29 21:10:24,769 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(289)) - Submitting tokens for job: job_1511957984981_0007
2017-11-29 21:10:24,984 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(296)) - Submitted application application_1511957984981_0007
2017-11-29 21:10:25,024 INFO  [main] mapreduce.Job (Job.java:submit(1345)) - The url to track the job: http://master:8088/proxy/application_1511957984981_0007/
2017-11-29 21:10:25,024 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1390)) - Running job: job_1511957984981_0007
2017-11-29 21:10:28,088 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1411)) - Job job_1511957984981_0007 running in uber mode : false
2017-11-29 21:10:28,090 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1418)) -  map 0% reduce 0%
2017-11-29 21:10:28,164 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1431)) - Job job_1511957984981_0007 failed with state FAILED due to: Application application_1511957984981_0007 failed 2 times due to AM Container for appattempt_1511957984981_0007_000002 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1511957984981_0007_02_000001
Exit code: 1
Exception message: /bin/bash: line 0: fg: no job control

Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 1
For more detailed output, check the application tracking page: http://master:8088/cluster/app/application_1511957984981_0007 Then click on links to logs of each attempt.
. Failing the application.
2017-11-29 21:10:28,199 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1436)) - Counters: 0

Process finished with exit code 1

This is caused by cross-platform submission from Windows to the remote Linux cluster.

Solution: add the following to the code:

conf.set("mapreduce.app-submission.cross-platform", "true");