Submitting Jobs to a Hadoop Cluster from IDEA
1. Overview
2. Configure the local Hadoop environment
2.1 Extract hadoop-2.6.1.tar.gz to a directory of your choice
I extracted it to E:\java\hadoop-2.6.1.
2.2 Set the Hadoop environment variables
Note that HADOOP_USER_NAME must be set to the user name used on the Hadoop cluster; otherwise you will get an org.apache.hadoop.security.AccessControlException. My cluster's user name is hadoop.
HADOOP_HOME=E:\java\hadoop-2.6.1
HADOOP_BIN_PATH=%HADOOP_HOME%\bin
HADOOP_PREFIX=%HADOOP_HOME%
Append %HADOOP_HOME%\bin;%HADOOP_HOME%\sbin; to Path
HADOOP_USER_NAME=hadoop
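If you would rather not rely on an environment variable, a common alternative (not part of the original setup) is to set HADOOP_USER_NAME as a JVM system property before any Hadoop client code runs, for example at the very top of the runner's main method:
// Alternative to the HADOOP_USER_NAME environment variable. This must run
// before the first Configuration/FileSystem/Job call, otherwise the Hadoop
// client has already resolved the local Windows user.
System.setProperty("HADOOP_USER_NAME", "hadoop");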
2.3 Configure host name mapping
Append the following three lines to the end of C:\Windows\System32\drivers\etc\hosts, matching the /etc/hosts configuration on the CentOS 6.5 nodes:
192.168.48.101 hdp-node-01
192.168.48.102 hdp-node-02
192.168.48.103 hdp-node-03
3. Set up the project
JDK installation is not covered in detail here; keep the local JDK version as close as possible to the one installed on the Hadoop cluster.
3.1 Create a new Maven project
3.2 Add the dependencies to pom.xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.1</version>
    </dependency>
</dependencies>
After saving, if the dependencies do not show up under External Libraries, the Event Log in the bottom-right corner displays "Maven projects need to be imported: Import Changes Enable Auto-Import"; click Import Changes.
3.3 Set up the configuration files
Copy the cluster's core-site.xml, mapred-site.xml and yarn-site.xml unchanged into the resources directory. My configuration files are listed below.
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hdp-node-01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/apps/hadoop-2.6.1/tmp</value>
</property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdp-node-01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Site specific YARN configuration properties -->
</configuration>
log4j.properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
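As a quick sanity check that the client is really picking these files up from the classpath (rather than falling back to the local defaults), you can print a key before submitting anything. A minimal sketch, assuming the resources directory is on the runtime classpath:
import org.apache.hadoop.conf.Configuration;
public class ConfigCheck {
    public static void main(String[] args) {
        // A plain Configuration loads core-site.xml from the classpath.
        Configuration conf = new Configuration();
        // Should print hdfs://hdp-node-01:9000, not the local default file:///
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}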
3.4 Write the program
WordCountMapper.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split each input line on spaces and emit (word, 1) for every token.
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
WordCountReducer.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
WordCountRunner.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Run on the cluster through YARN instead of the local runner.
        config.set("mapreduce.framework.name", "yarn");
        // Cross-platform submission (Windows client, Linux cluster). Without this
        // line the job fails with "/bin/bash: line 0: fg: no job control"; many
        // answers online blame the Linux/Windows mismatch and suggest patching
        // YarnRunner.java, but setting this property is enough.
        config.set("mapreduce.app-submission.cross-platform", "true");
        // Path to the exported job jar that will be shipped to the cluster.
        config.set("mapreduce.job.jar", "D:\\wordcount\\out\\artifacts\\wordcount_jar\\wordcount.jar");

        Job job = Job.getInstance(config);
        job.setJarByClass(WordCountRunner.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations on HDFS; the output directory gets a
        // timestamp suffix so repeated runs do not collide.
        FileInputFormat.setInputPaths(job, "hdfs://hdp-node-01:9000/wordcount/input/somewords.txt");
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy_MM_dd_HH_mm_ss");
        FileOutputFormat.setOutputPath(job, new Path("hdfs://hdp-node-01:9000/wordcount/output/"
                + simpleDateFormat.format(new Date(System.currentTimeMillis()))));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
Note that the mapreduce.job.jar property must point to the path of the exported job jar.
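Hard-coding the jar path works, but it breaks as soon as the artifact location changes. A small variation on the runner above (my addition, not from the original setup) is to take the jar path as the first program argument, replacing the hard-coded config.set("mapreduce.job.jar", ...) line in WordCountRunner.main:
// Hypothetical variation: read the job jar path from the command line, e.g.
//   WordCountRunner D:\wordcount\out\artifacts\wordcount_jar\wordcount.jar
if (args.length < 1) {
    System.err.println("Usage: WordCountRunner <path-to-job-jar>");
    System.exit(2);
}
config.set("mapreduce.job.jar", args[0]);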
3.5 Export the jar
Click File -> Project Structure
Make sure the Build on make option is checked. The mapreduce.job.jar path in 3.4 must share its prefix with the Output directory configured here.
Finally, click Build -> Build Artifacts -> Build; the out directory is generated under the project root.
3.6 Run the program
On a successful run the console shows:
16:44:07,037 | WARN | main | NativeCodeLoader | che.hadoop.util.NativeCodeLoader 62 | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16:44:11,203 | INFO | main | RMProxy | pache.hadoop.yarn.client.RMProxy 98 | Connecting to ResourceManager at hdp-node-01/192.168.48.101:8032
16:44:13,785 | WARN | main | JobResourceUploader | op.mapreduce.JobResourceUploader 64 | Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16:44:17,581 | INFO | main | FileInputFormat | reduce.lib.input.FileInputFormat 281 | Total input paths to process : 1
16:44:18,055 | INFO | main | JobSubmitter | he.hadoop.mapreduce.JobSubmitter 199 | number of splits:1
16:44:18,780 | INFO | main | JobSubmitter | he.hadoop.mapreduce.JobSubmitter 288 | Submitting tokens for job: job_1506933793385_0001
16:44:20,138 | INFO | main | YarnClientImpl | n.client.api.impl.YarnClientImpl 251 | Submitted application application_1506933793385_0001
16:44:20,307 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1301 | The url to track the job: http://hdp-node-01:8088/proxy/application_1506933793385_0001/
16:44:20,309 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1346 | Running job: job_1506933793385_0001
16:45:03,829 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1367 | Job job_1506933793385_0001 running in uber mode : false
16:45:03,852 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 0% reduce 0%
16:45:40,267 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 100% reduce 0%
16:46:08,081 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1374 | map 100% reduce 100%
16:46:09,121 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1385 | Job job_1506933793385_0001 completed successfully
16:46:09,562 | INFO | main | Job | org.apache.hadoop.mapreduce.Job 1392 | Counters: 49
File System Counters
FILE: Number of bytes read=256
FILE: Number of bytes written=212341
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=259
HDFS: Number of bytes written=152
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30792
Total time spent by all reduces in occupied slots (ms)=24300
Total time spent by all map tasks (ms)=30792
Total time spent by all reduce tasks (ms)=24300
Total vcore-seconds taken by all map tasks=30792
Total vcore-seconds taken by all reduce tasks=24300
Total megabyte-seconds taken by all map tasks=31531008
Total megabyte-seconds taken by all reduce tasks=24883200
Map-Reduce Framework
Map input records=1
Map output records=18
Map output bytes=214
Map output materialized bytes=256
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=15
Reduce shuffle bytes=256
Reduce input records=18
Reduce output records=15
Spilled Records=36
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=533
CPU time spent (ms)=5430
Physical memory (bytes) snapshot=311525376
Virtual memory (bytes) snapshot=1680896000
Total committed heap usage (bytes)=136122368
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=141
File Output Format Counters
Bytes Written=152
Process finished with exit code 0
4. FAQ
4.1 Permission problems
Exception in thread "main" org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=dvqfq6prcjdsh4p\hadoop, access=WRITE, inode="hadoop":hadoop:supergroup:rwxr-xr-x
Add the following to hdfs-site.xml:
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Also add HADOOP_USER_NAME=hadoop to the environment variables; see 2.2 for details.
4.2 Clock synchronization problems
Container launch failed for container_1506950816832_0005_01_000002 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1506954189368 found 1506953252362
Synchronize the clocks of the datanodes and the namenode: run ntpdate time.nist.gov on every server and confirm that the time is in sync.
It is best to also add a line to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w
4.3 fg: no job control
Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no job control
The jar path is wrong; check the mapreduce.job.jar setting.