eclipse와 maven을 이용한 hadoop 테스트 환경 설정
1. 사전 준비사항
- java 8 (혹은 8 이후의 버전) 설치
- eclipse (maven plugin 포함) 설치
- windows 10의 경우 환경 설정 필요함 (본 페이지 마지막 항목 참조)
2. pom.xml에 mapreduce dependency 추가
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
<version>3.0.0</version>
</dependency>
</dependencies>
3. MapReduce 코드 작성
wordcount (unigram) 예제코드 (WordCount.java
)
package kmu.bigdata.wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool{
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCount(), args);
}
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable>{
Text word = new Text();
IntWritable one = new IntWritable(1);
@Override
protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
StringTokenizer st = new StringTokenizer(value.toString());
while(st.hasMoreTokens()) {
word.set(st.nextToken());
context.write(word, one);
}
}
}
public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
IntWritable oval = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
int sum = 0;
for(IntWritable value : values) {
sum += value.get();
}
oval.set(sum);
context.write(key, oval);
}
}
public int run(String[] args) throws Exception {
Job myjob = Job.getInstance(getConf());
myjob.setJarByClass(WordCount.class);
myjob.setMapperClass(WCMapper.class);
myjob.setReducerClass(WCReducer.class);
myjob.setMapOutputKeyClass(Text.class);
myjob.setMapOutputValueClass(IntWritable.class);
myjob.setOutputFormatClass(TextOutputFormat.class);
myjob.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(myjob, new Path(args[0]));
FileOutputFormat.setOutputPath(myjob, new Path(args[0]).suffix(".out"));
myjob.waitForCompletion(true);
return 0;
}
}
Test 코드 (WordCountTest.java
)
package kmu.bigdata.wordcount;
import org.apache.hadoop.util.ToolRunner;
public class WordCountTest {
public static void main(String[] args) throws Exception {
ToolRunner.run(new WordCount(), new String[] {"src/test/resources/testfile.txt"});
}
}
4. Windows에서 설정
- https://github.com/steveloughran/winutils 에서 hadoop-3.x.x 폴더째로 다운로드
- https://wiki.apache.org/hadoop/WindowsProblems 참조
- 1.에서 다운받은 폴더를 적절한 경로에 위치시키고 (예: c:\hadoop-3.x.x) 환경 변수 설정
- 환경변수
HADOOP_HOME
를 다운받은 폴더의 경로로 설정 - 환경변수 PATH에
%HADOOP_HOME%\bin
추가
- 환경변수
ExitCodeException exitCode=-1073741515
에러 발생시- Visual C++ 2010 Redistributable Package 설치하면 에러 해결됨
- 참조1: https://stackoverflow.com/questions/45947375/why-does-starting-a-streaming-query-lead-to-exitcodeexception-exitcode-1073741
- 참조2 (Visual C++ 2010 Redistributable Package 다운로드 경로): https://answers.microsoft.com/en-us/insider/forum/insider_wintp-insider_repair/how-do-i-fix-this-error-msvcp100dll-is-missing/c167d686-044e-44ab-8e8f-968fac9525c5?auth=1