eclipse와 maven을 이용한 hadoop 테스트 환경 설정

1. 사전 준비사항

  1. java 8 (혹은 8 이후의 버전) 설치
  2. eclipse (maven plugin 포함) 설치
  3. windows 10의 경우 환경 설정 필요함 (본 페이지 마지막 항목 참조)

2. pom.xml에 mapreduce dependency 추가

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>

3. MapReduce 코드 작성

wordcount (unigram) 예제코드 (WordCount.java)

package kmu.bigdata.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool{
    
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new WordCount(), args);
    }
    
    public static class WCMapper extends Mapper<Object, Text, Text, IntWritable>{

        Text word = new Text();
        IntWritable one = new IntWritable(1);
        
        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            
            StringTokenizer st = new StringTokenizer(value.toString());
            
            while(st.hasMoreTokens()) {
                word.set(st.nextToken());
                context.write(word, one);
            }
            
        }
        
    }
    
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

        IntWritable oval = new IntWritable();
        
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            
            int sum = 0;
            for(IntWritable value : values) {
                sum += value.get();
            }
            oval.set(sum);
            
            context.write(key, oval);
        }
        
    }

    public int run(String[] args) throws Exception {
        
        Job myjob = Job.getInstance(getConf());
        myjob.setJarByClass(WordCount.class);
        myjob.setMapperClass(WCMapper.class);
        myjob.setReducerClass(WCReducer.class);
        myjob.setMapOutputKeyClass(Text.class);
        myjob.setMapOutputValueClass(IntWritable.class);
        myjob.setOutputFormatClass(TextOutputFormat.class);
        myjob.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(myjob, new Path(args[0]));
        FileOutputFormat.setOutputPath(myjob, new Path(args[0]).suffix(".out"));
        
        myjob.waitForCompletion(true);
        
        return 0;
    }
    
}

Test 코드 (WordCountTest.java)

package kmu.bigdata.wordcount;

import org.apache.hadoop.util.ToolRunner;

public class WordCountTest {
	public static void main(String[] args) throws Exception {
		ToolRunner.run(new WordCount(), new String[] {"src/test/resources/testfile.txt"});
	}
}

4. Windows에서 설정

  1. https://github.com/steveloughran/winutils 에서 hadoop-3.x.x 폴더째로 다운로드
    • https://wiki.apache.org/hadoop/WindowsProblems 참조
  2. 1.에서 다운받은 폴더를 적절한 경로에 위치시키고 (예: c:\hadoop-3.x.x) 환경 변수 설정
    • 환경변수 HADOOP_HOME를 다운받은 폴더의 경로로 설정
    • 환경변수 PATH에 %HADOOP_HOME%\bin 추가
  3. ExitCodeException exitCode=-1073741515 에러 발생시
    • Visual C++ 2010 Redistributable Package 설치하면 에러 해결됨
    • 참조1: https://stackoverflow.com/questions/45947375/why-does-starting-a-streaming-query-lead-to-exitcodeexception-exitcode-1073741
    • 참조2 (Visual C++ 2010 Redistributable Package 다운로드 경로): https://answers.microsoft.com/en-us/insider/forum/insider_wintp-insider_repair/how-do-i-fix-this-error-msvcp100dll-is-missing/c167d686-044e-44ab-8e8f-968fac9525c5?auth=1