Saturday, April 19, 2008

Writing a Hadoop MapReduce Program in C++

I got Hadoop configured on one of the lab machines today. Since I expect to do some development in C++ going forward, I wrote a quick C++ test program modeled on the word count example (it's pretty bare-bones - -!!).

Here's the code:
Mapper:
// c++ map reduce Mapper
// word count example
// 2008.4.18
// by iveney
#include <string>
#include <iostream>
using namespace std;

int main()
{
        string buf;
        while( cin>>buf )
                cout<<buf<<"\t"<<"1"<<endl;
        return 0;
}

Reducer:
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main()
{
        // Accumulate the counts emitted by the mapper, then print
        // each word with its total.
        map<string,int> dict;
        map<string,int>::iterator iter;
        string word;
        int count;
        while( cin>>word>>count )
                dict[word] += count;
        for( iter = dict.begin(); iter != dict.end(); ++iter )
                cout<<iter->first<<"\t"<<iter->second<<endl;
        return 0;
}

Compile:
[hadoop@master ~]$ g++ mapper.cpp -o mapper
[hadoop@master ~]$ g++ reducer.cpp -o reducer

A quick local test:
[hadoop@master ~]$ echo "ge abc ab  df ge" | ./mapper | ./reducer
ab      1
abc     1
df      1
ge      2

Running it on Hadoop:
(For data I used a few e-books from Project Gutenberg, though the site seems to be blocked in mainland China? www.gutenberg.org
Fortunately, UPenn's library has the resources ^^V http://digital.library.upenn.edu/webbin/gutbook )

[hadoop@master ~]$ ll -h gutenberg
total 5.3M
-rw-r--r-- 1 hadoop hadoop 336K 2008-04-18 07:45 132.txt
-rw-r--r-- 1 hadoop hadoop 154K 2008-04-18 07:45 1ws1510.txt
-rw-r--r-- 1 hadoop hadoop 659K 2008-04-18 07:45 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1.4M 2008-04-18 07:45 7ldvc09.txt
-rw-r--r-- 1 hadoop hadoop 577K 2008-04-18 07:45 advsh12.txt
-rw-r--r-- 1 hadoop hadoop 377K 2008-04-18 07:45 dvldc10.txt
-rw-r--r-- 1 hadoop hadoop 185K 2008-04-18 07:45 frhnt10.txt
-rw-r--r-- 1 hadoop hadoop 152K 2008-04-18 07:45 rgsyn10.txt
-rw-r--r-- 1 hadoop hadoop 1.5M 2008-04-18 07:45 ulyss12.txt


Running the job:
[hadoop@master hadoop]$ bin/hadoop jar contrib/streaming/hadoop-0.16.3-streaming.jar -mapper /home/hadoop/mapper -reducer /home/hadoop/reducer -input gutenberg/* -output gutenberg-c++-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/share/hadoop-datastore/hadoop-hadoop/hadoop-unjar41914/] [] /tmp/streamjob41915.jar tmpDir=null
08/04/18 01:16:13 INFO mapred.FileInputFormat: Total input paths to process : 9
08/04/18 01:16:13 INFO streaming.StreamJob: getLocalDirs(): [/usr/share/hadoop-datastore/hadoop-hadoop/mapred/local]
08/04/18 01:16:13 INFO streaming.StreamJob: Running job: job_200804171952_0005
08/04/18 01:16:13 INFO streaming.StreamJob: To kill this job, run:
08/04/18 01:16:13 INFO streaming.StreamJob: /usr/share/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_200804171952_0005
08/04/18 01:16:13 INFO streaming.StreamJob: Tracking URL: http://master.cluster:50030/jobdetails.jsp?jobid=job_200804171952_0005
08/04/18 01:16:14 INFO streaming.StreamJob:  map 0%  reduce 0%
08/04/18 01:16:17 INFO streaming.StreamJob:  map 11%  reduce 0%
08/04/18 01:16:18 INFO streaming.StreamJob:  map 22%  reduce 0%
08/04/18 01:16:19 INFO streaming.StreamJob:  map 67%  reduce 0%
08/04/18 01:16:20 INFO streaming.StreamJob:  map 78%  reduce 0%
08/04/18 01:16:22 INFO streaming.StreamJob:  map 100%  reduce 0%
08/04/18 01:16:23 INFO streaming.StreamJob:  map 100%  reduce 4%
08/04/18 01:16:26 INFO streaming.StreamJob:  map 100%  reduce 26%
08/04/18 01:16:33 INFO streaming.StreamJob:  map 100%  reduce 30%
08/04/18 01:16:36 INFO streaming.StreamJob:  map 100%  reduce 33%
08/04/18 01:16:38 INFO streaming.StreamJob:  map 100%  reduce 71%
08/04/18 01:16:41 INFO streaming.StreamJob:  map 100%  reduce 80%
08/04/18 01:16:43 INFO streaming.StreamJob:  map 100%  reduce 100%
08/04/18 01:16:43 INFO streaming.StreamJob: Job complete: job_200804171952_0005
08/04/18 01:16:43 INFO streaming.StreamJob: Output: gutenberg-c++-output

The output matches what the bundled example program produces ~ success ~

It took 29 secs, which is actually 5 secs slower than the Python version of the example.
Most likely, at this data size, the JVM startup overhead is no longer trivial -_-|||

Incidentally, running the programs directly is much faster... heh... given the small data size:
[hadoop@master ~]$ cat test.sh
#!/bin/bash

cd /home/hadoop/gutenberg
for i in *
do
        cat "$i" | /home/hadoop/mapper | /home/hadoop/reducer \
             >> /home/hadoop/output
done
[hadoop@master ~]$ time ./test.sh

real    0m4.176s
user    0m5.454s
sys     0m1.611s
