Today I finished setting up Hadoop on the lab machine. Since I expect to do some development in C++ down the road, I wrote a little C++ test program modeled on the word count example (pretty crude - -!!).
Here's the code:
Mapper:
// c++ map reduce Mapper
// word count example
// 2008.4.18
// by iveney
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string buf;
    // Emit "<token>\t1" for every whitespace-delimited token on stdin.
    while( cin >> buf )
        cout << buf << "\t" << "1" << endl;
    return 0;
}
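One caveat: this mapper emits tokens exactly as they appear in the text, so "Word" and "word," are counted separately from "word", which inflates the vocabulary quite a bit on the Gutenberg books below. A variant along these lines would merge them; the lowercase-and-strip-punctuation rule is just my own illustrative choice, not part of the word count example:
// Hypothetical mapper variant: normalize each token (lowercase, keep only
// alphanumeric characters) before emitting it, so that "Word," and "word"
// collapse into the same key. The normalization rule is an illustrative
// assumption, not something the original example does.
#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string buf;
    while( cin >> buf )
    {
        string word;
        for( string::size_type i = 0; i < buf.size(); ++i )
            if( isalnum( (unsigned char)buf[i] ) )
                word += (char)tolower( (unsigned char)buf[i] );
        if( !word.empty() )
            cout << word << "\t" << "1" << endl;
    }
    return 0;
}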
Reducer:
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main()
{
    // Accumulate counts per word; std::map also keeps the keys sorted,
    // so the output comes out in alphabetical order.
    map<string,int> dict;
    string word;
    int count;
    while( cin >> word >> count )
        dict[word] += count;
    for( map<string,int>::iterator iter = dict.begin();
         iter != dict.end(); ++iter )
        cout << iter->first << "\t" << iter->second << endl;
    return 0;
}
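The std::map approach works, but it holds the whole dictionary in memory. Since Hadoop sorts the map output by key before it reaches the reducer, each word's counts arrive as one contiguous run, so in principle the reducer only needs to remember the current run. A sketch of that constant-memory style (it relies on the sorted-input guarantee, so for a local pipe test you'd need an intermediate sort, e.g. ./mapper < file | sort | ./reducer):
// Constant-memory reducer sketch: assumes the input is sorted by key, as
// Hadoop guarantees for reducer input, so each word's counts arrive as one
// contiguous run and no dictionary is needed.
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string word, current;
    int count, total = 0;
    while( cin >> word >> count )
    {
        if( word != current )
        {
            // A new run begins: flush the finished word, if any.
            if( !current.empty() )
                cout << current << "\t" << total << endl;
            current = word;
            total = 0;
        }
        total += count;
    }
    // Flush the final run.
    if( !current.empty() )
        cout << current << "\t" << total << endl;
    return 0;
}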
Compile:
[hadoop@master ~]$ g++ mapper.cpp -o mapper
[hadoop@master ~]$ g++ reducer.cpp -o reducer
A quick test. This pipe skips the sort that Hadoop performs between map and reduce; it still works because the reducer aggregates everything in a std::map instead of relying on sorted input:
[hadoop@master ~]$ echo "ge abc ab df ge" | ./mapper | ./reducer
ab 1
abc 1
df 1
ge 2
Running it on Hadoop:
(The data is a few e-books from Project Gutenberg, though the site seems to be blocked in mainland China? www.gutenberg.org
Fortunately, UPenn's library has the resources ^^V http://digital.library.upenn.edu/webbin/gutbook )
[hadoop@master ~]$ ll -h gutenberg
total 5.3M
-rw-r--r-- 1 hadoop hadoop 336K 2008-04-18 07:45 132.txt
-rw-r--r-- 1 hadoop hadoop 154K 2008-04-18 07:45 1ws1510.txt
-rw-r--r-- 1 hadoop hadoop 659K 2008-04-18 07:45 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1.4M 2008-04-18 07:45 7ldvc09.txt
-rw-r--r-- 1 hadoop hadoop 577K 2008-04-18 07:45 advsh12.txt
-rw-r--r-- 1 hadoop hadoop 377K 2008-04-18 07:45 dvldc10.txt
-rw-r--r-- 1 hadoop hadoop 185K 2008-04-18 07:45 frhnt10.txt
-rw-r--r-- 1 hadoop hadoop 152K 2008-04-18 07:45 rgsyn10.txt
-rw-r--r-- 1 hadoop hadoop 1.5M 2008-04-18 07:45 ulyss12.txt
Running the job (note that -mapper and -reducer are given as absolute local paths, so the compiled binaries must already exist at those paths on every node; streaming's -file option could ship them with the job instead):
[hadoop@master hadoop]$ bin/hadoop jar contrib/streaming/hadoop-0.16.3-streaming.jar -mapper /home/hadoop/mapper -reducer /home/hadoop/reducer -input gutenberg/* -output gutenberg-c++-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/usr/share/hadoop-datastore/hadoop-hadoop/hadoop-unjar41914/] [] /tmp/streamjob41915.jar tmpDir=null
08/04/18 01:16:13 INFO mapred.FileInputFormat: Total input paths to process : 9
08/04/18 01:16:13 INFO streaming.StreamJob: getLocalDirs(): [/usr/share/hadoop-datastore/hadoop-hadoop/mapred/local]
08/04/18 01:16:13 INFO streaming.StreamJob: Running job: job_200804171952_0005
08/04/18 01:16:13 INFO streaming.StreamJob: To kill this job, run:
08/04/18 01:16:13 INFO streaming.StreamJob: /usr/share/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_200804171952_0005
08/04/18 01:16:13 INFO streaming.StreamJob: Tracking URL: http://master.cluster:50030/jobdetails.jsp?jobid=job_200804171952_0005
08/04/18 01:16:14 INFO streaming.StreamJob: map 0% reduce 0%
08/04/18 01:16:17 INFO streaming.StreamJob: map 11% reduce 0%
08/04/18 01:16:18 INFO streaming.StreamJob: map 22% reduce 0%
08/04/18 01:16:19 INFO streaming.StreamJob: map 67% reduce 0%
08/04/18 01:16:20 INFO streaming.StreamJob: map 78% reduce 0%
08/04/18 01:16:22 INFO streaming.StreamJob: map 100% reduce 0%
08/04/18 01:16:23 INFO streaming.StreamJob: map 100% reduce 4%
08/04/18 01:16:26 INFO streaming.StreamJob: map 100% reduce 26%
08/04/18 01:16:33 INFO streaming.StreamJob: map 100% reduce 30%
08/04/18 01:16:36 INFO streaming.StreamJob: map 100% reduce 33%
08/04/18 01:16:38 INFO streaming.StreamJob: map 100% reduce 71%
08/04/18 01:16:41 INFO streaming.StreamJob: map 100% reduce 80%
08/04/18 01:16:43 INFO streaming.StreamJob: map 100% reduce 100%
08/04/18 01:16:43 INFO streaming.StreamJob: Job complete: job_200804171952_0005
08/04/18 01:16:43 INFO streaming.StreamJob: Output: gutenberg-c++-output
The output matches what the bundled example program produces. Success!
The job took 29 secs, which is actually 5 secs slower than the Python version of the example.
Most likely the data is small enough that the JVM startup cost is no longer trivial -_-|||
Also, just running the binaries directly is much faster... heh... the data set is simply too small. (Note that the script below reduces each file separately and appends the results, so unlike the Hadoop job it doesn't merge counts across files; it's only a rough timing comparison.)
[hadoop@master ~]$ cat test.sh
#!/bin/bash
cd /home/hadoop/gutenberg
for i in `ls`
do
    cat $i | /home/hadoop/mapper | /home/hadoop/reducer \
        >> /home/hadoop/output
done
[hadoop@master ~]$ time ./test.sh
real 0m4.176s
user 0m5.454s
sys 0m1.611s