Implementing MapReduce WordCount in Python - 白红宇


Published: 2019-05-23


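Introduction

Hadoop is a project of the Apache Software Foundation that addresses the long processing times of big data workloads, and the MapReduce parallel processing framework is one of its core components. Because Hadoop itself is implemented in Java, most big data jobs are written in Java. Python, however, is the more convenient language for deep learning and data mining, so anyone who wants to bring those algorithms into MapReduce will want to write the jobs in Python. With that in mind, this article walks through a WordCount experiment on MapReduce implemented in Python. The content of this article (the code in particular) comes from a CSDN blog post by another author; the reference link is at the end.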

Hadoop Streaming

The experiment relies mainly on Hadoop Streaming, which ships with Hadoop, so here is a brief introduction to it first.

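What Streaming is good for

The biggest advantage of the Hadoop Streaming framework is that map and reduce programs written in any language can run on a Hadoop cluster. A map/reduce program only needs to read from standard input (stdin) and write to standard output (stdout).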

Second, it makes single-machine debugging easy: chaining the programs with pipes simulates streaming, so a map/reduce program can be debugged locally:
# cat inputfile | mapper | sort | reducer > output
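The same pipeline can also be imitated inside a single Python process, which is convenient for unit tests. The following is only a minimal sketch, assuming Python 3; the function names mapper, reducer and run_local are illustrative and not part of Hadoop, and the built-in sorted() stands in for the sort step of the pipe.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    current_word, current_count = None, 0
    for record in sorted_records:
        word, count = record.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)


def run_local(lines):
    """In-process stand-in for: cat inputfile | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


for record in run_local(["foo foo quux labs foo bar quux"]):
    print(record)
```

Because the reducer only compares each record with the previous one, it depends on the sorted() call in the same way a real streaming reducer depends on Hadoop sorting the map output by key.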

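Finally, the streaming framework exposes a rich set of job-submission parameters. They are set directly as streaming options, with no Java code to modify, and many of MapReduce's advanced features can be reached simply by adjusting them.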

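Limitations of Streaming

By default, Streaming can only process text data (Textfile). For binary data, a reasonable approach is to base64-encode the binary key and value so that they become text.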
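As a rough illustration of the base64 approach, the sketch below turns a binary key/value pair into one tab-separated text line and back again. It assumes Python 3 and uses only the standard base64 module; encode_record and decode_record are hypothetical helper names, not part of Hadoop Streaming.

```python
import base64


def encode_record(key, value):
    """Turn a binary key/value pair into one tab-separated text line."""
    return "%s\t%s" % (
        base64.b64encode(key).decode("ascii"),
        base64.b64encode(value).decode("ascii"),
    )


def decode_record(line):
    """Recover the original bytes from a line built by encode_record."""
    key_text, value_text = line.rstrip("\n").split("\t", 1)
    return base64.b64decode(key_text), base64.b64decode(value_text)


# A key containing a tab and a newline would break a plain text record,
# but its base64 form contains neither character.
line = encode_record(b"\x00\tid\n", bytes(range(256)))
print(decode_record(line) == (b"\x00\tid\n", bytes(range(256))))
```

Identical keys still encode to identical strings, so they are grouped together as usual, but the sort order of the encoded keys is not the byte order of the original keys.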

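Both the mapper and the reducer have to convert data on the way in (standard input) and on the way out (standard output), which involves copying and parsing data and adds some overhead.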

Streaming command parameters

# hadoop jar hadoop-streaming-2.6.5.jar \
      [generic options] [streaming options]
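To make the ordering concrete, here is a small sketch that assembles such a command line as an argument list, with the generic options placed before the streaming options. It assumes Python 3; build_streaming_command is a hypothetical helper, and the property value mapreduce.job.reduces=2 is only an illustrative generic option, not something the original post sets.

```python
def build_streaming_command(jar, generic_options, streaming_options):
    """Assemble the argv list: generic options go before streaming options."""
    return ["hadoop", "jar", jar] + list(generic_options) + list(streaming_options)


command = build_streaming_command(
    "hadoop-streaming-2.6.5.jar",
    ["-D", "mapreduce.job.reduces=2"],  # generic option (illustrative value)
    [
        "-input", "/ooxx/*",
        "-output", "/ooxx/output/",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
    ],
)
print(" ".join(command))
```

The resulting list could be passed to subprocess.run on a machine where the hadoop command is available; here it is only printed.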

For the generic options and the streaming options, see the following URL:

Implementing MapReduce WordCount in Python

First, write the mapper.py script:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (print() with one argument works on Python 2 and 3)
        print('%s\t%s' % (word, 1))

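This script does not compute the total number of occurrences of each word. It outputs "<word> 1" immediately for every occurrence, even though a word may appear many times in the input; the summing is left to the later Reduce step (or program). Remember to give mapper.py execute permission: chmod 777 mapper.py (chmod +x mapper.py is enough to make it executable).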

The reducer.py script

#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # skip blank or malformed lines that have no tab separator
    if '\t' not in line:
        continue

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking current_word rather than word keeps the last total
# even when the final input line was discarded)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

Save the code as /usr/local/hadoop/reducer.py. It reads the output of mapper.py from STDIN, sums the occurrences of each word, and writes the result to STDOUT.
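The summing step can also be written with itertools.groupby from the standard library, which groups consecutive records that share a key. This is an alternative sketch rather than the script used in this article; it assumes Python 3 and input already sorted by word, and the names parse and reduce_sorted are illustrative.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping lines that are malformed."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.strip().isdigit():
            yield word, int(count)


def reduce_sorted(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# In a real reducer the lines would come from sys.stdin;
# a small list is used here so the example runs on its own.
sample = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n", "quux\t1\n", "quux\t1\n"]
for word, total in reduce_sorted(sample):
    print("%s\t%d" % (word, total))
```

groupby only merges adjacent items, so this version relies on sorted input for the same reason the comparison with current_word does.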

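As before, mind the script's permissions: chmod 777 reducer.py

It is a good idea to check that the scripts produce correct results before running the MapReduce job: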

root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo      1
foo      1
quux     1
labs     1
foo      1
bar      1
quux     1
root@localhost:/root/pythonHadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
bar     1
foo     3
labs    1
quux    2

If the output matches the above, the scripts work and the MapReduce job can be run.
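The comparison by eye can also be automated by checking the reducer output against counts computed directly with collections.Counter. This is a sketch assuming Python 3; expected_counts and parse_output are hypothetical helper names, and the reducer output is pasted in as a string rather than captured from the scripts.

```python
from collections import Counter


def expected_counts(text):
    """Reference word counts computed directly, without MapReduce."""
    return dict(Counter(text.split()))


def parse_output(output):
    """Parse reducer output lines of the form 'word<whitespace>count'."""
    counts = {}
    for line in output.splitlines():
        if line.strip():
            word, count = line.split()
            counts[word] = int(count)
    return counts


sample = "foo foo quux labs foo bar quux"
reducer_output = "bar\t1\nfoo\t3\nlabs\t1\nquux\t2\n"
print(parse_output(reducer_output) == expected_counts(sample))
```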

Running the Python scripts on Hadoop:

[root@node01 pythonHadoop]# hadoop jar contrib/hadoop-streaming-2.6.5.jar \
    -mapper mapper.py \
    -file mapper.py \
    -reducer reducer.py \
    -file reducer.py \
    -input /ooxx/* \
    -output /ooxx/output/

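Finally, run hdfs dfs -cat /ooxx/output/part-00000 to view the output.
The results are not shown here. The hello.txt input file can be created with echo, or a test file can be downloaded from the web; the results will differ from one dataset to another.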

References:

Reposted from: http://qscrn.baihongyu.com/
