A Python Framework Guide for Hadoop: Writing Hadoop MapReduce Programs with the Mrjob Framework (Based on Hadoop 2.5.2)
Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79056120
I. Environment Preparation
To learn how to write MapReduce programs in native Python, or how to set up the Hadoop environment, see the earlier post "Python: Writing Hadoop MapReduce Programs in Native Python (Based on Hadoop 2.5.2)".
Mrjob (http://pythonhosted.org/mrjob/index.html) is an open-source Python framework for writing MapReduce jobs. It is essentially a wrapper around the Hadoop Streaming command line, so developers never have to touch the streaming commands directly, which makes writing MapReduce jobs easier and faster. Mrjob has the following characteristics:
1) Concise code: the map and reduce functions fit in a single Python file;
2) Support for multi-step MapReduce job workflows (a sketch of a two-step job appears after the word-count example below);
3) Support for multiple run modes, including inline, local, Hadoop, and remote Amazon;
4) Support for Amazon's hosted data-analysis service, Elastic MapReduce (EMR);
5) Convenient debugging that requires no Hadoop environment.
Installing Mrjob requires Python 2.5 or later; the source can be downloaded from https://github.com/yelp/mrjob
# pip install mrjob             # install with pip
# python setup.py install       # install from source
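To verify the installation, a minimal sanity check (not part of the original post) is to import the package from the target interpreter:

# verify_mrjob.py -- minimal post-install check (hedged sketch)
import mrjob

# Fall back to 'unknown' in case this release does not expose __version__
print('mrjob is importable, version: %s' % getattr(mrjob, '__version__', 'unknown'))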
II. Implementing MapReduce with Mrjob
This example again counts the frequency of every word in the text file /usr/local/python/source/input.txt. Mrjob implements the MR operations through its mapper() and reducer() methods; the code is as follows:
【/usr/local/python/source/word_count.py】
# -*- coding:UTF-8 -*-
'''
Created on January 14, 2018
@author: liuyazhuang
'''
from mrjob.job import MRJob


class MRWordCounter(MRJob):

    def mapper(self, key, line):
        # Emit (word, 1) for every word on this line of input
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # Sum all the counts emitted for this word
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()

As you can see, the code is only about a third as long as the native Python version, and the logic is clearer: it consists of just a mapper and a reducer. The mapper receives one line of input at a time and, for each word, emits a key-value pair with the value initialized to 1; the reducer receives the key-value pairs emitted by the mappers and, for each key, sums the values and emits the total. Mrjob relies on Python's yield mechanism, which turns these functions into generators; the framework drives the key-value initialization and aggregation by repeatedly calling next() on them.
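Feature 2 above, multi-step workflows, deserves a short illustration. The following is a hedged sketch rather than part of the original example; it assumes mrjob's MRStep class (available in mrjob 0.4 and later) and chains the word count with a second step that picks the most frequent word:

# -*- coding:UTF-8 -*-
# Hedged sketch of a two-step job: step 1 counts words, step 2 finds the
# most frequent word. Assumes mrjob.step.MRStep is available (mrjob 0.4+).
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRMostFrequentWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_words,
                   reducer=self.reducer_sum_counts),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_count_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_sum_counts(self, word, occurrences):
        # Re-key every total under None so the next step sees a single group
        yield None, (sum(occurrences), word)

    def reducer_find_max(self, _, count_word_pairs):
        # Emits one (count, word) pair -- the most frequent word
        yield max(count_word_pairs)


if __name__ == '__main__':
    MRMostFrequentWord.run()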
III. Running the MapReduce Job
1. Inline mode (-r inline): convenient for debugging; it launches a single process that simulates task execution and results. -r inline is the default and can be omitted, and the output file is specified with > output-file or -o output-file, so the following two invocations are equivalent:
python word_count.py -r inline input.txt > output.txt
python word_count.py input.txt > output.txt
Now run cat output.txt:
[root@liuyazhuang121 source]# cat output.txt
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"me" 1
"ooo" 2
"python" 2
"see" 1得出了正確結果。
2. Local mode (-r local): simulates Hadoop locally for debugging. The difference from inline mode is that it launches multiple processes, one per task. For example:
python word_count.py -r local input.txt > output1.txt
Now run cat output1.txt to view the results:
[root@liuyazhuang121 source]# cat output1.txt
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"me" 1
"ooo" 2
"python" 2
"see" 1得出了正確結果。
3. Hadoop mode (-r hadoop): runs the job in a Hadoop environment and supports Hadoop scheduling and control parameters, for example:
1) Setting the job scheduling priority (VERY_HIGH|HIGH), e.g. --jobconf mapreduce.job.priority=VERY_HIGH.
2) Limiting the number of map and reduce tasks, e.g. --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=5 (these settings can also be declared on the job class, as shown in the sketch below).
Note: the Hadoop environment variables must be configured before running.
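As an alternative to repeating --jobconf on the command line, the same settings can be baked into the job class itself. The following is a hedged sketch, not from the original post; it relies on MRJob's JOBCONF class attribute, which mrjob forwards to Hadoop as job configuration:

# A hedged sketch: declare Hadoop job configuration inside the job class
from mrjob.job import MRJob


class MRWordCounterTuned(MRJob):

    # Equivalent to passing --jobconf key=value for every run of this job
    JOBCONF = {
        'mapreduce.job.priority': 'VERY_HIGH',
        'mapreduce.map.tasks': 2,
        'mapreduce.reduce.tasks': 1,
    }

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounterTuned.run()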
In this example we again use the file /user/root/word/input.txt in HDFS; the full command is:
python word_count.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/hadoop_word hdfs://liuyazhuang121:9000/user/root/word
The output is as follows:
[root@liuyazhuang121 source]#python word_count.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/hadoop_word hdfs://liuyazhuang121:9000/user/root/word
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/local/hadoop-2.5.2/bin/hadoop
Using Hadoop version 2.5.2
Looking for Hadoop streaming jar in /usr/local/hadoop-2.5.2...
Found Hadoop streaming jar: /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
Creating temp directory /tmp/word_count.root.20180114.050606.032324
Copying local files to hdfs:///user/root/tmp/mrjob/word_count.root.20180114.050606.032324/files/...
Running step 1 of 1...
packageJobJar: [/usr/local/hadoop-2.5.2/tmp/hadoop-unjar2522703497090634857/] [] /tmp/streamjob1355851303293562830.jar tmpDir=null
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Total input paths to process : 1
number of splits:2
Submitting tokens for job: job_1515893542122_0003
Submitted application application_1515893542122_0003
The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0003/
Running job: job_1515893542122_0003
Job job_1515893542122_0003 running in uber mode : false
map 0% reduce 0%
map 33% reduce 0%
map 100% reduce 0%
map 100% reduce 100%
Job job_1515893542122_0003 completed successfully
Output directory: hdfs://liuyazhuang121:9000/output/hadoop_word
Counters: 49
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=262
File System Counters
FILE: Number of bytes read=486
FILE: Number of bytes written=305876
FILE: Number of large read operations=0
FILE: Number of read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=529
HDFS: Number of bytes written=262
HDFS: Number of large read operations=0
HDFS: Number of read operations=9
HDFS: Number of write operations=2
Job Counters
Data-local map tasks=2
Launched map tasks=2
Launched reduce tasks=1
Total megabyte-seconds taken by all map tasks=23237632
Total megabyte-seconds taken by all reduce tasks=11787264
Total time spent by all map tasks (ms)=22693
Total time spent by all maps in occupied slots (ms)=22693
Total time spent by all reduce tasks (ms)=11511
Total time spent by all reduces in occupied slots (ms)=11511
Total vcore-seconds taken by all map tasks=22693
Total vcore-seconds taken by all reduce tasks=11511
Map-Reduce Framework
CPU time spent (ms)=3150
Combine input records=0
Combine output records=0
Failed Shuffles=0
GC time elapsed (ms)=149
Input split bytes=206
Map input records=1
Map output bytes=392
Map output materialized bytes=492
Map output records=44
Merged Map outputs=2
Physical memory (bytes) snapshot=611057664
Reduce input groups=30
Reduce input records=44
Reduce output records=30
Reduce shuffle bytes=492
Shuffled Maps =2
Spilled Records=88
Total committed heap usage (bytes)=429916160
Virtual memory (bytes) snapshot=2661163008
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Streaming final output from hdfs://liuyazhuang121:9000/output/hadoop_word...
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"me" 1
"ooo" 2
"python" 2
"see" 1
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/word_count.root.20180114.050606.032324...
Removing temp directory /tmp/word_count.root.20180114.050606.032324...
The output shows the frequency of every word. Now run:
hadoop fs -ls /output/hadoop_word
to list the generated files:
[root@liuyazhuang121 source]# hadoop fs -ls /output/hadoop_word
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-01-14 13:06 /output/hadoop_word/_SUCCESS
-rw-r--r-- 1 root supergroup 262 2018-01-14 13:06 /output/hadoop_word/part-00000
Then run:
hadoop fs -cat /output/hadoop_word/part-00000
to view the output:
[root@liuyazhuang121 source]# hadoop fs -cat /output/hadoop_word/part-00000
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"me" 1
"ooo" 2
"python" 2
"see" 1
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2我們可以看出,輸出了正確的結果。