A Python Framework Guide for Hadoop: Writing Hadoop MapReduce Programs with the Mrjob Framework (Based on Hadoop 2.5.2)
Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79056120
I. Environment Preparation
To learn how to write MapReduce programs in native Python, or how to set up the Hadoop environment, see the earlier post "Python: Writing Hadoop MapReduce Programs in Native Python (Based on Hadoop 2.5.2)".
Mrjob (http://pythonhosted.org/mrjob/index.html) is an open-source Python framework for writing MapReduce jobs. It is essentially a wrapper around the Hadoop Streaming command line, so developers never have to touch the streaming commands directly, which makes writing MapReduce jobs easier and faster. Mrjob has the following characteristics:
1) Concise code: the map and reduce functions fit in a single Python file;
2) Support for multi-step MapReduce job workflows (a sketch of a two-step job appears after the word-count example below);
3) Support for multiple run modes, including inline, local, Hadoop, and remote Amazon;
4) Support for Amazon's hosted data-analysis service, Elastic MapReduce (EMR);
5) Convenient debugging that requires no Hadoop environment.
Installing Mrjob requires Python 2.5 or later; the source can be downloaded from https://github.com/yelp/mrjob
# pip install mrjob             # install with pip
# python setup.py install       # install from source
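To verify the installation, a minimal sanity check (not part of the original post) is to import the package from the target interpreter:

# verify_mrjob.py -- minimal post-install check (hedged sketch)
import mrjob

# Fall back to 'unknown' in case this release does not expose __version__
print('mrjob is importable, version: %s' % getattr(mrjob, '__version__', 'unknown'))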
II. Implementing MapReduce with Mrjob
This example again counts the frequency of every word in the text file /usr/local/python/source/input.txt. Mrjob implements the MR operations through its mapper() and reducer() methods; the code is as follows:
【/usr/local/python/source/word_count.py】
# -*- coding:UTF-8 -*-
'''
Created on January 14, 2018
@author: liuyazhuang
'''
from mrjob.job import MRJob


class MRWordCounter(MRJob):

    def mapper(self, key, line):
        # Emit (word, 1) for every word on this line of input
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # Sum all the counts emitted for this word
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()

As you can see, the code is only about a third as long as the native Python version, and the logic is clearer: it consists of just a mapper and a reducer. The mapper receives one line of input at a time and, for each word, emits a key-value pair with the value initialized to 1; the reducer receives the key-value pairs emitted by the mappers and, for each key, sums the values and emits the total. Mrjob relies on Python's yield mechanism, which turns these functions into generators; the framework drives the key-value initialization and aggregation by repeatedly calling next() on them.
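Feature 2 above, multi-step workflows, deserves a short illustration. The following is a hedged sketch rather than part of the original example; it assumes mrjob's MRStep class (available in mrjob 0.4 and later) and chains the word count with a second step that picks the most frequent word:

# -*- coding:UTF-8 -*-
# Hedged sketch of a two-step job: step 1 counts words, step 2 finds the
# most frequent word. Assumes mrjob.step.MRStep is available (mrjob 0.4+).
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRMostFrequentWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_words,
                   reducer=self.reducer_sum_counts),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_count_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_sum_counts(self, word, occurrences):
        # Re-key every total under None so the next step sees a single group
        yield None, (sum(occurrences), word)

    def reducer_find_max(self, _, count_word_pairs):
        # Emits one (count, word) pair -- the most frequent word
        yield max(count_word_pairs)


if __name__ == '__main__':
    MRMostFrequentWord.run()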
III. Running the MapReduce Job
1. Inline mode (-r inline): convenient for debugging; it launches a single process that simulates task execution and results. -r inline is the default and can be omitted, and the output file is specified with > output-file or -o output-file, so the following two invocations are equivalent:
python word_count.py -r inline input.txt > output.txt
python word_count.py input.txt > output.txt
Now run cat output.txt:
[root@liuyazhuang121 source]# cat output.txt
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"me" 1
"ooo" 2
"python" 2
"see" 1得出了正確結果。
2. Local mode (-r local): simulates Hadoop locally for debugging. The difference from inline mode is that it launches multiple processes, one per task. For example:
python word_count.py -r local input.txt > output1.txt
Now run cat output1.txt to view the results:
[root@liuyazhuang121 source]# cat output1.txt
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"me" 1
"ooo" 2
"python" 2
"see" 1得出了正確結果。
3. Hadoop mode (-r hadoop): runs the job in a Hadoop environment and supports Hadoop scheduling and control parameters, for example:
1) Setting the job scheduling priority (VERY_HIGH|HIGH), e.g. --jobconf mapreduce.job.priority=VERY_HIGH.
2) Limiting the number of map and reduce tasks, e.g. --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=5 (these settings can also be declared on the job class, as shown in the sketch below).
Note: the Hadoop environment variables must be configured before running.
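As an alternative to repeating --jobconf on the command line, the same settings can be baked into the job class itself. The following is a hedged sketch, not from the original post; it relies on MRJob's JOBCONF class attribute, which mrjob forwards to Hadoop as job configuration:

# A hedged sketch: declare Hadoop job configuration inside the job class
from mrjob.job import MRJob


class MRWordCounterTuned(MRJob):

    # Equivalent to passing --jobconf key=value for every run of this job
    JOBCONF = {
        'mapreduce.job.priority': 'VERY_HIGH',
        'mapreduce.map.tasks': 2,
        'mapreduce.reduce.tasks': 1,
    }

    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounterTuned.run()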
In this example we again use the file /user/root/word/input.txt in HDFS; the full command is:
python word_count.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/hadoop_word hdfs://liuyazhuang121:9000/user/root/word
The output is as follows:
[root@liuyazhuang121 source]#python word_count.py -r hadoop --jobconf mapreduce.job.priority=VERY_HIGH --jobconf mapreduce.map.tasks=2 --jobconf mapreduce.reduce.tasks=1 -o hdfs://liuyazhuang121:9000/output/hadoop_word hdfs://liuyazhuang121:9000/user/root/word
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/local/hadoop-2.5.2/bin/hadoop
Using Hadoop version 2.5.2
Looking for Hadoop streaming jar in /usr/local/hadoop-2.5.2...
Found Hadoop streaming jar: /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
Creating temp directory /tmp/word_count.root.20180114.050606.032324
Copying local files to hdfs:///user/root/tmp/mrjob/word_count.root.20180114.050606.032324/files/...
Running step 1 of 1...
packageJobJar: [/usr/local/hadoop-2.5.2/tmp/hadoop-unjar2522703497090634857/] [] /tmp/streamjob1355851303293562830.jar tmpDir=null
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
Total input paths to process : 1
number of splits:2
Submitting tokens for job: job_1515893542122_0003
Submitted application application_1515893542122_0003
The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0003/
Running job: job_1515893542122_0003
Job job_1515893542122_0003 running in uber mode : false
map 0% reduce 0%
map 33% reduce 0%
map 100% reduce 0%
map 100% reduce 100%
Job job_1515893542122_0003 completed successfully
Output directory: hdfs://liuyazhuang121:9000/output/hadoop_word
Counters: 49
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=262
File System Counters
FILE: Number of bytes read=486
FILE: Number of bytes written=305876
FILE: Number of large read operations=0
FILE: Number of read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=529
HDFS: Number of bytes written=262
HDFS: Number of large read operations=0
HDFS: Number of read operations=9
HDFS: Number of write operations=2
Job Counters
Data-local map tasks=2
Launched map tasks=2
Launched reduce tasks=1
Total megabyte-seconds taken by all map tasks=23237632
Total megabyte-seconds taken by all reduce tasks=11787264
Total time spent by all map tasks (ms)=22693
Total time spent by all maps in occupied slots (ms)=22693
Total time spent by all reduce tasks (ms)=11511
Total time spent by all reduces in occupied slots (ms)=11511
Total vcore-seconds taken by all map tasks=22693
Total vcore-seconds taken by all reduce tasks=11511
Map-Reduce Framework
CPU time spent (ms)=3150
Combine input records=0
Combine output records=0
Failed Shuffles=0
GC time elapsed (ms)=149
Input split bytes=206
Map input records=1
Map output bytes=392
Map output materialized bytes=492
Map output records=44
Merged Map outputs=2
Physical memory (bytes) snapshot=611057664
Reduce input groups=30
Reduce input records=44
Reduce output records=30
Reduce shuffle bytes=492
Shuffled Maps =2
Spilled Records=88
Total committed heap usage (bytes)=429916160
Virtual memory (bytes) snapshot=2661163008
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Streaming final output from hdfs://liuyazhuang121:9000/output/hadoop_word...
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"me" 1
"ooo" 2
"python" 2
"see" 1
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/word_count.root.20180114.050606.032324...
Removing temp directory /tmp/word_count.root.20180114.050606.032324...
The output shows the frequency of every word. Now run:
hadoop fs -ls /output/hadoop_word
to list the generated files:
[root@liuyazhuang121 source]# hadoop fs -ls /output/hadoop_word
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-01-14 13:06 /output/hadoop_word/_SUCCESS
-rw-r--r-- 1 root supergroup 262 2018-01-14 13:06 /output/hadoop_word/part-00000
Then run:
hadoop fs -cat /output/hadoop_word/part-00000
to view the output:
[root@liuyazhuang121 source]# hadoop fs -cat /output/hadoop_word/part-00000
"aaa" 1
"ab" 1
"abc" 1
"adc" 1
"bar" 2
"bbb" 2
"bc" 1
"bec" 1
"by" 1
"ccc" 2
"hadoop" 2
"hello" 2
"home" 2
"iii" 2
"is" 1
"labs" 1
"liuyazhuang" 2
"lyz" 2
"me" 1
"ooo" 2
"python" 2
"see" 1
"test" 2
"welcome" 1
"where" 1
"xxx" 2
"xxyy" 1
"you" 1
"your" 1
"yyy" 2我們可以看出,輸出了正確的結果。