Setting the Number of Map Tasks in Hadoop - 技术博客集

Programming / posted by houtizong 3 years ago

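Many documents state that, by default, the number of Mappers cannot be controlled directly, because it is determined by the size and number of the input files. By default, the job starts one Mapper for each block the input occupies. If the input consists of a huge number of files, each smaller than the HDFS blockSize, the number of Mappers equals the number of files (each file occupies its own block), which may push the number of Mappers past the limit and crash the job. This reasoning is correct, but only for the default case. With some custom settings, the number can be controlled.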
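To make the small-files effect concrete, here is a rough sketch of the default behavior described above. It is a simplified model, not Hadoop code: the class and method names are hypothetical, the 64 MB block size is an assumed value, and the model ignores any slack the real framework allows on the last split of a file.

```java
import java.util.Arrays;

final class SmallFilesMapperEstimate {
    // Rough model of the default case: one mapper per block-sized chunk,
    // and at least one mapper per file (splits never span files).
    static long estimateMappers(long[] fileSizes, long blockSize) {
        long mappers = 0;
        for (long size : fileSizes) {
            long blocks = (size + blockSize - 1) / blockSize; // ceiling division
            mappers += Math.max(1, blocks);
        }
        return mappers;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb; // assumed block size for this example

        long[] manySmallFiles = new long[10_000];
        Arrays.fill(manySmallFiles, 1 * mb);   // 10,000 files of 1 MB each
        long[] oneBigFile = { 10_000 * mb };   // the same data in a single file

        System.out.println("small files: " + estimateMappers(manySmallFiles, blockSize));
        System.out.println("one file:    " + estimateMappers(oneBigFile, blockSize));
    }
}
```

Under this model the same 10,000 MB of data needs 10,000 mappers when stored as small files but only 157 when stored as one file, which is why the small-files case is the one that tends to hit limits.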

In Hadoop, setting the number of Map tasks is not as direct as setting the number of Reduce tasks: there is no API that tells Hadoop exactly how many Map tasks to start.

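You may wonder: doesn't the API provide the method org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n)? Can't that value set the number of Map tasks? The method does exist, but its documentation says "Note: This is only a hint to the framework." In other words, the value is only a hint to the Hadoop framework and is not decisive. Even if you set it, you may not get the result you want.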

1. Introduction to InputFormat

Before setting the number of Map tasks, it is essential to understand some basics of Map-Reduce input.

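The interface org.apache.hadoop.mapred.InputFormat describes the input-specification of a Map-Reduce job. It splits all the input files into logical InputSplits, each of which is assigned to a single Mapper. It also provides the RecordReader implementation, which reads input records from a logical InputSplit and passes them to the Mapper for processing.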

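InputFormat has many concrete implementations, such as FileInputFormat (the abstract base class for file-based input), DBInputFormat (for database-based input, where the data comes from a table that can be queried with SQL), KeyValueTextInputFormat (a special FileInputFormat for plain text files, in which the file is broken into lines by a line feed or carriage return, and each line is divided into a key and a value by key.value.separator.in.input.line), CompositeInputFormat, DelegatingInputFormat, and others. The vast majority of applications use FileInputFormat or one of its subclasses.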
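As an illustration of the key/value division that KeyValueTextInputFormat performs on each line, the following standalone sketch splits a line at the first separator character. It does not use Hadoop classes. The class and method names are hypothetical, and both the tab default for key.value.separator.in.input.line and the "no separator means empty value" rule are my understanding of the behavior, not something the text above states.

```java
final class KeyValueLineSketch {
    // Divides a line into {key, value} at the first occurrence of the separator.
    // If the separator is absent, the whole line becomes the key and the value is empty.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        char separator = '\t'; // assumed default separator

        String[] kv = splitKeyValue("user42\tclicked\thome", separator);
        System.out.println("key=" + kv[0] + " value=" + kv[1]);

        String[] noSep = splitKeyValue("orphan-line", separator);
        System.out.println("key=" + noSep[0] + " value=[" + noSep[1] + "]");
    }
}
```

Only the first separator matters in this sketch, so any later separators stay inside the value.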

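From this brief introduction we know that the InputFormat determines the InputSplits, and each InputSplit is assigned to a single Mapper. The InputFormat therefore determines the actual number of Map tasks.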
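The relationship "one InputSplit, one Mapper" can be sketched with a toy model. Everything here is hypothetical and deliberately simplified: ToySplit and ToyInputFormat only stand in for the roles of InputSplit and InputFormat, and are not the Hadoop types.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyInputFormatDemo {
    /** A logical slice of the input, covering records [start, start + length). */
    static final class ToySplit {
        final int start;
        final int length;
        ToySplit(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for InputFormat: it alone decides how the input is cut. */
    interface ToyInputFormat {
        List<ToySplit> getSplits(int totalRecords);
    }

    /** Cuts the input into splits of at most recordsPerSplit records. */
    static final class FixedSizeFormat implements ToyInputFormat {
        private final int recordsPerSplit;
        FixedSizeFormat(int recordsPerSplit) {
            if (recordsPerSplit <= 0) {
                throw new IllegalArgumentException("recordsPerSplit must be positive");
            }
            this.recordsPerSplit = recordsPerSplit;
        }
        @Override
        public List<ToySplit> getSplits(int totalRecords) {
            List<ToySplit> splits = new ArrayList<>();
            for (int start = 0; start < totalRecords; start += recordsPerSplit) {
                splits.add(new ToySplit(start, Math.min(recordsPerSplit, totalRecords - start)));
            }
            return splits;
        }
    }

    /** The "framework" side: one map task per split, so the count is just the list size. */
    static int countMapTasks(ToyInputFormat format, int totalRecords) {
        return format.getSplits(totalRecords).size();
    }

    public static void main(String[] args) {
        System.out.println(countMapTasks(new FixedSizeFormat(100), 1000));
        System.out.println(countMapTasks(new FixedSizeFormat(300), 1000));
    }
}
```

The framework side never chooses a number here. Changing how the format cuts the input is the only way to change the task count, which is the point made above.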

2. Factors in FileInputFormat That Affect the Number of Maps

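In everyday use, FileInputFormat is the most commonly used InputFormat, and it has many concrete implementations. The factors analyzed below apply only to FileInputFormat and its subclasses; for any other InputFormat, check its own getSplits(JobConf job, int numSplits) implementation.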

The relevant code is in org.apache.hadoop.mapred.FileInputFormat.getSplits (from the hadoop-0.20.205.0 source code).
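As a rough, self-contained model of the split-size logic in that method, the sketch below shows how the numSplits hint, a minimum split size, and the block size could interact. It is written from memory rather than copied from the Hadoop source, so the class name, the single-file simplification, and the 1.1 slack factor should all be read as assumptions and checked against the actual getSplits implementation.

```java
final class SplitSizeSketch {
    // Assumed slack factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // The hint-derived goal size is capped by the block size,
    // and the result is raised to at least the minimum split size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Counts how many splits a single file of fileSize bytes produces.
    static long countSplits(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        long splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++; // the tail, possibly slightly larger than splitSize
        }
        return splits;
    }

    // Single-file simplification: the total input size equals the file size.
    static long estimateSplits(long fileSize, int numSplitsHint, long minSize, long blockSize) {
        long goalSize = fileSize / (numSplitsHint == 0 ? 1 : numSplitsHint);
        return countSplits(fileSize, computeSplitSize(goalSize, minSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileSize = 640 * mb;
        long blockSize = 64 * mb;

        System.out.println(estimateSplits(fileSize, 2, 1, blockSize));        // hint below the block count
        System.out.println(estimateSplits(fileSize, 20, 1, blockSize));       // hint above the block count
        System.out.println(estimateSplits(fileSize, 2, 128 * mb, blockSize)); // raised minimum split size
    }
}
```

In this model a hint smaller than the number of blocks has no effect (10 splits), a larger hint increases the number of splits (20), and raising the minimum split size above the block size reduces it (5). That is consistent with setNumMapTasks being only a hint.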


