[Spark 96] RDD API: combineByKey

Programming / posted by houtizong 3 years ago

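1. How combineByKey works

RDD provides many APIs for elements of type (K, V). These APIs live in the PairRDDFunctions class and become available on an RDD through a Scala implicit conversion. Under the hood, many of them (reduceByKey and groupByKey, for example) are implemented on top of combineByKey. combineByKey is itself also one of the APIs that RDD exposes to Spark developers.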
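To illustrate the relationship, here is a small sketch that sums values per key twice, once with reduceByKey and once with the equivalent combineByKey call. It assumes Spark 1.3 or later running in local mode (older versions also need `import org.apache.spark.SparkContext._` for the implicit conversion); the object name and sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceViaCombineByKey {
  // Sums values per key with reduceByKey and with the equivalent
  // combineByKey call, and returns both results for comparison.
  def run(sc: SparkContext): (Map[String, Int], Map[String, Int]) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 2)

    val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

    val viaCombine = pairs.combineByKey(
      (v: Int) => v,                  // createCombiner: the first value is the combiner
      (c: Int, v: Int) => c + v,      // mergeValue: fold another value into the combiner
      (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge two partial sums
    ).collect().toMap

    (viaReduce, viaCombine)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reduce-via-combine").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaReduce, viaCombine) = run(sc)
      println(viaReduce)
      println(viaCombine)
    } finally {
      sc.stop()
    }
  }
}
```

Both maps should hold the same per-key sums, since reduceByKey only fixes C = V and reuses one function for both merge steps.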

First, look at combineByKey's documentation and source:

    /**
     * Generic function to combine the elements for each key using a custom set of aggregation
     * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
     * Note that V and C can be different -- for example, one might group an RDD of type
     * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
     *
     * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
     * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
     * - `mergeCombiners`, to combine two C's into a single one.
     *
     * In addition, users can control the partitioning of the output RDD, and whether to perform
     * map-side aggregation (if a mapper can produce multiple items with the same key).
     */
    def combineByKey[C](createCombiner: V => C,
        mergeValue: (C, V) => C,
        mergeCombiners: (C, C) => C,
        partitioner: Partitioner,
        mapSideCombine: Boolean = true,
        serializer: Serializer = null): RDD[(K, C)] = {
      require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
      if (keyClass.isArray) {
        if (mapSideCombine) {
          throw new SparkException("Cannot use map-side combining with array keys.")
        }
        if (partitioner.isInstanceOf[HashPartitioner]) {
          throw new SparkException("Default partitioner cannot partition array keys.")
        }
      }
      val aggregator = new Aggregator[K, V, C](
        self.context.clean(createCombiner),
        self.context.clean(mergeValue),
        self.context.clean(mergeCombiners))
      if (self.partitioner == Some(partitioner)) {
        self.mapPartitions(iter => {
          val context = TaskContext.get()
          new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
        }, preservesPartitioning = true)
      } else {
        new ShuffledRDD[K, V, C](self, partitioner)
          .setSerializer(serializer)
          .setAggregator(aggregator)
          .setMapSideCombine(mapSideCombine)
      }
    }
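The last if/else in the source is worth a closer look: when the input RDD is already partitioned by the same partitioner, the values are combined in place with mapPartitions; otherwise a ShuffledRDD is created. The sketch below tries to make that visible by checking whether the resulting RDD has a shuffle dependency. It assumes Spark 1.3 or later in local mode; the object name, helper name and sample data are hypothetical.

```scala
import org.apache.spark.{Dependency, HashPartitioner, ShuffleDependency, SparkConf, SparkContext}

object CombineByKeyShuffleCheck {
  // Returns (unpartitioned input shuffles?, pre-partitioned input shuffles?).
  def run(sc: SparkContext): (Boolean, Boolean) = {
    val partitioner = new HashPartitioner(2)
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    val createCombiner = (v: Int) => v
    val mergeValue = (c: Int, v: Int) => c + v
    val mergeCombiners = (c1: Int, c2: Int) => c1 + c2

    // The input has no partitioner, so the ShuffledRDD branch is taken.
    val shuffled = pairs.combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)

    // The input already uses the same partitioner, so the mapPartitions branch is taken.
    val local = pairs
      .partitionBy(partitioner)
      .combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)

    def hasShuffleDependency(deps: Seq[Dependency[_]]): Boolean =
      deps.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

    (hasShuffleDependency(shuffled.dependencies), hasShuffleDependency(local.dependencies))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Note that partitionBy itself shuffles the data once; the point is only that the combineByKey step that follows does not shuffle again.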

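combineByKey aggregates the data in an RDD by key (think of the Combiner in Hadoop MapReduce, which performs reduce work on the map side). The aggregation logic is supplied to combineByKey as user-defined functions.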

As the source above shows, combineByKey turns an RDD of type (K, V) into an RDD of type (K, C), where C and V may differ.
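To make the difference between V and C concrete, here is a minimal plain-Scala sketch of the three functions for the case mentioned in the doc comment, where V is Int and C is a list of Int. Nothing here touches Spark; the object name GroupingCombiners is made up for illustration.

```scala
object GroupingCombiners {
  // V = Int, C = List[Int]
  val createCombiner: Int => List[Int] = v => List(v)                          // one-element list
  val mergeValue: (List[Int], Int) => List[Int] = (c, v) => c :+ v             // append a value
  val mergeCombiners: (List[Int], List[Int]) => List[Int] = (c1, c2) => c1 ++ c2 // concatenate

  def main(args: Array[String]): Unit = {
    val c1 = mergeValue(createCombiner(1), 2) // values 1 and 2 seen in one place
    val c2 = createCombiner(3)                // value 3 seen somewhere else
    println(mergeCombiners(c1, c2))           // prints List(1, 2, 3)
  }
}
```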

combineByKey takes three important functions as parameters:

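createCombiner: while iterating over the RDD's records, for each (k, v), if this is the first time combineByKey has encountered the key k (of type K) in the partition being processed, it calls createCombiner on the value. The function turns v into c (of type C, the type of the aggregation object), and c serves as the initial value of the aggregation object.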

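mergeValue: while iterating over the RDD's records, for each (k, v), if this is not the first time combineByKey has encountered the key k (that is, it is the second, third, ... time), it calls mergeValue. The function folds v into the aggregation object (of type C). Its type is (C, V) => C: the C argument is the aggregation object accumulated so far, and merging v into it yields the new aggregation object.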

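mergeCombiners: because combineByKey runs in a distributed environment, each partition of the RDD performs its combineByKey work separately, and the per-partition results must then be aggregated one final time. Its type is (C, C) => C, and each argument is an aggregation object produced by one partition.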

From the analysis above, the flow of combineByKey is:

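Suppose a group of <K, V> records sharing the same K flows into combineByKey() one by one. createCombiner initializes c from the value of the first record (for example, c = value). From the second record on, each arriving record updates c via mergeValue(c, record.value); to sum all the values, for instance, use c = c + record.value. Once every record has gone through mergeValue(), the result is c. Now suppose another group of records (with the same key as the first group) arrives one by one, and combineByKey() computes c' for it in the same way. To get the combined result for both groups, compute final c = mergeCombiners(c, c').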
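The flow above can be modelled in plain Scala, with each inner Seq standing in for one partition. This is only a mental model under that assumption, not Spark's actual implementation (which goes through Aggregator and usually a shuffle); the object and method names are hypothetical.

```scala
object CombineByKeyModel {
  // Models one partition: fold its records into a per-key map of combiners.
  def combinePartition[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc.updated(k, createCombiner(v)) // first time k is seen in this partition
        case Some(c) => acc.updated(k, mergeValue(c, v))  // k already has a combiner
      }
    }

  // Models the whole operation: combine each partition on its own,
  // then merge the per-partition combiners key by key.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p, createCombiner, mergeValue))
      .foldLeft(Map.empty[K, C]) { (merged, partial) =>
        partial.foldLeft(merged) { case (acc, (k, c)) =>
          acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
        }
      }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(("a", 1), ("b", 2), ("a", 3)),
      Seq(("a", 4), ("b", 5)))
    // Sum per key: "a" gets 1 + 3 in the first partition, then 4 is merged in.
    println(combineByKey[String, Int, Int](partitions, v => v, _ + _, _ + _))
  }
}
```

For key "a", the first partition yields c = 4 and the second yields c' = 4, so mergeCombiners produces 8; key "b" ends up with 7.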

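2. combineByKey examples

2.1 Computing an average

Suppose we have weather data in which each line holds a date and that day's temperature (for example, 20150601 27). combineByKey can then compute the average temperature per month, roughly as follows:

C has type (Int, Int): for a given month, it holds the running sum of temperatures and the number of days counted in that month.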

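    // Assumes an existing SparkContext `sc` (for example in spark-shell);
    // "weather_data" is a placeholder path.
    val rdd = sc.textFile("weather_data")
    // Each line looks like "20150601 27"; the key is the year-month prefix, e.g. "201506".
    val rdd2 = rdd.map(x => x.split(" ")).map(x => (x(0).substring(0, 6), x(1).toInt))
    // createCombiner only receives the value, not the key.
    val createCombiner = (v: Int) => (v, 1)
    val mergeValue = (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1)
    val mergeCombiners = (c1: (Int, Int), c2: (Int, Int)) => (c1._1 + c2._1, c1._2 + c2._2)
    val rdd3 = rdd2.combineByKey(createCombiner, mergeValue, mergeCombiners)
    // Convert to Double before dividing to avoid integer division.
    rdd3.foreach(x => println(x._1 + ": average temperature is " + x._2._1.toDouble / x._2._2))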
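The reason C carries both the sum and the count, rather than a running average, is that per-partition averages cannot be merged correctly: partitions holding different numbers of days would be weighted equally. A small plain-Scala illustration, using made-up temperatures split across two hypothetical partitions:

```scala
object AverageCombiner {
  // Temperatures for one month, split across two hypothetical partitions.
  val partitions: Seq[Seq[Int]] = Seq(Seq(30, 20), Seq(10))

  // Wrong: averaging the per-partition averages weights partitions, not days.
  def averageOfAverages: Double = {
    val perPartition = partitions.map(p => p.sum.toDouble / p.size)
    perPartition.sum / perPartition.size
  }

  // Right: carry (sum, count) through the merge and divide once at the end.
  def averageFromSumAndCount: Double = {
    val (sum, count) = partitions
      .map(p => (p.sum, p.size))
      .reduce((c1, c2) => (c1._1 + c2._1, c1._2 + c2._2))
    sum.toDouble / count
  }

  def main(args: Array[String]): Unit = {
    println(averageOfAverages)      // prints 17.5
    println(averageFromSumAndCount) // prints 20.0
  }
}
```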


