The MongoDB MapReduce Programming Model
MongoDB / houtizong · posted 3 years ago · 282

The official MongoDB site introduces MapReduce as follows:

Map/reduce in MongoDB is useful for batch processing of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Often, in a situation where you would have used GROUP BY in SQL, map/reduce is the right tool in MongoDB.

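In short: MongoDB's map/reduce is meant for batch processing and aggregation. Like a Hadoop job, it reads all of its input from a collection and writes its output to a collection, and it typically covers the cases where you would reach for GROUP BY in SQL.

Using MapReduce means implementing two functions, map and reduce. MongoDB calls the map function once for every document in the collection; the map function calls emit(key, value), and the emitted values are grouped by key and handed to the reduce function. Both functions are written in JavaScript, and the operation is run either with db.runCommand (the mapreduce command) or with the db.<collection>.mapReduce() shell helper.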

The syntax of the mapreduce command is:

db.runCommand( {
    mapreduce : <collection>,
    map : <mapfunction>,
    reduce : <reducefunction>,
    out : <see output options below>
    [, query : <query filter object>]
    [, sort : <sorts the input objects using this key. Useful for optimization, like sorting by the emit key for fewer reduces>]
    [, limit : <number of objects to return from collection, not supported with sharding>]
    [, keeptemp: <true|false>]
    [, finalize : <finalizefunction>]
    [, scope : <object where fields go into javascript global scope >]
    [, jsMode : true]
    [, verbose : true]
} );

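Parameters:

mapreduce: the collection to operate on

map: the mapping function (emits the key/value pairs that become the input to the reduce function)

reduce: the aggregation function

query: filter applied to the input documents

sort: sort order applied to the input documents

limit: maximum number of input documents to process

out: the collection that receives the results (the fallback to a temporary collection that is dropped when the client disconnects applies only to older server versions where out could be omitted; the syntax above lists it as required)

keeptemp: whether to keep the temporary collection

finalize: final processing function (applied to each reduce result before it is stored in the output collection)

scope: variables made available to map, reduce, and finalize

verbose: include detailed timing statistics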

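The rest of this post walks through a concrete example.

Scenario: count the documents in the students collection and report the number of students per class, grouped by classid. The initial data is: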

> db.students.find()
{ "_id" : ObjectId("5031143350f2481577ea81e5"), "classid" : 1, "age" : 20, "name" : "kobe" }
{ "_id" : ObjectId("5031144a50f2481577ea81e6"), "classid" : 1, "age" : 23, "name" : "nash" }
{ "_id" : ObjectId("5031145a50f2481577ea81e7"), "classid" : 2, "age" : 18, "name" : "james" }
{ "_id" : ObjectId("5031146a50f2481577ea81e8"), "classid" : 2, "age" : 19, "name" : "wade" }
{ "_id" : ObjectId("5031147450f2481577ea81e9"), "classid" : 2, "age" : 19, "name" : "bosh" }
{ "_id" : ObjectId("5031148650f2481577ea81ea"), "classid" : 2, "age" : 25, "name" : "allen" }
{ "_id" : ObjectId("5031149b50f2481577ea81eb"), "classid" : 1, "age" : 19, "name" : "howard" }
{ "_id" : ObjectId("503114a750f2481577ea81ec"), "classid" : 1, "age" : 22, "name" : "paul" }
{ "_id" : ObjectId("503114cd50f2481577ea81ed"), "classid" : 2, "age" : 24, "name" : "shane" }

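Map: grouping

The map function outputs key/value pairs by calling emit(key, value), and it accesses the document currently being processed through this. Here the map function groups the students collection by classid.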

> map=function(){emit(this.classid,1)}
function () {
    emit(this.classid, 1);
}
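To make the grouping step concrete, the map phase can be imitated in plain JavaScript outside MongoDB. This is only a sketch: the `emit` function and the `groups` map below are stand-ins written for this illustration, not MongoDB internals, and the sample array copies the documents listed above without their `_id` fields.

```javascript
// Sample documents copied from the students collection above (_id omitted).
const students = [
  { classid: 1, age: 20, name: "kobe" },
  { classid: 1, age: 23, name: "nash" },
  { classid: 2, age: 18, name: "james" },
  { classid: 2, age: 19, name: "wade" },
  { classid: 2, age: 19, name: "bosh" },
  { classid: 2, age: 25, name: "allen" },
  { classid: 1, age: 19, name: "howard" },
  { classid: 1, age: 22, name: "paul" },
  { classid: 2, age: 24, name: "shane" }
];

// The article's map function, unchanged.
const map = function () { emit(this.classid, 1); };

// Stand-in for the server side (an assumption of this sketch):
// emit appends each value to the list kept for its key.
const groups = new Map();
function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Call map once per document, with `this` bound to that document.
for (const doc of students) {
  map.call(doc);
}

// groups now holds: 1 -> [1, 1, 1, 1] and 2 -> [1, 1, 1, 1, 1]
console.log(groups);
```

Each key ends up with one emitted value per student, which is the {key, [values]} shape that reduce receives.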

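Reduce: aggregation

The reduce function takes the output of the map phase as its arguments: the emitted pairs are grouped into {key, [value1, value2, value3, ...]} and passed to reduce. The code is: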

> reduce=function(key,values){
... var x = 0;
... values.forEach(function(v){x+=v});
... return x;
... }
function (key, values) {
    var x = 0;
    values.forEach(function (v) {x += v;});
    return x;
}

The reduce function aggregates values. In the code above it simply sums the emitted 1s, which yields the number of documents in class 1 and in class 2.
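One property of this reduce function is worth checking explicitly. MongoDB may call reduce more than once for the same key and feed an earlier result back in as one of the values, so the return value has to have the same type as the emitted values and the final answer must not depend on how the values were batched. The sketch below runs in plain JavaScript; `reduceByLength` is a hypothetical wrong variant added only for contrast.

```javascript
// The article's reduce function, unchanged.
const reduce = function (key, values) {
  var x = 0;
  values.forEach(function (v) { x += v; });
  return x;
};

// Hypothetical variant that counts array entries instead of summing them.
const reduceByLength = function (key, values) {
  return values.length;
};

const emitted = [1, 1, 1, 1, 1]; // the five values emitted for classid 2

// Reducing everything in one call.
const direct = reduce(2, emitted);

// Reducing the first three values, then re-reducing that partial
// result together with the remaining two values.
const partial = reduce(2, emitted.slice(0, 3));
const rereduced = reduce(2, [partial].concat(emitted.slice(3)));

// The same two-step process with the length-based variant.
const wrongPartial = reduceByLength(2, emitted.slice(0, 3));
const wrongRereduced = reduceByLength(2, [wrongPartial].concat(emitted.slice(3)));

console.log(direct, rereduced);  // 5 5
console.log(wrongRereduced);     // 3
```

Summing survives re-reduction because a partial sum is itself a valid value to sum; counting array entries does not, since a partial result of 3 is then counted as a single entry.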

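Result: fetching the results

The computed results are read with db.<result collection>.find(), where the result collection is the one named by the out parameter. The variable result below only holds the status document returned by the command. The code is: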

> result=db.runCommand({
... mapreduce:"students",
... map:map,
... reduce:reduce,
... out:"students_result"
... });
{
    "result" : "students_result",
    "timeMillis" : 297,
    "counts" : {
        "input" : 9,
        "emit" : 9,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1
}
> db.students_result.find()
{ "_id" : 1, "value" : 4 }
{ "_id" : 2, "value" : 5 }

The output of the MapReduce run is stored in the students_result collection.

Finalize: formatting the output

finalize() can be used to reshape the result of reduce() before it is written out. The code is:

> finalize=function(key,value){return {classid:key,count:value};}
function (key, value) {
    return {classid:key, count:value};
}

With finalize defined, run MapReduce again and add the "finalize" parameter to the command; the results are then formatted by the finalize function above:

> result=db.runCommand({
... mapreduce:"students",
... map:map,
... reduce:reduce,
... out:"students_result",
... finalize:finalize
... });
{
    "result" : "students_result",
    "timeMillis" : 137,
    "counts" : {
        "input" : 9,
        "emit" : 9,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1
}
> db.students_result.find()
{ "_id" : 1, "value" : { "classid" : 1, "count" : 4 } }
{ "_id" : 2, "value" : { "classid" : 2, "count" : 5 } }
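The effect of finalize on the stored documents can be reproduced in plain JavaScript. In this sketch the `reduced` array is copied from the earlier students_result output, and the mapping step is an assumption about what the server does per key, inferred from the output above: the `_id` is kept and only `value` is replaced by the return value of finalize.

```javascript
// The article's finalize function, unchanged.
const finalize = function (key, value) {
  return { classid: key, count: value };
};

// Reduced results, copied from the students_result output without finalize.
const reduced = [
  { _id: 1, value: 4 },
  { _id: 2, value: 5 }
];

// Apply finalize once per key; _id stays, value is replaced.
const finalized = reduced.map(function (doc) {
  return { _id: doc._id, value: finalize(doc._id, doc.value) };
});

finalized.forEach(function (doc) {
  console.log(JSON.stringify(doc));
});
```

The printed documents have the same nested shape as the students_result output above, with classid and count inside value.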

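Query: filtering the input documents

As mentioned earlier, the MapReduce syntax has a query parameter that filters the target collection. Adding "query" to the command restricts which documents are fed into the map function; it filters the input, not the output collection. The code is: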

> result=db.runCommand({
... mapreduce:"students",
... map:map,
... reduce:reduce,
... out:"students_result",
... finalize:finalize,
... query:{age:{$gt:22}}
... });
{
    "result" : "students_result",
    "timeMillis" : 776,
    "counts" : {
        "input" : 3,
        "emit" : 3,
        "reduce" : 1,
        "output" : 2
    },
    "ok" : 1
}
> db.students_result.find()
{ "_id" : 1, "value" : { "classid" : 1, "count" : 1 } }
{ "_id" : 2, "value" : { "classid" : 2, "count" : 2 } }

As the output shows, with the query parameter added only documents with age > 22 are counted, so the count for each class is lower than before.

For more on MapReduce, see the official site: http://www.mongodb.org/display/DOCS/MapReduce
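The counts in that output (input 3, emit 3, reduce 1, output 2) can be traced with a small in-memory imitation of the whole pipeline. This is a sketch, not MongoDB's implementation: `simulate` and its `filter` argument are hypothetical names, with `filter` standing in for the query option. It relies on one documented behaviour, namely that reduce is not called for a key that has only a single value.

```javascript
// Sample documents copied from the students collection (_id omitted).
const students = [
  { classid: 1, age: 20, name: "kobe" },
  { classid: 1, age: 23, name: "nash" },
  { classid: 2, age: 18, name: "james" },
  { classid: 2, age: 19, name: "wade" },
  { classid: 2, age: 19, name: "bosh" },
  { classid: 2, age: 25, name: "allen" },
  { classid: 1, age: 19, name: "howard" },
  { classid: 1, age: 22, name: "paul" },
  { classid: 2, age: 24, name: "shane" }
];

// Stand-in for the server's emit: record every emitted pair.
let emitted = [];
function emit(key, value) { emitted.push([key, value]); }

// The article's three functions, unchanged.
const map = function () { emit(this.classid, 1); };
const reduce = function (key, values) {
  var x = 0;
  values.forEach(function (v) { x += v; });
  return x;
};
const finalize = function (key, value) { return { classid: key, count: value }; };

// Hypothetical helper: filter -> map -> group -> reduce -> finalize.
function simulate(docs, filter) {
  emitted = [];
  const counts = { input: 0, emit: 0, reduce: 0, output: 0 };
  docs.filter(filter).forEach(function (doc) {
    counts.input += 1;
    map.call(doc); // `this` is the current document
  });
  counts.emit = emitted.length;

  const groups = new Map();
  emitted.forEach(function ([key, value]) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  });

  const results = [];
  groups.forEach(function (values, key) {
    let value = values[0];
    if (values.length > 1) { // single-value keys skip reduce
      value = reduce(key, values);
      counts.reduce += 1;
    }
    results.push({ _id: key, value: finalize(key, value) });
  });
  counts.output = results.length;
  return { counts: counts, results: results };
}

const filtered = simulate(students, function (doc) { return doc.age > 22; });
console.log(JSON.stringify(filtered.counts));
console.log(JSON.stringify(filtered.results));
```

Only nash, allen, and shane pass the filter. Class 1 is left with a single emitted value, so reduce runs only for class 2, which is consistent with "reduce" : 1 in the command output. On a real server the reduce count can also depend on how values are batched, so the match here should be read as an illustration rather than a guarantee.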




Author: HouTiZong (侯体宗's blog)