Question
I have a 4-phase MapReduce job, as follows:
Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output
I noticed that Hadoop has a ChainMapper class, which can chain several mappers into one big mapper and save the disk I/O cost between map phases. There is also a ChainReducer class, but it is not a real "chain reducer". It only supports jobs of the form:
[Map+ / Reduce Map*]
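To make the pattern concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that checks whether a stage sequence fits the single-job shape above. The `ChainShape` class and the `M`/`R` letter encoding are hypothetical names made up for illustration; they are not part of Hadoop.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks a stage sequence against the [Map+ / Reduce Map*] shape. */
final class ChainShape {
    // Assumed encoding for illustration: 'M' = one map stage, 'R' = one reduce stage.
    // One or more mappers, exactly one reducer, then zero or more mappers.
    private static final Pattern SINGLE_JOB = Pattern.compile("M+RM*");

    /** Returns true if the whole sequence fits into one chained job. */
    static boolean fitsSingleChainJob(String stages) {
        return SINGLE_JOB.matcher(stages).matches();
    }

    public static void main(String[] args) {
        System.out.println(fitsSingleChainJob("MMRM"));  // true: Map Map / Reduce Map
        System.out.println(fitsSingleChainJob("MRRRR")); // false: more than one reducer
    }
}
```

Under this encoding, the pipeline in the question is `MRRRR`, which does not fit, because the pattern allows only one reduce stage per job.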
I know I can set up four MR jobs for my task and use the default mapper for the last three. But that costs a lot of disk I/O, since each reducer has to write its result to disk so that the following mapper can read it. Is there any other built-in Hadoop feature for chaining my reducers to lower the I/O cost?
I am using Hadoop 1.0.4.
Recommended answer
I don't think the output of one reducer can be passed directly to another reducer. I would have gone for this:
Input -> Map1 -> Reduce1 ->
Identity mapper -> Reduce2 ->
Identity mapper -> Reduce3 ->
Identity mapper -> Reduce4 -> Output
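The layout above can be sketched as a small planning routine in plain Java (no Hadoop dependency): every reduce stage closes one job, and a reduce stage with no mapper in front of it gets an identity mapper. `JobPlanner`, `plan`, and the stage-name strings are hypothetical and only illustrate the splitting rule; a real driver would still have to configure and run each job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical planner: splits a stage sequence into one MapReduce job per reducer. */
final class JobPlanner {
    /** Stage names starting with "Reduce" are treated as reducers, all others as mappers. */
    static List<String> plan(List<String> stages) {
        List<String> jobs = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // mappers waiting for their reducer
        for (String stage : stages) {
            if (stage.startsWith("Reduce")) {
                if (pending.isEmpty()) {
                    // No mapper precedes this reducer: fall back to a pass-through mapper.
                    pending.add("IdentityMapper");
                }
                pending.add(stage);
                jobs.add(String.join(" -> ", pending));
                pending.clear();
            } else {
                pending.add(stage);
            }
        }
        if (!pending.isEmpty()) {
            // Trailing mappers are out of scope for this sketch.
            throw new IllegalArgumentException("pipeline must end with a reduce stage");
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("Map1", "Reduce1", "Reduce2", "Reduce3", "Reduce4");
        for (String job : plan(pipeline)) {
            System.out.println(job);
        }
    }
}
```

For the pipeline in the question this yields four jobs, so the output of each job is still written out and read back by the next one; the sketch only shows the job boundaries, not a way around the intermediate I/O.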
In the Hadoop 2.x series, within a single job, you can chain mappers before the reducer with ChainMapper and chain mappers after the reducer with ChainReducer.