Reduce a key-value pair into a key-list pair with Apache Spark
Problem description
I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, Vn) into one key-multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something along the lines of:
My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
The error that I get when this runs is:
'NoneType' object has no attribute 'append'.
My keys are integers and the values V1, ..., Vn are tuples. My goal is to create a single pair consisting of the key and a list of the values (tuples).
Recommended answer
Map and ReduceByKey
The input type and output type of reduce must be the same, so if you want to aggregate into a list, you first have to map each input value to a list. Afterwards you combine those lists into one list.
Combining lists
You'll need a way to combine lists into one list. Python provides several ways to do this.
append modifies the first list in place and always returns None. It adds its argument as a single element:
x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]
extend does the same, but unwraps the list it is given:
x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]
Both methods return None, but you need an operation that returns the combined list, so just use the plus sign.
x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
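To see why this matters for a reduce function, here is a minimal plain-Python sketch that uses functools.reduce as a stand-in for Spark's reduce. The sample tuples and variable names are made up for illustration; no Spark installation is assumed.

```python
from functools import reduce

# Hypothetical sample values (tuples, as in the question).
values = [(1, 2), (3, 4), (5, 6)]

# Wrap each value in a one-element list so that the input and
# output types of the reduce function are the same (list -> list).
wrapped = [[v] for v in values]

# `+` returns a new combined list, so it works as a reduce function.
combined = reduce(lambda a, b: a + b, wrapped)

# `append` returns None, so the second reduction step receives
# None as `a` and fails with the error from the question.
try:
    reduce(lambda a, b: a.append(b), [list(w) for w in wrapped])
    append_error = None
except AttributeError as exc:
    append_error = str(exc)

print(combined)
print(append_error)
```

The first call succeeds because every intermediate result is a list; the second fails as soon as the result of the first step (None) is passed back in as the accumulator.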
Spark
# `spark` is assumed to be an existing SparkContext
file = spark.textFile("hdfs://...")
counts = (
    file.flatMap(lambda line: line.split(" "))
    .map(lambda actor: (actor.split(",")[0], actor))
    # transform each value into a one-element list
    .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
    # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
    .reduceByKey(lambda a, b: a + b)
)
CombineByKey
It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
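For reference, combineByKey takes three functions: one that creates the initial accumulator from the first value of a key, one that merges a further value into an accumulator, and one that merges two accumulators. The sketch below only simulates that flow in plain Python on two made-up "partitions"; the function names, the sample data, and the combine_partition helper are all hypothetical and are not part of Spark.

```python
def create_combiner(value):
    # First value seen for a key within a partition: start a new list.
    return [value]

def merge_value(acc, value):
    # A further value for the same key within the same partition.
    acc.append(value)
    return acc

def merge_combiners(acc1, acc2):
    # Lists for the same key that come from different partitions.
    acc1.extend(acc2)
    return acc1

def combine_partition(pairs):
    # Plain-Python stand-in for the per-partition combine step.
    out = {}
    for key, value in pairs:
        if key in out:
            out[key] = merge_value(out[key], value)
        else:
            out[key] = create_combiner(value)
    return out

# Two hypothetical partitions of (key, value) pairs.
partition_a = [(1, ("a", 1)), (2, ("b", 2)), (1, ("c", 3))]
partition_b = [(1, ("d", 4)), (2, ("e", 5))]

# Merge the per-partition results key by key.
result = {}
for partial in (combine_partition(partition_a), combine_partition(partition_b)):
    for key, acc in partial.items():
        if key in result:
            result[key] = merge_combiners(result[key], acc)
        else:
            result[key] = acc

print(result)

# On a real pair RDD the corresponding call would be:
# rdd.combineByKey(create_combiner, merge_value, merge_combiners)
```

Note that here the helper functions return the list explicitly after calling append or extend, which avoids the None problem described earlier.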
GroupByKey
It's also possible to solve this with groupByKey, but it reduces parallelization and could therefore be much slower for big data sets.
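As a rough illustration of the result shape, the following plain-Python sketch groups made-up (key, value) pairs the way groupByKey followed by mapValues(list) would on a pair RDD. The sample data and variable names are assumptions for illustration only, and the sketch says nothing about Spark's performance.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs; values are tuples as in the question.
pairs = [(1, ("a", 1)), (2, ("b", 2)), (1, ("c", 3))]

# Collect all values that share a key into one list.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Sort by key only to make the printed output deterministic.
key_list_pairs = sorted(grouped.items())
print(key_list_pairs)

# On a real pair RDD the corresponding call would be:
# rdd.groupByKey().mapValues(list)
```

The mapValues(list) step is there because groupByKey yields an iterable of values per key rather than a Python list.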
