Python: Sliding windowed mean, ignoring missing data
Problem description
I am currently processing an experimental time-series dataset that has missing values. I would like to calculate its sliding windowed mean along time while handling NaN values. The correct way for me to do this is to compute, inside each window, the sum of the finite elements and divide it by their count. This nonlinearity forces me to use non-convolutional methods, so this part of the process has become a severe time bottleneck. As a code example of what I am trying to accomplish, I present the following:
import numpy as np
# Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
# Mark random positions as missing (indices may repeat, so fewer than n_miss values may end up NaN)
data[np.random.randint(0, n - 1, n_miss)] = np.nan
# Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    # Window centered on count, clipped at the array bounds; integer division keeps the slice indices valid
    part_data = data[max(count - (win_size - 1) // 2, 0): min(count + (win_size + 1) // 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = np.nan
print('Input: ', data)
print('Output: ', result)
With output:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
Can this result be produced by NumPy operations, without using a for loop?
Recommended answer
Here's a convolution-based approach using np.convolve:
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
Please note that this output has one extra element on each side: np.convolve defaults to mode='full', so for win_size = 3 the result has data.size + 2 elements.
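To make the idea concrete, here is a small hand-checkable sketch of the same sum-over-count trick. The input values and the names `sums`, `counts`, `full` and `trimmed` are illustrative only and do not come from the original answer; the boolean mask is cast to int explicitly, which should be equivalent to passing `~mask` directly.

```python
import numpy as np

# Tiny input with one missing value, small enough to verify by hand
data = np.array([1.0, np.nan, 3.0, 5.0])
win_size = 3

mask = np.isnan(data)
K = np.ones(win_size, dtype=int)

# Windowed sum of the finite values (NaNs contribute 0)
sums = np.convolve(np.where(mask, 0, data), K)
# Windowed count of the finite values
counts = np.convolve((~mask).astype(int), K)

full = sums / counts      # mode='full' output: data.size + win_size - 1 elements
trimmed = full[1:-1]      # drop one element per side for win_size = 3

print(full.size)          # 6
print(trimmed)            # [1. 2. 4. 4.]
```

Each entry of `trimmed` is the mean of the finite values in the centered window, for example (1 + 3) / 2 = 2 at index 1, where the NaN is skipped.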
If you are working with 2D data, we can use SciPy's 2D convolution (scipy.signal.convolve2d).
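A minimal sketch of how the 2D case might look, assuming a square window and `scipy.signal.convolve2d` with `mode='same'` (zero-padded borders). The helper name `nanmean_2d` and the sample grid are hypothetical and not part of the original answer.

```python
import numpy as np
from scipy.signal import convolve2d

def nanmean_2d(data, win_size):
    """Sliding-window mean over a 2D array, ignoring NaNs (hypothetical helper)."""
    mask = np.isnan(data)
    K = np.ones((win_size, win_size), dtype=int)
    # mode='same' keeps the input shape; borders are zero-filled by default,
    # so padded cells add nothing to either the sum or the count
    sums = convolve2d(np.where(mask, 0.0, data), K, mode='same')
    counts = convolve2d((~mask).astype(int), K, mode='same')
    # Windows with no finite value give 0/0, which becomes NaN
    with np.errstate(invalid='ignore', divide='ignore'):
        return sums / counts

grid = np.array([[1.0, 2.0, np.nan],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])
out = nanmean_2d(grid, 3)
print(out[1, 1])  # mean of the 7 finite values in the full 3x3 window
```

With an odd `win_size`, `mode='same'` yields windows centered on each cell, so no manual trimming is needed in this variant.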
Approaches -
def original_app(data, win_size):
    # Compute mean with an explicit loop (reference implementation)
    result = np.zeros(data.size)
    for count in range(data.size):
        # Integer division keeps the slice bounds valid on Python 3
        part_data = data[max(count - (win_size - 1) // 2, 0):
                         min(count + (win_size + 1) // 2, data.size)]
        mask = np.isfinite(part_data)
        if np.sum(mask) != 0:
            result[count] = np.sum(part_data[mask]) / np.sum(mask)
        else:
            result[count] = np.nan
    return result

def numpy_app(data, win_size):
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    # Windowed sum of the non-NaN values divided by the windowed count of non-NaN values
    out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
    return out[1:-1]  # Slice out the one extra element on each side (valid for win_size = 3)
Sample run -
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
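The RuntimeWarning above comes from windows that contain no finite value, where the division is 0/0. One possible way to avoid it, sketched here as an assumption rather than as part of the original answer, is to divide only where the count is positive using the `out` and `where` arguments of `np.divide`. The name `numpy_app_quiet` is hypothetical, and the trimming assumes an odd `win_size`.

```python
import numpy as np

def numpy_app_quiet(data, win_size):
    """Variant of numpy_app that never divides by zero (assumes odd win_size)."""
    mask = np.isnan(data)
    K = np.ones(win_size, dtype=int)
    sums = np.convolve(np.where(mask, 0, data), K)
    counts = np.convolve((~mask).astype(int), K)
    # Pre-fill with NaN, then divide only where at least one finite value exists
    out = np.full(sums.shape, np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    # Trim the extra elements that mode='full' adds on each side
    pad = (win_size - 1) // 2
    return out[pad:out.size - pad]

sample = np.array([np.nan, np.nan, np.nan, 2.0, 4.0])
print(numpy_app_quiet(sample, 3))  # [nan nan  2.  3.  3.]
```

Windows that hold only NaNs keep the pre-filled NaN, so the output should match the original version without emitting the warning.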
Runtime test -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop
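One caveat when comparing outputs rather than timings: pandas rolling windows are right-aligned (trailing) by default, so the call timed above does not compute the same centered window as the loop and convolution versions. A small sketch, using illustrative data that is not from the original post, shows the difference and how `center=True` aligns the window:

```python
import numpy as np
import pandas as pd

data = np.array([1.0, np.nan, 3.0, 5.0])
series = pd.Series(data)

# Default: the window ends at the current element (trailing window)
trailing = series.rolling(window=3, min_periods=1).mean().to_numpy()
# center=True: the window is centered on the current element
centered = series.rolling(window=3, min_periods=1, center=True).mean().to_numpy()

print(trailing)  # [1. 1. 2. 4.]
print(centered)  # [1. 2. 4. 4.]
```

With `min_periods=1`, NaNs are skipped inside each window and a window made up entirely of NaNs yields NaN, which mirrors the behaviour asked for in the question.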