Problem description
I have a pandas DataFrame (heavily simplified here) which looks like this:
df
   datetime             user  type  msg
0  2012-11-11 15:41:08  u1    txt   hello world
1  2012-11-11 15:41:11  u2    txt   hello world
2  2012-11-21 17:00:08  u3    txt   hello world
3  2012-11-22 18:08:35  u4    txt   hello you
4  2012-11-22 18:08:37  u5    txt   hello you
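For anyone who wants to reproduce the problem, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the `datetime` column holds real `datetime64` values (parsed here with `pd.to_datetime`) rather than strings, which the question does not state.

```python
import pandas as pd

# Sample data copied from the printed frame above.
# Assumption: "datetime" is parsed into datetime64 values, not kept as text.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2012-11-11 15:41:08",
                "2012-11-11 15:41:11",
                "2012-11-21 17:00:08",
                "2012-11-22 18:08:35",
                "2012-11-22 18:08:37",
            ]
        ),
        "user": ["u1", "u2", "u3", "u4", "u5"],
        "type": ["txt"] * 5,
        "msg": [
            "hello world",
            "hello world",
            "hello world",
            "hello you",
            "hello you",
        ],
    }
)

print(df)
```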
What I would like to do now is get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:
   datetime             user  type  msg
0  2012-11-11 15:41:08  u1    txt   hello world
1  2012-11-11 15:41:11  u2    txt   hello world
3  2012-11-22 18:08:35  u4    txt   hello you
4  2012-11-22 18:08:37  u5    txt   hello you
The third row (index 2) is left out: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.
I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty DataFrame because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way to define a range for the "datetime" parameter? To illustrate, something like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help would, as always, be very much appreciated.
Recommended answer
This piece of code gives the expected output:
df[df.groupby("msg")["datetime"].diff().dt.total_seconds().fillna(0) <= 3]
I grouped the DataFrame on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between consecutive values within each group. The first row of each group has no predecessor, so its difference is missing (NaT); I filled those missing values with zero and then kept only the rows whose difference is 3 seconds or less.
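One caveat with this approach: because the first row of every "msg" group has its missing gap filled with zero, that row is always kept, even when the message occurs only once or its nearest repeat is far away. A possible variant, offered as a sketch rather than as part of the original answer, checks the gap to both the previous and the next row with the same message. The names `window`, `gap_prev`, and `gap_next` are illustrative, and the sample frame is rebuilt here on the assumption that `datetime` is a `datetime64` column.

```python
import pandas as pd

# Sample frame from the question (assumed datetime64 timestamps).
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2012-11-11 15:41:08",
                "2012-11-11 15:41:11",
                "2012-11-21 17:00:08",
                "2012-11-22 18:08:35",
                "2012-11-22 18:08:37",
            ]
        ),
        "user": ["u1", "u2", "u3", "u4", "u5"],
        "type": ["txt"] * 5,
        "msg": ["hello world"] * 3 + ["hello you"] * 2,
    }
)

# The gaps are only meaningful on time-ordered data.
df = df.sort_values("datetime")

window = pd.Timedelta(seconds=3)  # illustrative threshold name
by_msg = df.groupby("msg")["datetime"]

gap_prev = by_msg.diff()          # time since the previous row with the same msg
gap_next = by_msg.diff(-1).abs()  # time until the next row with the same msg

# Comparisons against NaT evaluate to False, so a row with no close
# neighbour on either side is dropped instead of being kept by default.
mask = (gap_prev <= window) | (gap_next <= window)
result = df[mask]

print(result)
```

On the sample data this selects the rows with index 0, 1, 3 and 4, matching the desired output, and a message that appears only once would be excluded.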
Before using the code above, make sure your DataFrame is sorted by datetime in ascending order.
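A minimal sketch of that preparation step is shown below. The unsorted frame `raw` is hypothetical, and it assumes the timestamps may arrive as strings, which is why they are converted with `pd.to_datetime` before sorting.

```python
import pandas as pd

# Hypothetical unsorted input with timestamps still stored as text.
raw = pd.DataFrame(
    {
        "datetime": [
            "2012-11-22 18:08:37",
            "2012-11-11 15:41:11",
            "2012-11-11 15:41:08",
        ],
        "msg": ["hello you", "hello world", "hello world"],
    }
)

# Convert to datetime64 so that differences come out as timedeltas.
raw["datetime"] = pd.to_datetime(raw["datetime"])

# Sort ascending by time; the index is renumbered here purely for readability.
df = raw.sort_values("datetime").reset_index(drop=True)

print(df)
```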