Large, persistent DataFrame in pandas

2022-10-13 Python development questions

Question

As a long-time SAS user, I am exploring a switch to Python and pandas.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.

Recommended answer

In principle it shouldn't run out of memory, but read_csv currently has memory problems on large files, caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
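The two workarounds mentioned in the answer could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names read_csv_in_chunks and csv_to_memmap are hypothetical, the memory-mapped variant assumes a purely numeric CSV with a single header row, and the caller is assumed to know the row and column counts in advance.

```python
import csv

import numpy as np
import pandas as pd


def read_csv_in_chunks(source, chunksize=1000):
    """Read a CSV in pieces of `chunksize` rows, then combine the pieces.

    Hypothetical helper: the name and default are illustrative only.
    """
    # With iterator=True / chunksize set, read_csv returns an iterator
    # that yields one DataFrame per chunk instead of parsing everything at once.
    reader = pd.read_csv(source, iterator=True, chunksize=chunksize)
    chunks = [chunk for chunk in reader]
    # ignore_index=True gives the result one continuous 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)


def csv_to_memmap(csv_path, mmap_path, n_rows, n_cols):
    """Transcribe a numeric CSV row by row into a disk-backed array.

    Assumptions (not from the original answer): every field parses as a
    float, the file has exactly one header row, and n_rows / n_cols are
    known beforehand.
    """
    # mode="w+" creates (or overwrites) the backing file at the given shape.
    out = np.memmap(mmap_path, dtype="float64", mode="w+", shape=(n_rows, n_cols))
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        next(rows)  # skip the header row
        for i, row in enumerate(rows):
            out[i, :] = [float(value) for value in row]
    out.flush()  # make sure the data is written to disk
    return out
```

Note that the chunked version still ends up with the whole DataFrame in memory. It only avoids parsing the entire file in one pass. The memory-mapped version keeps the data on disk, at the cost of giving up column names and mixed dtypes.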
