Question
As a long-time SAS user, I am exploring switching to Python and pandas.
However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.
Is there something similar in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Accepted answer
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files, caused by some complex Python internal issues (this is vague, but it has been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or a memory-mapped file, np.memmap), but it's something I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) and then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
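A minimal sketch of the chunked approach, assuming a hypothetical file large.csv; the column name 'value' and the chunksize are only illustrative:

import pandas as pd

# Read the CSV in pieces of 1,000 rows; each iteration yields a DataFrame
chunks = pd.read_csv('large.csv', chunksize=1000)

# Filter each chunk as it arrives and keep only the rows you need,
# so the full raw file never sits in memory at once
filtered = []
for chunk in chunks:
    filtered.append(chunk[chunk['value'] > 0])  # 'value' is a placeholder column

result = pd.concat(filtered, ignore_index=True)

# If the full table does fit in memory and you only want to avoid the
# one-big-slurp parse, you can concatenate the chunks directly:
# df = pd.concat(pd.read_csv('large.csv', chunksize=1000), ignore_index=True)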