Efficient file reading in Python when the text must be split on '\n'
Problem description
I've traditionally been reading in files with:
file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')
and
with open(os.path.join(txtdatapath,pathfilename), "r") as data:
    datalines = (line.rstrip('\n') for line in data)
    for record in datalines:
        split_line = record.split(',')
        if len(split_line) > 1:
            ...  # processing omitted in the original post
But it seems that when I process these files with multiprocessing I get a MemoryError. How can I best read files line by line when the text file I'm reading needs to be split on '\n'?
Here is the multiprocessing code:
pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)
while not op_list.ready():
    print("Number of files left to process: {}".format(op_list._number_left))
    time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()
Here is the error log:
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError
I'm trying to install pathos as Mike has kindly suggested but I'm running into issues. Here is my install command:
pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre
But here are the error messages that I get:
Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master
Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft
    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
Storing debug log for failure in C:\Users\xxx\pip\pip.log
I'm installing on Windows 7 64 bit. In the end I managed to install with easy_install.
But now I get a failure because I cannot open that many files:
Finished reading in Exposures...
Reading Samples from: C:XXXXXXXXX
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multiprocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\xx.csv'
Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open a mapped file, and then output a dictionary. Here is an example of how I currently do it. What would be the smart way to do this with pathos?
def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    # ... (function body omitted in the original post)
    return com_dict
fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)
How to use pathos.multiprocessing
Recommended answer
Suppose we have file1.txt:
hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123
file2.txt:
1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye
and so on, through file5.txt:
1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye
I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.
>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>>
>>> def rstrip(line):
...     return line.rstrip()
...
>>> # get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']
However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the process of pool.map operation
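As a rough illustration of the "iterated" map idea, here is a minimal sketch using the standard library's `imap_unordered`, which yields each result as soon as it is finished so a running count can be reported. The helper names `count_fields` and `map_with_progress` are hypothetical and not part of the original post or of pathos; the sketch assumes a pool object that exposes `imap_unordered`, as `multiprocessing.Pool` does.

```python
def count_fields(line):
    """Hypothetical stand-in for real per-item work: count comma-separated fields."""
    return len(line.split(","))


def map_with_progress(func, items, pool, report=print):
    """Apply func to items through pool.imap_unordered, reporting after each result.

    Results come back in completion order, not input order.
    """
    items = list(items)
    total = len(items)
    results = []
    for done, result in enumerate(pool.imap_unordered(func, items), start=1):
        results.append(result)
        report("{} of {} finished".format(done, total))
    return results
```

With a `multiprocessing.Pool()` created under an `if __name__ == "__main__":` guard (needed on Windows), a call such as `map_with_progress(count_fields, lines, pool)` would print one progress line per finished item, without touching the private `_number_left` attribute.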
Get pathos here: https://github.com/uqfoundation
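The "Too many open files" traceback in the question comes from opening every file before handing it to the pool. One possible way around it, sketched here with assumptions rather than taken from the original answer, is to pass file names to the workers and let each worker open, stream, and close its own file. Iterating over a text-mode file object already yields one line per '\n', so the whole file never has to be read and split in one go. The names `read_records` and `read_all` are hypothetical.

```python
def read_records(path):
    """Open one file inside the worker and stream it line by line.

    The with-block closes the handle as soon as the file is consumed,
    so at most one file per worker is open at any time.
    """
    records = []
    with open(path, "r") as handle:
        for line in handle:  # lines are split on '\n' by the file iterator
            fields = line.rstrip("\n").split(",")
            if len(fields) > 1:  # same filter as in the question's second snippet
                records.append(fields)
    return records


def read_all(paths, pool):
    """Map read_records over file names; only the names are sent to the workers."""
    return pool.map(read_records, list(paths), chunksize=1)
```

Here `pool` is assumed to be any object with a `map` method that accepts `chunksize`, such as `multiprocessing.Pool()`. Each worker still sends its full list of records back to the parent, so if memory remains tight it may help to return a smaller summary per file (for example the dictionary the question builds) instead of every row.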
