0%

如何用pandas高速处理数据

因为需要对大量数据进行处理,有必要探讨比较下怎么操作具有较高的性能。

创建Dataframe数据

使用QA创建了一个包含6列和61006行的Dataframe。它包含了2020年11月16日至12月4日的OHLC和成交量数据。

1
2
3
4
5
6
7
import QUANTAXIS as QA

code = QA.QA_fetch_stock_list_adv().code.tolist()
daydata = QA.QA_fetch_stock_day_adv(code, '2020-11-16', '2020-12-04')
df = daydata.data
print(df.shape)
df.head(10)
you are using non-interactive mdoel quantaxis
(61001, 6)
open high low close volume amount
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08

显示dataframe的内存占用情况。

1
df.info(memory_usage = 'deep')
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 61001 entries, (2020-11-16 00:00:00, 000001) to (2020-12-04 00:00:00, 689009)
Data columns (total 6 columns):
open      61001 non-null float64
high      61001 non-null float64
low       61001 non-null float64
close     61001 non-null float64
volume    61001 non-null float64
amount    61001 non-null float64
dtypes: float64(6)
memory usage: 3.2 MB
1
2
3
4
5
6
7
%%time

df1 = df
col_name = df1.columns.tolist()
col_name.insert(6,'status')
df1 = df1.reindex(columns=col_name)
df1.head(10)
CPU times: user 1.72 ms, sys: 1.99 ms, total: 3.71 ms
Wall time: 3.21 ms
open high low close volume amount status
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09 NaN
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09 NaN
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08 NaN
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07 NaN
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07 NaN
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07 NaN
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07 NaN
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08 NaN
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07 NaN
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08 NaN

效率测试

测试1:使用下标循环

下标循环是通过循环一个下标数列,通过iloc去不断获取数据。

1
2
3
4
5
6
7
8
%%time

for i in range(len(df1)):
if (df1.iloc[i]['close'] - df1.iloc[i]['open']) >=0:
df1.iloc[i]['status'] = 1
else:
df1.iloc[i]['status'] = -1
df1.head(10)
CPU times: user 33.6 s, sys: 136 ms, total: 33.8 s
Wall time: 34.1 s
open high low close volume amount status
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09 1.0
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09 -1.0
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08 -1.0
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07 1.0
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07 1.0
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07 -1.0
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07 1.0
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08 1.0
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07 1.0
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08 1.0

测试2:Iterrows循环

该循环方式是通过iterrows进行循环,ind和row分别代表了每一行的index和内容。测试例子大概需要s,比起下标循环速度提升了321倍。

1
2
3
4
5
6
7
8
9
10
%%time

i = 0
for ind, row in df1.iterrows():
if row['close'] - row['open'] >=0:
df1.iloc[i]['status'] = 1
else:
df1.iloc[i]['status'] = -1
i += 1
df1.head(10)
CPU times: user 18.5 s, sys: 82.7 ms, total: 18.6 s
Wall time: 18.8 s
open high low close volume amount status
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09 1.0
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09 -1.0
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08 -1.0
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07 1.0
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07 1.0
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07 -1.0
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07 1.0
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08 1.0
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07 1.0
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08 1.0

测试3:Apply循环

1
2
3
4
%%time

df1['status'] = (df1['close']-df['open']).apply(lambda x: 1 if x >=0 else -1)
df1.head(10)
CPU times: user 23.8 ms, sys: 1.99 ms, total: 25.8 ms
Wall time: 35.6 ms
open high low close volume amount status
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09 1
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09 -1
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08 -1
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07 1
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07 1
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07 -1
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07 1
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08 1
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07 1
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08 1

测试4:numpy内置向量化函数

1
2
3
4
5
%%time
import numpy as np

df1['status'] = np.where((df1.close - df1.open)>=0, 1, -1)
df1.head(10)
CPU times: user 1.74 ms, sys: 2.99 ms, total: 4.72 ms
Wall time: 10.4 ms
open high low close volume amount status
date code
2020-11-16 000001 17.08 17.43 16.90 17.37 759856.0 1.308190e+09 1
000002 29.39 29.50 29.00 29.20 516576.0 1.509810e+09 -1
000004 31.15 31.46 30.11 30.61 72456.0 2.223127e+08 -1
000005 2.68 2.70 2.65 2.69 64372.0 1.725762e+07 1
000006 5.66 5.74 5.62 5.72 98253.0 5.592563e+07 1
000007 9.42 9.42 9.18 9.20 22567.0 2.094628e+07 -1
000008 2.72 2.74 2.70 2.73 171930.0 4.678304e+07 1
000009 7.71 7.88 7.66 7.82 320180.0 2.492149e+08 1
000010 4.19 4.27 4.10 4.24 71661.0 3.004320e+07 1
000011 13.76 14.40 13.58 14.39 105639.0 1.489735e+08 1

结论

for循环及向量化的使用对性能有着较大的影响:
1、如果确定需要使用循环,则应始终选择apply方法。
2、否则,vectorization总是更好的,因为它更快!