因为需要对大量数据进行处理,有必要探讨比较下怎么操作具有较高的性能。
创建Dataframe数据
使用QA创建了一个包含6列和61006行的Dataframe。它包含了2020年11月16日至12月4日的OHLC和成交量数据。
1 2 3 4 5 6 7 import QUANTAXIS as QAcode = QA.QA_fetch_stock_list_adv().code.tolist() daydata = QA.QA_fetch_stock_day_adv(code, '2020-11-16' , '2020-12-04' ) df = daydata.data print(df.shape) df.head(10 )
you are using non-interactive mdoel quantaxis
(61001, 6)
open
high
low
close
volume
amount
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
显示dataframe的内存占用情况。
1 df.info(memory_usage = 'deep' )
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 61001 entries, (2020-11-16 00:00:00, 000001) to (2020-12-04 00:00:00, 689009)
Data columns (total 6 columns):
open 61001 non-null float64
high 61001 non-null float64
low 61001 non-null float64
close 61001 non-null float64
volume 61001 non-null float64
amount 61001 non-null float64
dtypes: float64(6)
memory usage: 3.2 MB
1 2 3 4 5 6 7 %%time df1 = df col_name = df1.columns.tolist() col_name.insert(6 ,'status' ) df1 = df1.reindex(columns=col_name) df1.head(10 )
CPU times: user 1.72 ms, sys: 1.99 ms, total: 3.71 ms
Wall time: 3.21 ms
open
high
low
close
volume
amount
status
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
NaN
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
NaN
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
NaN
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
NaN
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
NaN
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
NaN
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
NaN
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
NaN
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
NaN
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
NaN
效率测试
测试1:使用下标循环
下标循环是通过循环一个下标数列,通过iloc去不断获取数据。
1 2 3 4 5 6 7 8 %%time for i in range (len (df1)): if (df1.iloc[i]['close' ] - df1.iloc[i]['open' ]) >=0 : df1.iloc[i]['status' ] = 1 else : df1.iloc[i]['status' ] = -1 df1.head(10 )
CPU times: user 33.6 s, sys: 136 ms, total: 33.8 s
Wall time: 34.1 s
open
high
low
close
volume
amount
status
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
1.0
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
-1.0
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
-1.0
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
1.0
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
1.0
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
-1.0
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
1.0
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
1.0
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
1.0
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
1.0
测试2:Iterrows循环
该循环方式是通过iterrows进行循环,ind和row分别代表了每一行的index和内容。测试例子大概需要s,比起下标循环速度提升了321倍。
1 2 3 4 5 6 7 8 9 10 %%time i = 0 for ind, row in df1.iterrows(): if row['close' ] - row['open' ] >=0 : df1.iloc[i]['status' ] = 1 else : df1.iloc[i]['status' ] = -1 i += 1 df1.head(10 )
CPU times: user 18.5 s, sys: 82.7 ms, total: 18.6 s
Wall time: 18.8 s
open
high
low
close
volume
amount
status
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
1.0
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
-1.0
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
-1.0
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
1.0
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
1.0
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
-1.0
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
1.0
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
1.0
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
1.0
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
1.0
测试3:Apply循环
1 2 3 4 %%time df1['status' ] = (df1['close' ]-df['open' ]).apply(lambda x: 1 if x >=0 else -1 ) df1.head(10 )
CPU times: user 23.8 ms, sys: 1.99 ms, total: 25.8 ms
Wall time: 35.6 ms
open
high
low
close
volume
amount
status
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
1
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
-1
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
-1
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
1
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
1
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
-1
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
1
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
1
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
1
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
1
测试4:numpy内置向量化函数
1 2 3 4 5 %%time import numpy as npdf1['status' ] = np.where((df1.close - df1.open )>=0 , 1 , -1 ) df1.head(10 )
CPU times: user 1.74 ms, sys: 2.99 ms, total: 4.72 ms
Wall time: 10.4 ms
open
high
low
close
volume
amount
status
date
code
2020-11-16
000001
17.08
17.43
16.90
17.37
759856.0
1.308190e+09
1
000002
29.39
29.50
29.00
29.20
516576.0
1.509810e+09
-1
000004
31.15
31.46
30.11
30.61
72456.0
2.223127e+08
-1
000005
2.68
2.70
2.65
2.69
64372.0
1.725762e+07
1
000006
5.66
5.74
5.62
5.72
98253.0
5.592563e+07
1
000007
9.42
9.42
9.18
9.20
22567.0
2.094628e+07
-1
000008
2.72
2.74
2.70
2.73
171930.0
4.678304e+07
1
000009
7.71
7.88
7.66
7.82
320180.0
2.492149e+08
1
000010
4.19
4.27
4.10
4.24
71661.0
3.004320e+07
1
000011
13.76
14.40
13.58
14.39
105639.0
1.489735e+08
1
结论
for
循环及向量化的使用对性能有着较大的影响:
1、如果确定需要使用循环,则应始终选择apply方法。
2、否则,vectorization总是更好的,因为它更快!