0%

【Python】PyMongoArrow实践

PyMongoArrow简介

PyMongoArrow 是 PyMongo 包含用于加载 MongoDB 的工具的扩展,以 Apache Arrow 形式查询结果集、 表格、 NumPy 数组和 Pandas 或 Polars DataFrames。 PyMongoArrow 是将 MongoDB 查询结果集物化为适合内存中分析处理应用程序的连续内存类型数组的推荐方法。[^1] 1. PyMongoArrow使用手册[^2]

  1. PyMongoArrow简明教程[^3]

安装

1
pip install pymongoarrow

1
conda install --channel conda-forge pymongoarrow

使用

1
2
3
4
5
6
7
8
9
10
11
import pymongoarrow
import pymongo
import pandas as pd
import polars as pl
from pymongoarrow.monkey import patch_all

# establish connection with our MongoDB database
client = pymongo.MongoClient('localhost', 27017)
db = client['quantaxis']
collection = db['stock_day']
patch_all()

查询结果转化为Pandas数据帧

1
2
df_pandas = collection.find_pandas_all({})
df_pandas
输出:
_id open close high low vol amount date code date_stamp
0 5f0fd6243530c3ea823c573e 49.00 49.00 49.00 49.00 32768.5 5.000000e+03 1991-04-03 000001 6.706080e+08
1 5f0fd6243530c3ea823c573f 48.76 48.76 48.76 48.76 4098.0 1.500000e+04 1991-04-04 000001 6.706944e+08
2 5f0fd6243530c3ea823c5740 48.52 48.52 48.52 48.52 2.0 1.000000e+04 1991-04-05 000001 6.707808e+08
3 5f0fd6243530c3ea823c5741 48.28 48.28 48.28 48.28 7.0 3.400000e+04 1991-04-06 000001 6.708672e+08
4 5f0fd6243530c3ea823c5742 48.04 48.04 48.04 48.04 2.0 1.000000e+04 1991-04-08 000001 6.710400e+08
... ... ... ... ... ... ... ... ... ... ...
14705153 665deedfcd3d51f9534dd7ca 45.10 44.85 45.49 44.69 314630.0 1.414729e+09 2024-05-31 688981 1.717085e+09
14705154 665deedfcd3d51f9534dd7cb 44.96 45.44 46.36 44.80 393681.0 1.796427e+09 2024-06-03 688981 1.717344e+09
14705155 665deedfcd3d51f9534dd7cc 37.45 37.77 37.90 37.02 48175.0 1.810881e+08 2024-05-30 689009 1.716998e+09
14705156 665deedfcd3d51f9534dd7cd 37.69 37.51 38.35 37.31 45055.0 1.706379e+08 2024-05-31 689009 1.717085e+09
14705157 665deedfcd3d51f9534dd7ce 37.73 38.11 39.26 37.29 95049.0 3.641072e+08 2024-06-03 689009 1.717344e+09

查询结果转化为Polars数据帧

1
2
df_polars = collection.find_polars_all({})
df_polars
输出:
_id open close high low vol amount date code date_stamp
binary f64 f64 f64 f64 f64 f64 str str f64
b"_0f$50<W>" 49.0 49.0 49.0 49.0 32768.5 5000.0 "1991-04-03" "000001" 6.70608e8
b"_0f$50<W?" 48.76 48.76 48.76 48.76 4098.0 15000.0 "1991-04-04" "000001" 6.706944e8
b"_0f$50<W@" 48.52 48.52 48.52 48.52 2.0 10000.0 "1991-04-05" "000001" 6.707808e8
b"_0f$50<WA" 48.28 48.28 48.28 48.28 7.0 34000.0 "1991-04-06" "000001" 6.708672e8
b"_0f$50<WB" 48.04 48.04 48.04 48.04 2.0 10000.0 "1991-04-08" "000001" 6.7104e8
b"f]=Q9SM" 45.1 44.85 45.49 44.69 314630.0 1.4147e9 "2024-05-31" "688981" 1.7171e9
b"f]=Q9SM" 44.96 45.44 46.36 44.8 393681.0 1.7964e9 "2024-06-03" "688981" 1.7173e9
b"f]=Q9SM" 37.45 37.77 37.9 37.02 48175.0 1.81088112e8 "2024-05-30" "689009" 1.7170e9
b"f]=Q9SM" 37.69 37.51 38.35 37.31 45055.0 1.70637904e8 "2024-05-31" "689009" 1.7171e9
b"f]=Q9SM" 37.73 38.11 39.26 37.29 95049.0 3.64107232e8 "2024-06-03" "689009" 1.7173e9

查询结果转化为Numpy数组

1
2
ndarrays = collection.find_numpy_all({})
ndarrays

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{'_id': array([b'_\x0f\xd6$50\xc3\xea\x82<W>', b'_\x0f\xd6$50\xc3\xea\x82<W?',
b'_\x0f\xd6$50\xc3\xea\x82<W@', ...,
b'f]\xee\xdf\xcd=Q\xf9SM\xd7\xcc',
b'f]\xee\xdf\xcd=Q\xf9SM\xd7\xcd',
b'f]\xee\xdf\xcd=Q\xf9SM\xd7\xce'], dtype=object),
'open': array([49. , 48.76, 48.52, ..., 37.45, 37.69, 37.73]),
'close': array([49. , 48.76, 48.52, ..., 37.77, 37.51, 38.11]),
'high': array([49. , 48.76, 48.52, ..., 37.9 , 38.35, 39.26]),
'low': array([49. , 48.76, 48.52, ..., 37.02, 37.31, 37.29]),
'vol': array([3.27685e+04, 4.09800e+03, 2.00000e+00, ..., 4.81750e+04,
4.50550e+04, 9.50490e+04]),
'amount': array([5.00000000e+03, 1.50000000e+04, 1.00000000e+04, ...,
1.81088112e+08, 1.70637904e+08, 3.64107232e+08]),
'date': array(['1991-04-03', '1991-04-04', '1991-04-05', ..., '2024-05-30',
'2024-05-31', '2024-06-03'], dtype='<U10'),
'code': array(['000001', '000001', '000001', ..., '689009', '689009', '689009'],
dtype='<U6'),
'date_stamp': array([6.7060800e+08, 6.7069440e+08, 6.7078080e+08, ..., 1.7169984e+09,
1.7170848e+09, 1.7173440e+09])}

查询结果转化为Arrow

1
2
arrow_table = collection.find_arrow_all({})
arrow_table

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
_id: extension<pymongoarrow.objectid<ObjectIdType>>
open: double
close: double
high: double
low: double
vol: double
amount: double
date: string
code: string
date_stamp: double
----
_id: [[5F0FD6243530C3EA823C573E,5F0FD6243530C3EA823C573F,5F0FD6243530C3EA823C5740,5F0FD6243530C3EA823C5741,5F0FD6243530C3EA823C5742,...,665DEEDFCD3D51F9534DD7CA,665DEEDFCD3D51F9534DD7CB,665DEEDFCD3D51F9534DD7CC,665DEEDFCD3D51F9534DD7CD,665DEEDFCD3D51F9534DD7CE]]
open: [[49,48.76,48.52,48.28,48.04,...,45.1,44.96,37.45,37.69,37.73]]
close: [[49,48.76,48.52,48.28,48.04,...,44.85,45.44,37.77,37.51,38.11]]
high: [[49,48.76,48.52,48.28,48.04,...,45.49,46.36,37.9,38.35,39.26]]
low: [[49,48.76,48.52,48.28,48.04,...,44.69,44.8,37.02,37.31,37.29]]
vol: [[32768.5,4098,2,7,2,...,314630,393681,48175,45055,95049]]
amount: [[5000,15000,10000,34000,10000,...,1414729472,1796427392,181088112,170637904,364107232]]
date: [["1991-04-03","1991-04-04","1991-04-05","1991-04-06","1991-04-08",...,"2024-05-31","2024-06-03","2024-05-30","2024-05-31","2024-06-03"]]
code: [["000001","000001","000001","000001","000001",...,"688981","688981","689009","689009","689009"]]
date_stamp: [[670608000,670694400,670780800,670867200,671040000,...,1717084800,1717344000,1716998400,1717084800,1717344000]]
# 参考 [^1]: PyMongoArrow: Bridging the Gap Between MongoDB and Your Data Analysis App | MongoDB [^2]: MongoDB PyMongoArrow - PyMongoArrow v 1.3 [^3]: PyMongoArrow Extension Course | MongoDB University