.. _pandas:

Pandas support
==============

It is convenient to use the `Pandas package`_ when dealing with numerical data, so Pint provides `PintArray`. A `PintArray` is a `Pandas Extension Array`_, which allows Pandas to recognise the Quantity and store it in Pandas DataFrames and Series.

For this to work, we rely on `Pandas Extension Types`_ which are still experimental. As a result, we are currently pinned to a specific commit, with version id ``0.24.0.dev0+625.gbdb7a16``, of Pandas.


Basic example
-------------

This example gives you the basics, but is slightly fiddly as you are not reading from a file. A more normal use case is given in `Reading a csv`_.

To use Pint with Pandas, first make sure that a compatible version of Pandas is installed (see the version note above). Then import the relevant packages and create a unit registry together with its Quantity class:

.. doctest::

   >>> import pandas as pd
   >>> import numpy as np
   >>> import pint
   >>> from pint.pandas_interface import PintArray

   >>> ureg = pint.UnitRegistry()
   >>> Q_ = ureg.Quantity

.. testsetup:: *

   import pandas as pd
   import numpy as np
   import pint
   from pint.pandas_interface import PintArray

   ureg = pint.UnitRegistry()
   Q_ = ureg.Quantity

Next, we can create a DataFrame with PintArrays as columns:

.. doctest::

   >>> torque = PintArray(Q_([1, 2, 2, 3], "lbf ft"))
   >>> angular_velocity = PintArray(Q_([1000, 2000, 2000, 3000], "rpm"))
   >>> df = pd.DataFrame({"torque": torque, "angular_velocity": angular_velocity})
   >>> print(df)
     torque angular_velocity
   0      1             1000
   1      2             2000
   2      2             2000
   3      3             3000


Operations with columns are unit-aware, so they behave as we would intuitively expect:

.. doctest::

   >>> df['power'] = df['torque'] * df['angular_velocity']
   >>> print(df)
     torque angular_velocity power
   0      1             1000  1000
   1      2             2000  4000
   2      2             2000  4000
   3      3             3000  9000


Each column can be accessed as a Pandas Series:

.. doctest::

   >>> print(df.power)
   0    1000
   1    4000
   2    4000
   3    9000
   Name: power, dtype: pint


The Series contains a PintArray:

.. doctest::

   >>> print(df.power.values)
    PintArray([1000 foot * force_pound * revolutions_per_minute,
               4000 foot * force_pound * revolutions_per_minute,
               4000 foot * force_pound * revolutions_per_minute,
               9000 foot * force_pound * revolutions_per_minute],
              dtype='pint')


The PintArray, in turn, contains a Quantity:

.. doctest::

   >>> print(df.power.values.data)
   [1000 4000 4000 9000] foot * force_pound * revolutions_per_minute


Pandas Series accessors are provided for most Quantity properties and methods, which will convert the result to a Series where possible.

.. doctest::

   >>> print(df.power.pint.dimensionality)
   [length] ** 2 * [mass] / [time] ** 3

   >>> print(df.power.pint.to("kW"))
   0    0.14198092353610375
   1      0.567923694144415
   2      0.567923694144415
   3     1.2778283118249338
   Name: power, dtype: pint
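
Because the accessors mirror Quantity properties and methods, the same check can be made on a scalar Quantity. The following is a minimal sketch using plain Pint only (no Pandas involved; the variable names are illustrative):

.. code-block:: python

   import pint

   ureg = pint.UnitRegistry()
   Q_ = ureg.Quantity

   # One row of the "power" column built by hand: torque times angular velocity.
   power = Q_(1, "lbf ft") * Q_(1000, "rpm")

   # Revolutions are dimensionless in Pint's default registry, so the product
   # has the same dimensionality as a power such as the kilowatt.
   same_dims = power.dimensionality == Q_(1, "kW").dimensionality
   print(same_dims)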


Standard pint conversions can still be performed on the underlying quantity, and will still return a quantity.

.. doctest::

   >>> print(df.power.values.data.to("kW"))
   [0.14198092 0.56792369 0.56792369 1.27782831] kilowatt
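
The converted values can be cross-checked by hand. The following is a small sketch in plain Python; the conversion factors are the standard definitions (1 lbf = 4.4482216152605 N, 1 ft = 0.3048 m, 1 rpm = 2π/60 rad/s) and are written out here as assumptions rather than read from Pint:

.. code-block:: python

   import math

   LBF_IN_NEWTON = 4.4482216152605       # pound-force in newtons
   FOOT_IN_METER = 0.3048                # international foot in meters
   RPM_IN_RAD_PER_S = 2 * math.pi / 60   # one revolution per minute in rad/s

   # Values of the "power" column, in lbf * ft * rpm.
   power_raw = [1000, 4000, 4000, 9000]

   # lbf * ft * rpm -> N * m / s (watts) -> kilowatts
   to_kilowatt = LBF_IN_NEWTON * FOOT_IN_METER * RPM_IN_RAD_PER_S / 1000
   power_kw = [p * to_kilowatt for p in power_raw]

   print([round(p, 6) for p in power_kw])

The printed values should agree with the kilowatt column above to the displayed precision.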

Reading a csv
-------------

Thanks to the DataFrame accessors, reading files that carry unit information is straightforward: the accessors convert plain columns to PintArrays and back.

Setup
~~~~~

Here we create a DataFrame and save it to a file. After that, we show how to load and read it.

We start with a DataFrame with column headers only:

.. doctest::

   >>> speed = [1000, 1100, 1200, 1200]
   >>> mech_power = [np.nan, np.nan, np.nan, np.nan]
   >>> torque = [10, 10, 10, 10]
   >>> rail_pressure = [1000, 1000000000000, 1000, 1000]
   >>> fuel_flow_rate = [10, 10, 10, 10]
   >>> fluid_power = [np.nan, np.nan, np.nan, np.nan]
   >>> df_init = pd.DataFrame({"speed": speed, "mech power": mech_power, "torque": torque, "rail pressure": rail_pressure, "fuel flow rate": fuel_flow_rate, "fluid power": fluid_power,})
   >>> print(df_init)
      speed  mech power  torque  rail pressure  fuel flow rate  fluid power
   0   1000         NaN      10           1000              10          NaN
   1   1100         NaN      10  1000000000000              10          NaN
   2   1200         NaN      10           1000              10          NaN
   3   1200         NaN      10           1000              10          NaN

Then we add a second column header level which contains the units information:

.. doctest::

   >>> units = ["rpm", "kW", "N m", "bar", "l/min", "kW"]
   >>> df_to_save = df_init.copy()
   >>> df_to_save.columns = pd.MultiIndex.from_arrays([df_init.columns, units])
   >>> print(df_to_save)
     speed mech power torque  rail pressure fuel flow rate fluid power
       rpm         kW    N m            bar          l/min          kW
   0  1000        NaN     10           1000             10         NaN
   1  1100        NaN     10  1000000000000             10         NaN
   2  1200        NaN     10           1000             10         NaN
   3  1200        NaN     10           1000             10         NaN

Now we save this to disk as a csv to give us our starting point.

.. doctest::

   >>> test_csv_name = "pandas_test.csv"
   >>> df_to_save.to_csv(test_csv_name, index=False)

Now we are in a position to read the csv we just saved. Let's start by reading the file with the units as a level in a MultiIndex column header.

.. doctest::

   >>> df = pd.read_csv(test_csv_name, header=[0,1])
   >>> print(df)
     speed mech power torque  rail pressure fuel flow rate fluid power
       rpm         kW    N m            bar          l/min          kW
   0  1000        NaN     10           1000             10         NaN
   1  1100        NaN     10  1000000000000             10         NaN
   2  1200        NaN     10           1000             10         NaN
   3  1200        NaN     10           1000             10         NaN
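
After reading, the unit strings live in the second level of the column `MultiIndex`. As a self-contained sketch of that round trip using Pandas only (an in-memory buffer stands in for the file, and the two-column frame is a cut-down, illustrative version of the one above):

.. code-block:: python

   import io

   import pandas as pd

   # Two columns, each labelled with a (name, unit) pair.
   columns = pd.MultiIndex.from_arrays([["speed", "torque"], ["rpm", "N m"]])
   df_small = pd.DataFrame([[1000, 10], [1100, 10]], columns=columns)

   # Write to an in-memory buffer instead of a file on disk.
   buffer = io.StringIO()
   df_small.to_csv(buffer, index=False)
   buffer.seek(0)

   # header=[0, 1] tells Pandas that the first two rows are both headers.
   df_back = pd.read_csv(buffer, header=[0, 1])

   names = list(df_back.columns.get_level_values(0))
   units = list(df_back.columns.get_level_values(1))
   print(names, units)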

Then use the DataFrame's `pint.quantify` method to convert the columns from `np.ndarray` objects to PintArrays, with units taken from the bottom column level.

.. doctest::

   >>> df_ = df.pint.quantify(ureg, level=-1)
   >>> print(df_)
       speed mech power torque    rail pressure fuel flow rate fluid power
   0  1000.0        nan   10.0           1000.0           10.0         nan
   1  1100.0        nan   10.0  1000000000000.0           10.0         nan
   2  1200.0        nan   10.0           1000.0           10.0         nan
   3  1200.0        nan   10.0           1000.0           10.0         nan


As before, operations between DataFrame columns are unit-aware:

.. doctest::

   >>> df_['mech power'] = df_.speed*df_.torque
   >>> df_['fluid power'] = df_['fuel flow rate'] * df_['rail pressure']
   >>> print(df_)
       speed mech power torque    rail pressure fuel flow rate       fluid power
   0  1000.0    10000.0   10.0           1000.0           10.0           10000.0
   1  1100.0    11000.0   10.0  1000000000000.0           10.0  10000000000000.0
   2  1200.0    12000.0   10.0           1000.0           10.0           10000.0
   3  1200.0    12000.0   10.0           1000.0           10.0           10000.0


The DataFrame's `pint.dequantify` method then allows us to retrieve the units information as a header row once again:

.. doctest::

   >>> print(df_.pint.dequantify())
                      speed                              mech power  \
     revolutions_per_minute meter * newton * revolutions_per_minute
   0                 1000.0                                 10000.0
   1                 1100.0                                 11000.0
   2                 1200.0                                 12000.0
   3                 1200.0                                 12000.0

             torque rail pressure fuel flow rate          fluid power
     meter * newton           bar liter / minute bar * liter / minute
   0           10.0  1.000000e+03           10.0         1.000000e+04
   1           10.0  1.000000e+12           10.0         1.000000e+13
   2           10.0  1.000000e+03           10.0         1.000000e+04
   3           10.0  1.000000e+03           10.0         1.000000e+04



This makes some powerful operations possible. For example, to change the units of individual columns:

.. doctest::

   >>> df_['fluid power'] = df_['fluid power'].pint.to("kW")
   >>> df_['mech power'] = df_['mech power'].pint.to("kW")
   >>> print(df_.pint.dequantify())
                      speed mech power         torque rail pressure  \
     revolutions_per_minute   kilowatt meter * newton           bar
   0                 1000.0   1.047198           10.0  1.000000e+03
   1                 1100.0   1.151917           10.0  1.000000e+12
   2                 1200.0   1.256637           10.0  1.000000e+03
   3                 1200.0   1.256637           10.0  1.000000e+03

     fuel flow rate   fluid power
     liter / minute      kilowatt
   0           10.0  1.666667e+01
   1           10.0  1.666667e+10
   2           10.0  1.666667e+01
   3           10.0  1.666667e+01
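
The same conversions can be reproduced on scalar Quantities, which is a convenient way to sanity-check a single row. A minimal sketch using plain Pint (the variable names are illustrative):

.. code-block:: python

   import pint

   ureg = pint.UnitRegistry()
   Q_ = ureg.Quantity

   # First row of the table: 1000 rpm at 10 N m, and 10 l/min at 1000 bar.
   mech_power = (Q_(1000, "rpm") * Q_(10, "N m")).to("kW")
   fluid_power = (Q_(10, "l/min") * Q_(1000, "bar")).to("kW")

   print(mech_power.magnitude, fluid_power.magnitude)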


It is also possible to convert the entire table at once, here to base units:

.. doctest::

   >>> print(df_.pint.to_base_units().pint.dequantify())
               speed                          mech power  \
     radian / second kilogram * meter ** 2 / second ** 3
   0      104.719755                         1047.197551
   1      115.191731                         1151.917306
   2      125.663706                         1256.637061
   3      125.663706                         1256.637061

                                  torque                  rail pressure  \
     kilogram * meter ** 2 / second ** 2 kilogram / meter / second ** 2
   0                                10.0                   1.000000e+08
   1                                10.0                   1.000000e+17
   2                                10.0                   1.000000e+08
   3                                10.0                   1.000000e+08

          fuel flow rate                         fluid power
     meter ** 3 / second kilogram * meter ** 2 / second ** 3
   0            0.000167                        1.666667e+04
   1            0.000167                        1.666667e+13
   2            0.000167                        1.666667e+04
   3            0.000167                        1.666667e+04
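
The base-unit magnitudes follow from fixed scale factors. As a hand check in plain Python (the factors are the standard definitions, stated here as assumptions rather than read from Pint):

.. code-block:: python

   import math

   RPM_TO_RAD_PER_S = 2 * math.pi / 60   # revolution/minute -> radian/second
   BAR_TO_PASCAL = 1e5                   # bar -> kilogram / meter / second ** 2
   L_PER_MIN_TO_M3_PER_S = 1e-3 / 60     # liter/minute -> meter ** 3 / second

   # First row of the table.
   speed_si = 1000 * RPM_TO_RAD_PER_S
   pressure_si = 1000 * BAR_TO_PASCAL
   flow_si = 10 * L_PER_MIN_TO_M3_PER_S

   # Fluid power is flow rate times pressure, in watts.
   fluid_power_si = flow_si * pressure_si

   print(speed_si, pressure_si, flow_si, fluid_power_si)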


Comments
--------

What follows is a short discussion about Pint's `PintArray` object.

It is first useful to distinguish between three different things:

1. A scalar value

.. doctest::

   >>> print(Q_(123,"m"))
   123 meter

2. A 1d array or list

.. doctest::

   >>> print(Q_([1, 2, 3], "m"))
   [1 2 3] meter

3. A 2d+ array or list

.. doctest::

   >>> print(Q_([[1, 2], [3, 4]], "m"))
   [[1 2] [3 4]] meter


The first, a single scalar value, is not intended to be stored in the PintArray as it is not an array, and should raise an error (TODO). The scalar Quantity is the scalar form of the PintArray, and is returned when performing operations that use `__getitem__`, e.g. indexing. A PintArray can be created from a list of scalar Quantities using `PintArray._from_sequence`.

The second, a 1d array or list, is intended to be stored in the PintArray, and is stored in the PintArray.data attribute.

The third, 2d+ arrays or lists, are beyond the capabilities of ExtensionArrays which are limited to 1d arrays, so cannot be stored in the array, and should raise an error (TODO).
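
A check along these lines could tell the three cases apart by the number of dimensions of the magnitude. This is only a sketch: `is_storable` is a hypothetical helper, not part of Pint, and it assumes NumPy is installed so that list magnitudes become arrays:

.. code-block:: python

   import numpy as np
   import pint

   ureg = pint.UnitRegistry()
   Q_ = ureg.Quantity


   def is_storable(quantity):
       """Return True only for 1d Quantities (hypothetical helper)."""
       return np.ndim(quantity.magnitude) == 1


   scalar = Q_(123, "m")
   one_d = Q_([1, 2, 3], "m")
   two_d = Q_([[1, 2], [3, 4]], "m")

   print(is_storable(scalar), is_storable(one_d), is_storable(two_d))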

Most operations on the PintArray act on the Quantity stored in `PintArray.data`, so will behave similarly to operations on a Quantity, with some caveats:

1. An operation that would return a 1d Quantity will return a PintArray containing the Quantity. This allows pandas to assign the result to a Series.
2. Arithmetic and comparison operations are limited to scalars and sequences of the same length as the stored Quantity. This ensures results are the same length as the stored Quantity, so can be added to the same DataFrame.
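
The length rule in the second caveat can be sketched as follows. `operand_is_compatible` is a hypothetical helper written for illustration; it is not how `PintArray` implements the check:

.. code-block:: python

   import numpy as np


   def operand_is_compatible(stored, other):
       """Accept scalars and 1d sequences of the same length as ``stored``."""
       if np.ndim(other) == 0:
           return True  # a scalar applies to every element
       return np.ndim(other) == 1 and len(other) == len(stored)


   stored = [1000, 4000, 4000, 9000]

   print(operand_is_compatible(stored, 2))             # scalar operand
   print(operand_is_compatible(stored, [1, 2, 3, 4]))  # same length
   print(operand_is_compatible(stored, [1, 2]))        # different length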




.. _`Pandas package`: https://pandas.pydata.org/pandas-docs/stable/index.html
.. _`Pandas Dataframes`: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
.. _`Pandas Extension Array`: https://pandas.pydata.org/pandas-docs/stable/extending.html#extensionarray
.. _`Pandas Extension Types`: https://pandas.pydata.org/pandas-docs/stable/extending.html#extension-types
.. _`Pandas README`: https://github.com/pandas-dev/pandas/blob/master/README.md