.. _pandas: Pandas support ============== It is convenient to use the `Pandas package`_ when dealing with numerical data, so Pint provides `PintArray`. A `PintArray` is a `Pandas Extension Array`_, which allows Pandas to recognise the Quantity and store it in Pandas DataFrames and Series. For this to work, we rely on `Pandas Extension Types`_ which are still experimental. As a result, we are currently pinned to a specific commit, with version id ``0.24.0.dev0+625.gbdb7a16``, of Pandas. Basic example ------------- This example gives you the basics, but is slightly fiddly as you are not reading from a file. A more normal use case is given in `Reading a csv`_. To use Pint with Pandas, as stated above, firstly ensure that you have the latest version of Pandas installed. Then import the relevant packages and create an instance of a Pint Quantity: .. doctest:: >>> import pandas as pd >>> import numpy as np >>> import pint >>> from pint.pandas_interface import PintArray >>> ureg = pint.UnitRegistry() >>> Q_ = ureg.Quantity .. testsetup:: * import pandas as pd import numpy as np import pint from pint.pandas_interface import PintArray ureg = pint.UnitRegistry() Q_ = ureg.Quantity Next, we can create a DataFrame with PintArray's as columns .. doctest:: >>> torque = PintArray(Q_([1, 2, 2, 3], "lbf ft")) >>> angular_velocity = PintArray(Q_([1000, 2000, 2000, 3000], "rpm")) >>> df = pd.DataFrame({"torque": torque, "angular_velocity": angular_velocity}) >>> print(df) torque angular_velocity 0 1 1000 1 2 2000 2 2 2000 3 3 3000 Operations with columns are units aware so behave as we would intuitively expect .. doctest:: >>> df['power'] = df['torque'] * df['angular_velocity'] >>> print(df) torque angular_velocity power 0 1 1000 1000 1 2 2000 4000 2 2 2000 4000 3 3 3000 9000 Each column can be accessed as a Pandas Series .. doctest:: >>> print(df.power) 0 1000 1 4000 2 4000 3 9000 Name: power, dtype: pint Which contains a PintArray .. doctest:: >>> print(df.power.values) PintArray([1000 foot * force_pound * revolutions_per_minute, 4000 foot * force_pound * revolutions_per_minute, 4000 foot * force_pound * revolutions_per_minute, 9000 foot * force_pound * revolutions_per_minute], dtype='pint') Which contains a Quantity .. doctest:: >>> print(df.power.values.data) [1000 4000 4000 9000] foot * force_pound * revolutions_per_minute Pandas Series accessors are provided for most Quantity properties and methods, which will convert the result to a Series where possible. .. doctest:: >>> print(df.power.pint.dimensionality) [length] ** 2 * [mass] / [time] ** 3 >>> print(df.power.pint.to("kW")) 0 0.14198092353610375 1 0.567923694144415 2 0.567923694144415 3 1.2778283118249338 Name: power, dtype: pint Standard pint conversions can still be performed on the underlying quantity, and will still return a quantity. .. doctest:: >>> print(df.power.values.data.to("kW")) [0.14198092 0.56792369 0.56792369 1.27782831] kilowatt Reading a csv ------------- Thanks to the DataFrame accessors, reading from files with unit information becomes trivial. The DataFrame accessors make it easy to get to PintArrays. Setup ~~~~~ Here we create the DateFrame and save it to file, next we will show you how to load and read it. We start with an DateFrame with column headers only. .. doctest:: >>> speed = [1000, 1100, 1200, 1200] >>> mech_power = [np.nan, np.nan, np.nan, np.nan] >>> torque = [10, 10, 10, 10] >>> rail_pressure = [1000, 1000000000000, 1000, 1000] >>> fuel_flow_rate = [10, 10, 10, 10] >>> fluid_power = [np.nan, np.nan, np.nan, np.nan] >>> df_init = pd.DataFrame({"speed": speed, "mech power": mech_power, "torque": torque, "rail pressure": rail_pressure, "fuel flow rate": fuel_flow_rate, "fluid power": fluid_power,}) >>> print(df_init) speed mech power torque rail pressure fuel flow rate fluid power 0 1000 NaN 10 1000 10 NaN 1 1100 NaN 10 1000000000000 10 NaN 2 1200 NaN 10 1000 10 NaN 3 1200 NaN 10 1000 10 NaN Then we add a column header which contains units information .. doctest:: >>> units = ["rpm", "kW", "N m", "bar", "l/min", "kW"] >>> df_to_save = df_init.copy() >>> df_to_save.columns = pd.MultiIndex.from_arrays([df_init.columns, units]) >>> print(df_to_save) speed mech power torque rail pressure fuel flow rate fluid power rpm kW N m bar l/min kW 0 1000 NaN 10 1000 10 NaN 1 1100 NaN 10 1000000000000 10 NaN 2 1200 NaN 10 1000 10 NaN 3 1200 NaN 10 1000 10 NaN Now we save this to disk as a csv to give us our starting point. .. doctest:: >>> test_csv_name = "pandas_test.csv" >>> df_to_save.to_csv(test_csv_name, index=False) Now we are in a position to read the csv we just saved. Let's start by reading the file with units as a level in a multiindex column. .. doctest:: >>> df = pd.read_csv(test_csv_name, header=[0,1]) >>> print(df) speed mech power torque rail pressure fuel flow rate fluid power rpm kW N m bar l/min kW 0 1000 NaN 10 1000 10 NaN 1 1100 NaN 10 1000000000000 10 NaN 2 1200 NaN 10 1000 10 NaN 3 1200 NaN 10 1000 10 NaN Then use the DataFrame's `pint.quantify` method to convert the columns from `np.ndarray`s to PintArrays, with units from the bottom column level. .. doctest:: >>> df_ = df.pint.quantify(ureg, level=-1) >>> print(df_) speed mech power torque rail pressure fuel flow rate fluid power 0 1000.0 nan 10.0 1000.0 10.0 nan 1 1100.0 nan 10.0 1000000000000.0 10.0 nan 2 1200.0 nan 10.0 1000.0 10.0 nan 3 1200.0 nan 10.0 1000.0 10.0 nan As previously, operations between DataFrame columns are unit aware .. doctest:: >>> df_['mech power'] = df_.speed*df_.torque >>> df_['fluid power'] = df_['fuel flow rate'] * df_['rail pressure'] >>> print(df_) speed mech power torque rail pressure fuel flow rate fluid power 0 1000.0 10000.0 10.0 1000.0 10.0 10000.0 1 1100.0 11000.0 10.0 1000000000000.0 10.0 10000000000000.0 2 1200.0 12000.0 10.0 1000.0 10.0 10000.0 3 1200.0 12000.0 10.0 1000.0 10.0 10000.0 The DataFrame's `pint.dequantify` method then allows us to retrieve the units information as a header row once again .. doctest:: >>> print(df_.pint.dequantify()) speed mech power \ revolutions_per_minute meter * newton * revolutions_per_minute 0 1000.0 10000.0 1 1100.0 11000.0 2 1200.0 12000.0 3 1200.0 12000.0 torque rail pressure fuel flow rate fluid power meter * newton bar liter / minute bar * liter / minute 0 10.0 1.000000e+03 10.0 1.000000e+04 1 10.0 1.000000e+12 10.0 1.000000e+13 2 10.0 1.000000e+03 10.0 1.000000e+04 3 10.0 1.000000e+03 10.0 1.000000e+04 This allows for some rather powerful abilities. For example, to change single column units .. doctest:: >>> df_['fluid power'] = df_['fluid power'].pint.to("kW") >>> df_['mech power'] = df_['mech power'].pint.to("kW") >>> print(df_.pint.dequantify()) speed mech power torque rail pressure \ revolutions_per_minute kilowatt meter * newton bar 0 1000.0 1.047198 10.0 1.000000e+03 1 1100.0 1.151917 10.0 1.000000e+12 2 1200.0 1.256637 10.0 1.000000e+03 3 1200.0 1.256637 10.0 1.000000e+03 fuel flow rate fluid power liter / minute kilowatt 0 10.0 1.666667e+01 1 10.0 1.666667e+10 2 10.0 1.666667e+01 3 10.0 1.666667e+01 or the entire table's units .. doctest:: >>> print(df_.pint.to_base_units().pint.dequantify()) speed mech power \ radian / second kilogram * meter ** 2 / second ** 3 0 104.719755 1047.197551 1 115.191731 1151.917306 2 125.663706 1256.637061 3 125.663706 1256.637061 torque rail pressure \ kilogram * meter ** 2 / second ** 2 kilogram / meter / second ** 2 0 10.0 1.000000e+08 1 10.0 1.000000e+17 2 10.0 1.000000e+08 3 10.0 1.000000e+08 fuel flow rate fluid power meter ** 3 / second kilogram * meter ** 2 / second ** 3 0 0.000167 1.666667e+04 1 0.000167 1.666667e+13 2 0.000167 1.666667e+04 3 0.000167 1.666667e+04 Comments -------- What follows is a short discussion about Pint's `PintArray` Object. It is first useful to distinguish between three different things: 1. A scalar value .. doctest:: >>> print(Q_(123,"m")) 123 meter 2. A scalar value .. doctest:: >>> print(Q_([1, 2, 3], "m")) [1 2 3] meter 3. A scalar value .. doctest:: >>> print(Q_([[1, 2], [3, 4]], "m")) [[1 2] [3 4]] meter The first, a single scalar value is not intended to be stored in the PintArray as it's not an array, and should raise an error (TODO). The scalar Quantity is the scalar form of the PintArray, and is returned when performing operations that use `get_item`, eg indexing. A PintArray can be created from a list of scalar Quantitys using `PintArray._from_sequence`. The second, a 1d array or list, is intended to be stored in the PintArray, and is stored in the PintArray.data attribute. The third, 2d+ arrays or lists, are beyond the capabilities of ExtensionArrays which are limited to 1d arrays, so cannot be stored in the array, and should raise an error (TODO). Most operations on the PintArray act on the Quantity stored in `PintArray.data`, so will behave similiarly to operations on a Quantity, with some caveats: 1. An operation that would return a 1d Quantity will return a PintArray containing the Quantity. This allows pandas to assign the result to a Series. 2. Arithemetic and comparative operations are limited to scalars and sequences of the same length as the stored Quantity. This ensures results are the same length as the stored Quantity, so can be added to the same DataFrame. .. _`Pandas package`: https://pandas.pydata.org/pandas-docs/stable/index.html .. _`Pandas Dataframes`: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html .. _`Pandas Extension Array`: https://pandas.pydata.org/pandas-docs/stable/extending.html#extensionarray .. _`Pandas Extension Types`: https://pandas.pydata.org/pandas-docs/stable/extending.html#extension-types .. _`Pandas README`: https://github.com/pandas-dev/pandas/blob/master/README.md