
Subclassing a (Pandas) DataFrame

Introduction

If you are a Python programmer using the Pandas library as one of the core libraries in the products you create, then you should be interested in this post. I hope to make a case for subclassing a Pandas DataFrame for certain use cases that are very common in projects that make use of DataFrames as a primary data structure to pass around tabular data.

In a serious application, a typical use case is to wrap the following snippet of code in some domain specific class or a function.

    import pandas as pd

    def load_data(students_data_file_path):
        # some code here...

        # read a file with numerous columns
        df = pd.read_csv(students_data_file_path)
        return df

The dataframe returned by such a function could potentially have a large number of columns. Usually, this dataframe gets passed around to various other functions. Say we have a function that finds the mean of the scores for a given set of students.

    def calculate_mean(students_df):
        # some code
        ...
        return students_df['scores'].mean()  # 'scores' is a plain string

Notice in the above code that the term 'scores' is stringy. We use this stringy value to access the column in the dataframe. That is, we end up using this stringy value as an API into the dataframe we loaded. Now, imagine such stringy column names being used and reused across all the functions this dataframe gets passed to.

I hope you see the problem here. We are using 'stringy' values as an API. One way to fix this and make the reference more symbolic would be to create a related class which represents the columns available in the dataframe.

For example, we could have

    # student_data_def.py

    class StudentsData:
        # each right-hand value is a column name
        # that is available in the loaded file
        SCORES = 'scores'
        STUDENT_NAME = 'student_name'

Our calculate_mean function then looks like this:

    from student_data_def import StudentsData

    def calculate_mean(students_df):
        # some code
        ...
        return students_df[StudentsData.SCORES].mean()

Now, this is better. We converted our stringy API into symbolic values with better discoverability. The problem, however, is that students_df and StudentsData are decoupled. Wherever the dataframe is passed around, we have to import the StudentsData class into the modules that work on it. Ideally, these two objects should live together. We need tighter coupling between them.

Now, if you are one of the fortunate folks using Python 3.5 or above, you should be excited about Type Hints. Type Hints, if embraced carefully, can help our editors assist us as we type out our code. Linting tools like PyLint and type checkers like MyPy can also help us catch certain kinds of coding errors before we run our code. In the scenario we are dealing with, imagine that we want to annotate our calculate_mean function with its types.

Let us try that:

    def calculate_mean(students_df: pd.DataFrame) -> float:
        # some code
        ...
        return students_df[StudentsData.SCORES].mean()

Huh! pd.DataFrame. This type annotation does not give us any more information than the name of the variable does. There is no way to know which columns are available in this DataFrame. All introspection capabilities are lost for a rich data structure like the DataFrame, which may store tens of fields. A mysterious object keeps getting passed around, and we cannot understand its shape without running the program or looking at all the places where the dataframe could have been initialized.

Summarizing our frustrations

a. Stringy API to access columns in DataFrame

b. If we try to get rid of the stringy API, we end up with decoupled objects (the dataframe and the class declaring the columns)

c. Inability to provide useful type hints.

One Proposed Solution

Subclassing DataFrames

Pandas provides a way to subclass a DataFrame by overriding the _constructor property. Here is some nice stuff that we can do with this construct.

    import pandas as pd

    class StudentsDF(pd.DataFrame):
        SCORES = 'scores'
        STUDENT_NAME = 'name'

        @property
        def _constructor(self):
            return StudentsDF

    x = StudentsDF(data=dict(name=['Alice', 'Bob'], scores=[60, 50]), index=[100, 200])
    type(x)  # __main__.StudentsDF

Now, we see that type(x) is StudentsDF. Therefore, we can pass this object to our calculate_mean function.

    def calculate_mean(df: StudentsDF):
        # some code
        ...
        return df[df.SCORES].mean()   # we are able to use `df` to access both column labels and values

Notice that we have just addressed the three frustrations summarized above.

a. We pass around the Custom class whose type describes more about the object.

b. Both the data and the column definitions (formerly, stringy API) are coupled together tightly

c. Type annotations make more sense now. We do not have to use the generic pd.DataFrame any more.
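To see why overriding _constructor matters, here is a minimal, self-contained sketch. It reuses the StudentsDF class from above; the `passed` variable and the threshold of 55 are made up for the illustration. The point it tries to show is that an operation producing a new frame, such as row filtering, still hands back a StudentsDF.

```python
import pandas as pd


class StudentsDF(pd.DataFrame):
    # Symbolic names for the column labels.
    SCORES = 'scores'
    STUDENT_NAME = 'name'

    @property
    def _constructor(self):
        # pandas uses this when an operation builds a new frame,
        # so results keep the subclass type.
        return StudentsDF


def calculate_mean(df: StudentsDF) -> float:
    return float(df[df.SCORES].mean())


students = StudentsDF(
    data={'name': ['Alice', 'Bob'], 'scores': [60, 50]},
    index=[100, 200],
)

# Row filtering produces a new frame; thanks to _constructor it is
# still a StudentsDF (the threshold 55 is arbitrary, for the demo).
passed = students[students[students.SCORES] > 55]
print(type(passed).__name__)
print(calculate_mean(students))
```

Note that selecting a single column, as in df[df.SCORES], still returns a plain pd.Series, because this sketch does not override _constructor_sliced.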

Note that, using this construct, it is also possible to create a StudentsDF instance while creating a dataframe from a file.

    df = pd.read_csv('/tmp/t.csv')
    students_df = StudentsDF(df)   # just pass the original df into the StudentsDF constructor
    print(students_df.columns)  # Index(['name', 'scores'], dtype='object')
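Since the class now declares which columns it expects, the wrapping step is also a natural place to check that the file really contains them. The sketch below is one possible way to do that. The `load_students` helper and its error message are hypothetical, not part of pandas, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd


class StudentsDF(pd.DataFrame):
    SCORES = 'scores'
    STUDENT_NAME = 'name'

    @property
    def _constructor(self):
        return StudentsDF


def load_students(source) -> StudentsDF:
    """Hypothetical helper: read a CSV and wrap it, checking the expected columns."""
    df = pd.read_csv(source)
    expected = {StudentsDF.SCORES, StudentsDF.STUDENT_NAME}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f'missing columns: {sorted(missing)}')
    return StudentsDF(df)


# An in-memory buffer stands in for a real file path in this demo.
csv_text = io.StringIO('name,scores\nAlice,60\nBob,50\n')
students_df = load_students(csv_text)
print(type(students_df).__name__, list(students_df.columns))
```

A file that lacks one of the declared columns fails early with a ValueError instead of surfacing later as a KeyError deep inside some other function.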

Automating this class generation

Using a simple script like the one shown below, we can automate the creation of some of the boilerplate code. The following code builds the source of a class definition from the columns read from the file. This kind of script can be used as a one-time setup tool to create a bunch of subclasses of the DataFrame.

    import pandas as pd

    class DFClassGenerator:

        CLASS_HEADER = 'class {class_name}(pd.DataFrame):'
        # we cheat and hard-code 4 spaces of indentation here, for the demo
        COLUMNS = "    {var} = '{label}'"

        CONSTRUCTOR = ("    @property\n"
                       "    def _constructor(self):\n"
                       "        return {class_name}")

        @classmethod
        def generate_class(cls, df, class_name):
            # works for a single-level (non-hierarchical) column index
            cols = [cls.COLUMNS.format(var=c.upper(), label=c)
                    for c in df.columns]

            lines = [cls.CLASS_HEADER.format(class_name=class_name)]
            constructor = cls.CONSTRUCTOR.format(class_name=class_name)
            # return the generated source so the caller can print or save it
            return '\n'.join(lines + cols) + '\n\n' + constructor

We can invoke this function every time we want to generate a new DF type. For example, to generate our StudentsDF class we do the following.

    df = pd.read_csv(student_file)
    source_code = DFClassGenerator.generate_class(df, 'StudentsDF')
    print(source_code)

This prints:

    class StudentsDF(pd.DataFrame):
        SCORES = 'scores'
        NAME = 'name'

        @property
        def _constructor(self):
            return StudentsDF

Then, we have the class we need.
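One caveat with deriving attribute names via c.upper(): column labels containing spaces, punctuation, or a leading digit do not make valid Python attribute names. A minimal sketch of one way to handle that is shown below. The `to_attribute_name` helper and the `COL_` prefix are hypothetical choices made for this illustration.

```python
import re


def to_attribute_name(label) -> str:
    """Hypothetical helper: turn a column label into a valid upper-case attribute name."""
    # Replace every run of non-word characters with a single underscore.
    name = re.sub(r'\W+', '_', str(label)).strip('_').upper()
    # Identifiers cannot be empty or start with a digit, so add a prefix.
    if not name or name[0].isdigit():
        name = 'COL_' + name
    return name


for label in ['scores', 'student name', '2020 score', 'score (%)']:
    print(repr(label), '->', to_attribute_name(label))
```

In the generator, such a helper could replace the c.upper() call. Two different labels can still map to the same attribute name, so a real tool would also need to detect collisions.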

Of course, a lot more enhancements can be built into this generation class. Some of the enhancements that can be added are as follows:

  1. Ability to create a symbolic name for Index columns

  2. Handle hierarchical (multi-level) columns

  3. Provide default converter functions if a type-conversion dictionary is provided.

  4. Build a mapping of fields to types, and automatically create data type converter functions.
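As a rough idea of what the type-conversion enhancements could look like, the sketch below attaches a mapping of column labels to dtypes to the class and applies it while wrapping a raw dataframe. The `DTYPES` attribute and the `from_raw` classmethod are hypothetical names chosen for this example, not something pandas provides.

```python
import pandas as pd


class StudentsDF(pd.DataFrame):
    SCORES = 'scores'
    STUDENT_NAME = 'name'

    # Hypothetical mapping of column labels to dtypes (an assumption for this sketch).
    DTYPES = {SCORES: 'float64', STUDENT_NAME: 'string'}

    @property
    def _constructor(self):
        return StudentsDF

    @classmethod
    def from_raw(cls, df: pd.DataFrame) -> 'StudentsDF':
        """Convert the declared columns to their dtypes and wrap the result."""
        return cls(df.astype(cls.DTYPES))


# Scores arrive as text here, as they might from a loosely typed source.
raw = pd.DataFrame({'name': ['Alice', 'Bob'], 'scores': ['60', '50']})
students = StudentsDF.from_raw(raw)
print(students.dtypes)
```

Keeping the mapping on the class means the column names and their expected types are declared in one place, next to the data they describe.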

Summary

There are a lot of scenarios in production code where it helps to annotate and pass around a rich data structure like a DataFrame with symbolic column names bound to it. More meaningful type hints also add needed documentation. Subclassing the dataframe is one approach to help us accomplish those goals.

References

Pandas Subclassing

Example of Pandas Subclassing - GeoPandas

SO Question regarding Subclassing