blog <| code

DataFrame use Anti-Pattern

Introduction

Panda's DataFrame is a versatile data structure. It is a Swiss-army like data structure that lets you perform a variety of operations. When I am on ipython interactive console prototyping new features, I can confidently throw any weird operation on to the Panda's dataframe and more often that not, it just works. Creating a DataFrame using a variety of data structures as input, slicing and dicing to the heart's content all comes easily. But, with this kind of flexibility and power, there is a tendency to reach out to the DataFrame for all kinds of tasks in production code as well. This leads to what I keep referring to as dictionary-oriented programming.

In this style of programming, one goes about creating a DataFrame or Series. Then this object is passed around to other functions that may/or may not create copies of this data-frame/series while adding additional columns. The convenience of this approach is that the flow of data and calculation is documented in the columns of the dataframe. The logs and actual calculation are are mixed up in one data structure. It is really convenient to use this approach to quickly produce results as well as provide enough information to power users who consume this data. Here, by power users I mean users(say analysts/quants) who directly analyze the result in Excel or with any of their power tools using the output from programs that use and produce output from these dataframes.

There are 2 parts to the above discussed use of dataframes. If the data we deal with is 'tall' or 'wide' and we perform a lot of vector operations on this data-structure, then using dataframes likely is the perfect tool for the job. And in Python, that is the best possible tool you can use. But, this reasoning does not hold any more if you are dealing with small pieces of data (say 10s of rows). While dealing with that kind of data, it is desirable to move away from dataframes and just use named-tuples and or any other light weight data structure. Using dataframes, to deal with such data should be considered an anti-pattern and that is the core proposition of this post.

Dictionary-oriented programming - Anti-Pattern

As I stated earlier using dataframe as a go-to data structure leads to dictionary-oriented programming. It is easy to define columns and grow the dataframe in terms of columns during the program flow. This leads to incoherently typed programs. The dataframe is used as a kitchen-sink and all columns are thrown into it as the program organically evolves. This also makes if difficult to reason about programs. To understand the final shape of the dataframe one needs to peruse code across different function calls. Unless vector oriented operations are required, and large data needs to be processed it is sufficient to stick to simple data structures, eg. namedtuples.

In the following section I show a few statistics on how these two approaches perform. Even in cases of better performance characteristics of dataframe approach, it is still better not to reach out the dataframe for the above stated reasons.

Performance Comparison

What follows is very simple comparison of creation and access operations using namedtuples and dataframes for a small piece of data and for data that spans to 150K entries.

Creation

Creation using namedtuples

    # Creating 5 entries using a namedtuple

    In [1]: X = namedtuple('X', ('a', 'b', 'c', 'd'))

    In [2]: %timeit [X(10, 'a', 100, 20), X(10, 'a', 100, 20), X(10, 'a', 100, 20),X(10, 'a', 100, 20),X(10, 'a', 100, 20)]
    # 1.94 µs ± 8.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Creation using DataFrame

    In [3]: %timeit pd.DataFrame(data=dict(a=[10,10,10,10], b=['a', 'a', 'a', 'a'], c=[100,100,100,100], d=[20,20,20,20]))
    # 348 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # create dict upfront to eliminate the time needed to construct the dictionary

    In [4]: data = dict(a=[10,10,10,10], b=['a', 'a', 'a', 'a'], c=[100,100,100,100], d=[20,20,20,20])

    In [5]: %timeit pd.DataFrame(data=data)
    # 347 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You will notice that named-tuple creation is a lot faster. This speed-difference is surprising! even after factoring for dictionary object creation. This is due to pandas itself creating a dictionary-like object which we are creating in the namedtuple approach. But, here the point is for the size of datasets we create, that rich datastructure may not be required.

Creating 150K named-tuple objects takes the same amount of time creating a dataframe with 5 records.

    # Creating 1K named tuple objects is close to creating 5 rows in Pandas

    In [6]: %timeit [X(10, 'a', 100, 20) for _ in range(1000)]
    # 388 µs ± 2.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But creating a dictionary to hold the namedtuples takes a lot more time and here time needed for dataframe creation seems to get smaller

    In [8]: %timeit {i:v for i, v in enumerate((X(10, 'a', 100, 20) for _ in range(150000)))}
    # 78.6 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Creating a dataframe which contains 150K rows

    In [13]: %timeit pd.DataFrame(data=dict(a=list(repeat(10, 150000)), b=list(repeat('a', 150000)), c=list(repeat(100, 150000)), d=list(repeat(20,150000))))
    # 52.8 ms ± 550 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is also useful to note that while dealing with DataFrames, we are always creating a dictionary like structure even though that is not the case with namedtuple operations we are comparing against.

Access operations

    In [7]: named_dict = {i:v for i, v in enumerate([X(10, 'a', 100, 20) for _ in range(150000)])}

    In [9]: %timeit named_dict[10]
    # 26.8 ns ± 0.119 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Now, using a dataframe, creating 150K rows takes more than twice the time. Access time is still faster with Python dictionary.

    In [33]: df = pd.DataFrame(data=dict(a=list(repeat(10, 150000)), b=list(repeat('a', 150000)), c=list(repeat(100, 150000)), d=list(repeat(20,150000))))

    In [16]: %timeit df.loc[100]
    # 74.4 µs ± 416 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Vector operations

Here the performance is as expected. DataFrame approach is far better.

    In [20]: %timeit sum(n.a for n in named_dict.values())
    # 11.5 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [29]: %timeit df['a'].sum()
    # 475 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Random access

While accessing elements randomly, accessing values through the namedtuple is faster.

    In [24]: %timeit (named_dict[100].a + named_dict[200].a + named_dict[500].a)
    # 240 ns ± 1.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [23]: %timeit (df.loc[100]['a'] + d.loc[200]['a'] + d.loc[500]['a'])
    # 246 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

DataFrame is still powerful

I make the case to not use DataFrames when we need simple data objects to pass around. Once we need to manipulate and transform large datasets, slice and dice them, merge them and perform verctor operations that DataFrame is a powerful datastructure for those purposes.

Benefits

Using named-tuples are similar data structures, you get

  1. Immutability
  2. Types are well-defined
  3. Easier to refactor since it is easier to understand where and how the data is created and passed around
  4. No more 'stringy' use of api to access columns