Question:
How to process a pandas DataFrame in chunks?

Problem:

I'm working in a Jupyter Notebook, using Python and pandas for data analysis.

When I create new aggregation features, I hit a "MemoryError" because the operation exceeds my system's memory.


Before this error occurs, I'm performing operations similar to the following code:

import numpy as np
import pandas as pd

# Sum Operations
operation_fields = [column for column in data.columns if 'Operations' in column]
data['Operations'] = data[operation_fields].apply(lambda row: sum(row), axis=1)

for operation_field in operation_fields:
    data[f'Operations_{operation_field}_vs_total'] = np.where(
        data['Operations'] == 0,
        np.floor(data['Operations']),
        np.floor(data[operation_field] / data['Operations'] * 100)
    )


This results in the data having the following dimensions:

  • Size: 17,741,115 values (85 columns × 208,719 rows)
  • Columns: 85
  • Rows: 208,719

After that, I attempt to execute the following code to calculate new features based on transactions:


# Sum transactions
transaction_fields = [column for column in data.columns if 'Transactions' in column and 'SavingAccount' in column]
data['Transactions'] = data[transaction_fields].apply(lambda row: sum(row), axis=1)

for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_fields}_vs_total'] = np.where(
        data['Transactions'] == 0,
        np.floor(data['Transactions']),
        np.floor(data[transaction_fields] / data['Transactions'] * 100)
    )


However, I encounter the error message: "MemoryError: Unable to allocate 325. GiB for an array with shape (208719, 208719) and data type float64."


I'm looking for guidance on how to process this large dataset efficiently.


Options:

  • A way to process this dataset in smaller "chunks" to avoid memory errors.

  • Strategies to optimize memory usage when working with large Pandas DataFrames for aggregation features.


Solution:

Have you heard about vectorization? Row-wise apply is slow, but the real problem here is almost certainly a pandas alignment trap: in the second loop you divide data[transaction_fields] (a DataFrame, note the plural) by data['Transactions'] (a Series). Without axis=0, pandas aligns the Series' index against the DataFrame's columns, so it tries to build a result with one column per row label, roughly 208,719 × 208,719 float64 values; 208,719² × 8 bytes ≈ 325 GiB, which is exactly what the error reports. (That loop's f-string also interpolates the whole transaction_fields list instead of transaction_field, so it keeps overwriting a single oddly named column.) Vectorized, the whole thing becomes:


# Vectorized sum: computed in C, far faster and more memory-efficient
# than apply(lambda row: sum(row), axis=1)
data['Transactions'] = data[transaction_fields].sum(axis=1)

# div(..., axis=0) aligns on the row index, which avoids both the
# explicit loop and the column-alignment blow-up described above
transaction_ratios = data[transaction_fields].div(data['Transactions'], axis=0)
# Where 'Transactions' is 0 the fields all sum to 0, so 0/0 gives NaN;
# fillna(0) reproduces the zero branch of your np.where
transaction_ratios = transaction_ratios.multiply(100).fillna(0).astype(int)

for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_field}_vs_total'] = transaction_ratios[transaction_field]
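Since the title also asks about chunked processing: if even the vectorized version is too big for your machine, you can build these features chunk by chunk, because each new column depends only on values within the same row. A rough sketch, assuming the data originally comes from a CSV file (the file name, the chunk size, and the float32 downcast are illustrative assumptions, not from your setup):

import pandas as pd

chunks = []
for chunk in pd.read_csv('data.csv', chunksize=50_000):  # hypothetical source file
    fields = [c for c in chunk.columns if 'Transactions' in c and 'SavingAccount' in c]
    # Optional: downcast to float32 to roughly halve the numeric columns' memory
    chunk[fields] = chunk[fields].astype('float32')
    chunk['Transactions'] = chunk[fields].sum(axis=1)
    ratios = chunk[fields].div(chunk['Transactions'], axis=0).multiply(100).fillna(0).astype(int)
    for f in fields:
        chunk[f'Transactions_{f}_vs_total'] = ratios[f]
    chunks.append(chunk)

data = pd.concat(chunks, ignore_index=True)

The final concat still has to hold the full frame, but the huge intermediate from the misaligned division never appears, and peak memory per step is bounded by the chunk size.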


This should help. Alternatively, you can explore Polars, which is rapidly gaining traction in the data ecosystem; a rough sketch of the equivalent code follows.
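This is only a sketch, not a drop-in port: the CSV file name is an assumption, and it relies on pl.sum_horizontal, which is available in recent Polars releases.

import polars as pl

df = pl.read_csv('data.csv')  # assumption: same data loaded from a CSV
transaction_fields = [c for c in df.columns if 'Transactions' in c and 'SavingAccount' in c]

df = df.with_columns(
    pl.sum_horizontal(transaction_fields).alias('Transactions')
).with_columns(
    [
        (pl.col(f) / pl.col('Transactions') * 100)
        .floor()
        .fill_nan(0)  # 0/0 rows become 0, like fillna(0) in pandas
        .alias(f'Transactions_{f}_vs_total')
        for f in transaction_fields
    ]
)

Polars evaluates the whole with_columns batch as one expression graph, so you keep the loop-free style of the pandas version with typically lower memory overhead.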





Ritu Singh