Question:
How to process a pandas DataFrame in chunks?

Problem:

I'm working in a Jupyter Notebook, using Python and pandas for data analysis.

When I create new aggregation features, I hit a "MemoryError" because the operation exceeds my system's memory.


Before this error occurs, I'm performing operations similar to the following code:

import numpy as np
import pandas as pd

# Sum Operations
operation_fields = [column for column in data.columns if 'Operations' in column]
data['Operations'] = data[operation_fields].apply(lambda row: sum(row), axis=1)

for operation_field in operation_fields:
    data[f'Operations_{operation_field}_vs_total'] = np.where(
        data['Operations'] == 0,
        np.floor(data['Operations']),
        np.floor(data[operation_field] / data['Operations'] * 100)
    )


This results in the data having the following dimensions:

  • Size: 17,741,115 values (85 columns × 208,719 rows)
  • Columns: 85
  • Rows: 208,719

After that, I attempt to execute the following code to calculate new features based on transactions:


# Sum transactions
transaction_fields = [column for column in data.columns if 'Transactions' in column and 'SavingAccount' in column]
data['Transactions'] = data[transaction_fields].apply(lambda row: sum(row), axis=1)

for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_fields}_vs_total'] = np.where(
        data['Transactions'] == 0,
        np.floor(data['Transactions']),
        np.floor(data[transaction_fields] / data['Transactions'] * 100)
    )


However, I encounter the error message: "MemoryError: Unable to allocate 325. GiB for an array with shape (208719, 208719) and data type float64."


I'm looking for guidance on how to process this large dataset efficiently.


Options:

  • A way to process this dataset in smaller "chunks" to avoid memory errors.

  • Strategies to optimize memory usage when working with large Pandas DataFrames for aggregation features.


Solution:

Have you heard about vectorization? Row-wise apply is slow, but the real problem here is almost certainly a pandas alignment trap: in the second loop you divide data[transaction_fields] (a DataFrame, note the plural) by data['Transactions'] (a Series). Without axis=0, pandas aligns the Series' index against the DataFrame's columns, so it tries to build a result with one column per row label, roughly 208,719 × 208,719 float64 values; 208,719² × 8 bytes ≈ 325 GiB, which is exactly what the error reports. (That loop's f-string also interpolates the whole transaction_fields list instead of transaction_field, so it keeps overwriting a single oddly named column.) Vectorized, the whole thing becomes:


# Vectorized sum: computed in C, far faster and more memory-efficient
# than apply(lambda row: sum(row), axis=1)
data['Transactions'] = data[transaction_fields].sum(axis=1)

# div(..., axis=0) aligns on the row index, which avoids both the
# explicit loop and the column-alignment blow-up described above
transaction_ratios = data[transaction_fields].div(data['Transactions'], axis=0)
# Where 'Transactions' is 0 the fields all sum to 0, so 0/0 gives NaN;
# fillna(0) reproduces the zero branch of your np.where
transaction_ratios = transaction_ratios.multiply(100).fillna(0).astype(int)

for transaction_field in transaction_fields:
    data[f'Transactions_{transaction_field}_vs_total'] = transaction_ratios[transaction_field]
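Since the title also asks about chunked processing: if even the vectorized version is too big for your machine, you can build these features chunk by chunk, because each new column depends only on values within the same row. A rough sketch, assuming the data originally comes from a CSV file (the file name, the chunk size, and the float32 downcast are illustrative assumptions, not from your setup):

import pandas as pd

chunks = []
for chunk in pd.read_csv('data.csv', chunksize=50_000):  # hypothetical source file
    fields = [c for c in chunk.columns if 'Transactions' in c and 'SavingAccount' in c]
    # Optional: downcast to float32 to roughly halve the numeric columns' memory
    chunk[fields] = chunk[fields].astype('float32')
    chunk['Transactions'] = chunk[fields].sum(axis=1)
    ratios = chunk[fields].div(chunk['Transactions'], axis=0).multiply(100).fillna(0).astype(int)
    for f in fields:
        chunk[f'Transactions_{f}_vs_total'] = ratios[f]
    chunks.append(chunk)

data = pd.concat(chunks, ignore_index=True)

The final concat still has to hold the full frame, but the huge intermediate from the misaligned division never appears, and peak memory per step is bounded by the chunk size.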


This should help. Alternatively, you can explore Polars, which is rapidly gaining traction in the data ecosystem; a rough sketch of the equivalent code follows.
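This is only a sketch, not a drop-in port: the CSV file name is an assumption, and it relies on pl.sum_horizontal, which is available in recent Polars releases.

import polars as pl

df = pl.read_csv('data.csv')  # assumption: same data loaded from a CSV
transaction_fields = [c for c in df.columns if 'Transactions' in c and 'SavingAccount' in c]

df = df.with_columns(
    pl.sum_horizontal(transaction_fields).alias('Transactions')
).with_columns(
    [
        (pl.col(f) / pl.col('Transactions') * 100)
        .floor()
        .fill_nan(0)  # 0/0 rows become 0, like fillna(0) in pandas
        .alias(f'Transactions_{f}_vs_total')
        for f in transaction_fields
    ]
)

Polars evaluates the whole with_columns batch as one expression graph, so you keep the loop-free style of the pandas version with typically lower memory overhead.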





Ritu Singh