Question:
How to iterate and read one row at a time from multiple parquet files?

1. You can use RecordBatch.to_pylist to convert each batch into a list of rows (one dict per row), then use yield to hand the rows out one at a time as an iterator.

import pyarrow.parquet as pq

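def file_iterator(file_name, batch_size):
    # Opening the file reads its metadata; row data is read batch by batch.
    parquet_file = pq.ParquetFile(file_name)
    # Each batch holds at most batch_size rows, which keeps memory use bounded.
    for record_batch in parquet_file.iter_batches(batch_size=batch_size):
        # to_pylist() returns one dict per row, keyed by column name.
        for d in record_batch.to_pylist():
            yield d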

for row in file_iterator("file.parquet", 100):
    print(row)

Answer by: 0x26res
Credit: Stack Overflow

To read multiple .parquet files from multiple directories into a single pandas DataFrame, follow these steps:

1. Import the required libraries
2. Create a list of directories containing .parquet files
3. Loop through the list of directories and read the .parquet files into separate dataframes
4. Concatenate the dataframes into a single dataframe

Ritu Singh