polars-fastavro project

polars_fastavro module

Polars io-plugin backed by fastavro.

This plugin allows reading and scanning avro files into polars DataFrames, and writing them back, using the fastavro library.

Usage

from polars_fastavro import scan_avro, read_avro, write_avro

frame = scan_avro(...).collect()  # or `read_avro(...)`
write_avro(frame, dest)

Limitations

  1. Because it uses python types as an intermediary, it's slow: roughly 30x slower on reads and 80x slower on writes.

  2. Since this is ultimately converting between avro and arrow, it has no support for avro maps, unions (other than with null), or names for certain types.

  3. Every type is treated as nullable.

  4. Additionally, some types could in theory be supported but aren't for technical reasons. These include fixed, decimal, uuid, time, and duration.

  5. Timestamp support is limited. local-timestamp-*s are treated as Datetime without time zone info, while timestamp-*s are treated as UTC Datetime. Writing Datetimes with nanosecond precision is also not supported.

  6. This can’t read cloud files, as that functionality isn’t exposed in python to my knowledge.

polars_fastavro.read_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, columns: Sequence[int | str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, rechunk: bool = False, convert_logical_types: bool = False, batch_size: int = 32768, glob: bool = True, single_col_name: str | None = None) → DataFrame

Read an avro file.

Parameters:
  • sources (The source(s) to read.)

  • columns (Columns to select from the read sources.)

  • n_rows (The maximum number of rows to read.)

  • row_index_name (If not None, the name to assign as a row index.)

  • row_index_offset (The value at which to start the row index.)

  • rechunk (Whether to rechunk the frame so it's contiguous.)

  • convert_logical_types (If true, logical types that can't be parsed, but are backed by physical types that can, will be parsed as those physical types instead.)

  • batch_size (How many rows to read at a time.)

  • glob (Whether to interpret glob patterns in files.)

  • single_col_name (If not None and the avro schema isn't a record, wrap values in a record with a single field called single_col_name.)

polars_fastavro.scan_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, convert_logical_types: bool = False, batch_size: int = 32768, glob: bool = True, single_col_name: str | None = None) → LazyFrame

Scan Avro files.

Parameters:
  • sources (The source(s) to scan.)

  • convert_logical_types (If true, logical types that can't be parsed, but are backed by physical types that can, will be parsed as those physical types instead.)

  • batch_size (How many rows to attempt to read at a time.)

  • glob (If true, expand path sources with glob patterns.)

  • single_col_name (If not None and the avro schema isn't a record, wrap values in a record with a single field called single_col_name.)

polars_fastavro.write_avro(frame: DataFrame, dest: str | Path | BinaryIO, *, batch_size: int | None = None, promote_ints: bool = True, promote_array: bool = True, codec: Literal['null', 'deflate', 'snappy'] = 'null') → None

Write a DataFrame as an avro file.

Parameters:
  • frame (The DataFrame to write.)

  • dest (Where to write the frame to.)

  • batch_size (How many rows to write at a time.)

  • promote_ints (Whether to promote ints to a larger size that avro supports.)

  • promote_array (Whether to write Arrays as Lists.)

  • codec (Compression codec to use when writing dest.)
