polars-avro project

polars_avro module

Scan and sink Avro files using Polars.

This module provides three functions, written as an io-plugin: scan_avro, read_avro, and write_avro. Sink is currently not supported for io-plugins.

Not all types can be converted between Polars and Avro, and some conversions are only possible with precision loss. In general this library will attempt to read and write everything as long as there is no data loss, but that can mean promoting types to a more general representation for serialization / deserialization.
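
A minimal round trip looks like the following sketch (the file name and column contents are illustrative):

    import polars as pl
    from polars_avro import read_avro, scan_avro, write_avro

    df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # write the frame to an Avro file, then read it back eagerly or lazily
    write_avro(df, "example.avro")
    eager = read_avro("example.avro")
    lazy = scan_avro("example.avro").collect()
    assert eager.equals(df) and lazy.equals(df)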

exception polars_avro.AvroError

Bases: Exception

exception polars_avro.AvroSpecError

Bases: ValueError

class polars_avro.Codec

Bases: object

Bzip2 = Codec.Bzip2
Deflate = Codec.Deflate
Null = Codec.Null
Snappy = Codec.Snappy
Xz = Codec.Xz
Zstandard = Codec.Zstandard
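
A codec variant is passed to write_avro (documented below) to select block compression. For instance, a sketch with an illustrative file name, where compression_level applies when the chosen codec supports it:

    import polars as pl
    from polars_avro import Codec, write_avro

    df = pl.DataFrame({"x": [1, 2, 3]})
    write_avro(df, "compressed.avro", codec=Codec.Zstandard, compression_level=3)
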
exception polars_avro.EmptySources

Bases: ValueError

polars_avro.read_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, columns: Sequence[int | str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, rechunk: bool = False, batch_size: int = 32768, glob: bool = True, single_col_name: str | None = None) DataFrame

Read an Avro file into a DataFrame.

Parameters:
  • sources – The source(s) to scan.

  • columns – The columns to select.

  • n_rows – The number of rows to read.

  • row_index_name – The name of the row index column, or None to not add one.

  • row_index_offset – The offset to start the row index at.

  • rechunk – Whether to rechunk the DataFrame after reading.

  • batch_size – How many rows to attempt to read at a time.

  • glob – Whether to use globbing to find files.

  • storage_options – Additional options for cloud operations.

  • credential_provider – The credential provider to use for cloud operations. Defaults to “auto”, which uses the default credential provider.

  • retries – The number of times to retry cloud operations.

  • file_cache_ttl – The time to live for cached cloud files.
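
A sketch of the keyword arguments (the path and column names are illustrative):

    from polars_avro import read_avro

    # read at most 1_000 rows of two columns and prepend a row index starting at 1
    df = read_avro(
        "example.avro",
        columns=["a", "b"],
        n_rows=1_000,
        row_index_name="idx",
        row_index_offset=1,
        rechunk=True,
    )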

polars_avro.scan_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, batch_size: int = 32768, glob: bool = True, single_col_name: str | None = None) LazyFrame

Scan Avro files.

Parameters:
  • sources – The source(s) to scan.

  • batch_size – How many rows to attempt to read at a time.

  • glob – Whether to use globbing to find files.

  • storage_options – Additional options for cloud operations.
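
For example, scanning several files lazily (a sketch; the glob pattern and column names are illustrative):

    import polars as pl
    from polars_avro import scan_avro

    # matching files are combined into one LazyFrame; nothing is read until collect()
    lf = scan_avro("data/*.avro", glob=True, batch_size=65_536)
    result = lf.filter(pl.col("a") > 0).select("a", "b").collect()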

polars_avro.write_avro(frame: DataFrame, dest: str | Path | BinaryIO, *, batch_size: int | None = None, codec: Codec = Codec.Null, promote_ints: bool = True, promote_array: bool = True, truncate_time: bool = False, compression_level: int | None = None, storage_options: Mapping[str, str] | None = None) None

Write a DataFrame to an Avro file.

Parameters:
  • frame – The DataFrame to write.

  • dest – Where to write the DataFrame.

  • codec – The codec to use for compression, or Null for uncompressed.

  • promote_ints – Whether to promote integer columns to a larger type in order to support writing them.

  • promote_array – Whether to promote array columns to lists in order to support writing them to Avro.

  • truncate_time – Whether to truncate time columns so they can be written to Avro, as Avro doesn’t support time with nanosecond precision.

  • retries – The number of times to retry cloud operations.

  • credential_provider – The credential provider to use for cloud operations. Defaults to “auto”, which uses the default credential provider.

  • storage_options – Additional options for cloud operations.
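
For example, writing to an in-memory buffer with compression and integer promotion (a sketch; it assumes Int8 is one of the types that needs promotion, since Avro’s smallest integer type is a 32-bit int):

    import io
    import polars as pl
    from polars_avro import Codec, read_avro, write_avro

    df = pl.DataFrame({"small": [1, 2, 3]}, schema={"small": pl.Int8})

    buf = io.BytesIO()
    # promote_ints=True lets narrow integers be widened so they can be serialized
    write_avro(df, buf, codec=Codec.Deflate, promote_ints=True)

    buf.seek(0)
    round_tripped = read_avro(buf)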
