polars-avro project

polars_avro module

Polars IO plugin for reading and writing Apache Avro files.

Provides scan_avro, read_avro, write_avro, and AvroWriter. Some polars types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum) must be cast before writing. When reading, the utf8_view option controls how UUIDs and nullable strings are decoded — see scan_avro for details.

exception polars_avro.AvroError

Bases: Exception

exception polars_avro.AvroSpecError

Bases: ValueError

class polars_avro.AvroWriter(dest: str | Path | BinaryIO, *, schema: Schema | None = None, codec: Codec = Codec.Null, storage_options: Mapping[str, str] | None = None, credential_provider: CredentialProviderInput = 'auto')

Bases: object

Incrementally write DataFrames to an Avro file.

AvroWriter can be used as a context manager; doing so is required when writing cloud files.

Some polars types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum) must be cast before writing — see the README for workarounds.

close() None
write(batch: DataFrame) None
class polars_avro.Codec

Bases: object

Bzip2 = Codec.Bzip2
Deflate = Codec.Deflate
Null = Codec.Null
Snappy = Codec.Snappy
Xz = Codec.Xz
Zstandard = Codec.Zstandard
exception polars_avro.EmptySources

Bases: ValueError

polars_avro.read_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, columns: Sequence[int | str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, rechunk: bool = False, batch_size: int = 32768, glob: bool = True, strict: bool = False, utf8_view: bool = False, storage_options: Mapping[str, str] | None = None, credential_provider: Callable[[], tuple[dict[str, str], int | None]] | CredentialProvider | Literal['auto'] | None = 'auto') DataFrame

Read an Avro file into a DataFrame.

Parameters:
  • sources – The source(s) to scan.

  • columns – The columns to select.

  • n_rows – The number of rows to read.

  • row_index_name – The name of the row index column, or None to not add one.

  • row_index_offset – The offset to start the row index at.

  • rechunk – Whether to rechunk the DataFrame after reading.

  • batch_size – How many rows to attempt to read at a time.

  • glob – Whether to use globbing to find files.

  • strict – Whether to use strict mode when parsing Avro. Incurs a performance hit.

  • utf8_view – Whether to read strings as views. When False (default), UUIDs are read as binary and nullable strings preserve nulls. When True, UUIDs are read as formatted strings and nulls in nullable strings are replaced with "" (lossy). Since polars tends to work with string views internally, True is likely faster.

  • storage_options – Extra configuration passed to the cloud storage backend (same keys accepted by Polars, e.g. aws_region).

  • credential_provider – Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.

polars_avro.scan_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, batch_size: int = 1024, glob: bool = True, strict: bool = False, utf8_view: bool = False, storage_options: Mapping[str, str] | None = None, credential_provider: Callable[[], tuple[dict[str, str], int | None]] | CredentialProvider | Literal['auto'] | None = 'auto') LazyFrame

Scan Avro files.

Parameters:
  • sources – The source(s) to scan.

  • batch_size – How many rows to attempt to read at a time.

  • glob – Whether to use globbing to find files.

  • strict – Whether to use strict mode when parsing Avro. Incurs a performance hit.

  • utf8_view – Whether to read strings as views. When False (default), UUIDs are read as binary and nullable strings preserve nulls. When True, UUIDs are read as formatted strings and nulls in nullable strings are replaced with "" (lossy). Since polars tends to work with string views internally, True is likely faster.

  • storage_options – Extra configuration passed to the cloud storage backend (same keys accepted by Polars, e.g. aws_region).

  • credential_provider – Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.

polars_avro.write_avro(batches: DataFrame | Iterable[DataFrame], dest: str | Path | BinaryIO, *, schema: Schema | None = None, codec: Codec = Codec.Null, storage_options: Mapping[str, str] | None = None, credential_provider: CredentialProviderInput = 'auto') None

Write a DataFrame or iterable of DataFrames to an Avro file.

Some polars types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum) must be cast before writing — see the README for workarounds.

Parameters:
  • batches – A DataFrame or iterable of DataFrames to write.

  • dest – The file path, cloud URL, or writable binary buffer to write to.

  • schema – The schema to use. If None, inferred from the first batch.

  • codec – The compression codec to use.

  • storage_options – Extra configuration passed to the cloud storage backend (same keys accepted by Polars, e.g. aws_region).

  • credential_provider – Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.
