polars-avro project¶
polars_avro module¶
Polars IO plugin for reading and writing Apache Avro files.
Provides scan_avro, read_avro, write_avro, and AvroWriter. Some polars
types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum)
must be cast before writing. When reading, the utf8_view option controls
how UUIDs and nullable strings are decoded; see scan_avro for details.
- exception polars_avro.AvroError¶
Bases:
Exception
- exception polars_avro.AvroSpecError¶
Bases:
ValueError
- class polars_avro.AvroWriter(dest: str | Path | BinaryIO, *, schema: Schema | None = None, codec: Codec = Codec.Null, storage_options: Mapping[str, str] | None = None, credential_provider: CredentialProviderInput = 'auto')¶
Bases:
object

Incrementally write DataFrames to an Avro file.
The writer is a context manager; using it as one is required when writing to cloud files.
Some polars types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum) must be cast before writing; see the README for workarounds.
- close() None¶
- write(batch: DataFrame) None¶
- class polars_avro.Codec¶
Bases:
object

- Bzip2 = Codec.Bzip2¶
- Deflate = Codec.Deflate¶
- Null = Codec.Null¶
- Snappy = Codec.Snappy¶
- Xz = Codec.Xz¶
- Zstandard = Codec.Zstandard¶
- exception polars_avro.EmptySources¶
Bases:
ValueError
- polars_avro.read_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, columns: Sequence[int | str] | None = None, n_rows: int | None = None, row_index_name: str | None = None, row_index_offset: int = 0, rechunk: bool = False, batch_size: int = 32768, glob: bool = True, strict: bool = False, utf8_view: bool = False, storage_options: Mapping[str, str] | None = None, credential_provider: Callable[[], tuple[dict[str, str], int | None]] | CredentialProvider | Literal['auto'] | None = 'auto') DataFrame¶
Read an Avro file into a DataFrame.
- Parameters:
sources (The source(s) to scan.)
columns (The columns to select.)
n_rows (The number of rows to read.)
row_index_name (The name of the row index column, or None to not add one.)
row_index_offset (The offset to start the row index at.)
rechunk (Whether to rechunk the DataFrame after reading.)
batch_size (How many rows to attempt to read at a time.)
glob (Whether to use globbing to find files.)
strict (Whether to use strict mode when parsing avro. Incurs a performance hit.)
utf8_view (Whether to read strings as views. When False (default), UUIDs are read as binary and nullable strings preserve nulls. When True, UUIDs are read as formatted strings and nulls in nullable strings are replaced with "" (lossy). Since polars tends to work with string views internally, True is likely faster.)
storage_options (Extra configuration passed to the cloud storage backend; same keys accepted by Polars, e.g. aws_region.)
credential_provider (Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.)
- polars_avro.scan_avro(sources: Sequence[str | Path] | Sequence[BinaryIO] | str | Path | BinaryIO, *, batch_size: int = 1024, glob: bool = True, strict: bool = False, utf8_view: bool = False, storage_options: Mapping[str, str] | None = None, credential_provider: Callable[[], tuple[dict[str, str], int | None]] | CredentialProvider | Literal['auto'] | None = 'auto') LazyFrame¶
Scan Avro files.
- Parameters:
sources (The source(s) to scan.)
batch_size (How many rows to attempt to read at a time.)
glob (Whether to use globbing to find files.)
strict (Whether to use strict mode when parsing avro. Incurs a performance hit.)
utf8_view (Whether to read strings as views. When False (default), UUIDs are read as binary and nullable strings preserve nulls. When True, UUIDs are read as formatted strings and nulls in nullable strings are replaced with "" (lossy). Since polars tends to work with string views internally, True is likely faster.)
storage_options (Extra configuration passed to the cloud storage backend; same keys accepted by Polars, e.g. aws_region.)
credential_provider (Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.)
- polars_avro.write_avro(batches: DataFrame | Iterable[DataFrame], dest: str | Path | BinaryIO, *, schema: Schema | None = None, codec: Codec = Codec.Null, storage_options: Mapping[str, str] | None = None, credential_provider: CredentialProviderInput = 'auto') None¶
Write a DataFrame or iterable of DataFrames to an Avro file.
Some polars types (Int8, Int16, UInt8, UInt16, UInt32, UInt64, Time, Categorical, Enum) must be cast before writing; see the README for workarounds.
- Parameters:
batches (A DataFrame or iterable of DataFrames to write.)
dest (The file path, cloud URL, or writable binary buffer to write to.)
schema (The schema to use. If None, inferred from the first batch.)
codec (The compression codec to use.)
storage_options (Extra configuration passed to the cloud storage backend; same keys accepted by Polars, e.g. aws_region.)
credential_provider (Credential provider for cloud storage. Set to "auto" (default) to use automatic credential detection, or None to disable.)