Reading

For worked examples, see GWAS, cis-eQTL, and Filtering.

genoio.Dataset

Resolved genotype dataset with metadata, whole-read, and block-read methods.

Constructed by genoio.vcf, genoio.bfile, genoio.bgen, or genoio.pfile. The object caches source metadata after the first metadata-dependent operation, but matrix reads are executed on each call.

Attributes:

  • source: resolved source format, primary path, companion files, and optional PLINK prefix.

iter_blocks(self, size: int, **read_options: object) -> Iterator[ReadResult]

Yield consecutive variant blocks from this dataset.

Each yielded block has at most size variants and follows the same return contract as genoio.Dataset.read. Blocks are consecutive chunks of retained variants, taken in source variant order after any filtering; every block holds size variants except possibly the last. BGEN dosage blocks with a concrete region filter use a same-path .bgen.bgi index when present. Haplotype blocks follow the same source-encoded representation rules as genoio.Dataset.read.

Arguments:

  • size: maximum number of variants per yielded block.
  • read_options: forwarded to genoio.Dataset.read.

Returns:

Iterator yielding matrices or matrix/metadata tuples.
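
A minimal sketch of streaming a dataset block by block, based only on the contract described above. The helper name count_retained_variants is hypothetical, and the sketch assumes the default return flags, so each block is a bare matrix with a .shape attribute rather than a tuple.

```python
def count_retained_variants(dataset, size, **read_options):
    """Stream a dataset in blocks and count blocks and retained variants.

    `dataset` is any object exposing the documented iter_blocks method.
    Leave return_samples / return_variants at their defaults: with those
    flags set, each block would be a tuple rather than a bare matrix.
    """
    n_blocks = 0
    n_variants = 0
    for block in dataset.iter_blocks(size, **read_options):
        # Matrix shape is (rows, variants); no block is wider than `size`.
        n_blocks += 1
        n_variants += block.shape[1]
    return n_blocks, n_variants
```

Only one block is held in memory at a time, which is the usual reason to prefer iter_blocks over a whole read for large sources. Any object returned by genoio.vcf, genoio.bfile, genoio.bgen, or genoio.pfile can be passed as dataset.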

iter_regions(self, regions: Iterable[_Region], **read_options: object) -> Iterator[tuple[_Region, ReadResult]]

Yield one read result per requested region filter.

Each yielded item is (region, result), where region is the original object from regions and result follows the same return contract as genoio.Dataset.read. Concrete VCF/BCF and BGEN region filters use the same indexed pushdown paths as normal reads when an index is present. Haplotype region reads follow the same source-encoded representation rules as genoio.Dataset.read.

Arguments:

  • regions: iterable of region filter expressions.
  • read_options: forwarded to genoio.Dataset.read, except variants, which is supplied by each region.

Returns:

Iterator yielding (region, matrix_or_tuple) pairs.
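
A minimal sketch of consuming the (region, result) pairs. The helper name variants_per_region is hypothetical. The region filter expressions are taken as given, since their construction is not described in this section, and the default return flags are assumed so that each result is a bare matrix.

```python
def variants_per_region(dataset, regions, **read_options):
    """Return (region, retained variant count) for each requested region.

    `regions` is an iterable of region filter expressions built by the
    caller. Do not pass `variants` in read_options: each region supplies it.
    """
    counts = []
    for region, matrix in dataset.iter_regions(regions, **read_options):
        # `region` is the original object from `regions`, so it can be used
        # directly as a label for the result.
        counts.append((region, matrix.shape[1]))
    return counts
```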

read(self, *, kind: str = 'geno', dosage: str = 'hardcall', sparse: bool | str = False, variants: FilterExpr | Iterable[str] | None = None, samples: list[str] | tuple[str, ...] | set[str] | None = None, missing: Literal['nan', 'raise', 'impute'] | None = None, dtype: DTypeLike = 'float32', return_samples: bool = False, return_variants: bool = False) -> ReadResult

Read a genotype or haplotype matrix from this dataset.

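Dense reads return a NumPy array with shape (rows, variants), where rows are samples for kind="geno" and haplotypes for kind="haplo". Sparse reads return a SciPy CSC matrix for sparse=True or sparse="csc", or a CSR matrix for sparse="csr". Set return_samples or return_variants to return Polars metadata frames along with the matrix.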

Arguments:

  • kind: Matrix row layout. "geno" returns one row per retained sample, with diploid genotype values in each cell. "haplo" returns one row per source haplotype, so diploid samples contribute two rows. Haplotype reads require phased records in the source.
  • dosage: Value source for each matrix cell. "hardcall" reads allele counts from called genotypes. "dosage" reads expected allele counts from dosage/probability fields when the source format supports them. genoio does not convert dosages into hard calls.
  • sparse: Output storage. False returns a dense NumPy array. True and "csc" return a SciPy CSC matrix; "csr" returns a SciPy CSR matrix. Sparse reads require missing="raise" because this release does not store sparse missing-value masks.
  • variants: Variants to keep. Pass a genoio filter expression to filter by metadata or genotype predicates, or pass an iterable of variant IDs to keep matching IDs. None keeps all variants. Retained columns stay in source order, not request order.
  • samples: Sample IDs to keep. Pass a list, tuple, or set of sample IDs. None keeps all samples. Retained rows stay in source order; duplicate requested IDs are rejected.
  • missing: Missing-call policy. None uses "nan" for dense reads and "raise" for sparse reads. "nan" stores missing calls as np.nan, "raise" fails if retained calls are missing, and "impute" fills missing calls with the retained variant mean.
  • dtype: NumPy dtype for returned matrix values. Missing policies that write np.nan or imputed means require a floating dtype.
  • return_samples: When True, return a sample metadata frame with the matrix. Haplotype reads include columns that map haplotype rows back to source samples.
  • return_variants: When True, return a variant metadata frame for the retained matrix columns.

Returns:

Matrix alone, or a tuple containing the matrix and requested metadata frames.
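
Because retained columns stay in source order rather than request order, a caller who passes variant IDs usually needs the returned variant frame to find each ID's column. The sketch below shows one way to do that. The helper name columns_for_ids is hypothetical, and it assumes that with only return_variants=True the result is a two-element tuple (matrix, variant_frame).

```python
def columns_for_ids(dataset, wanted_ids):
    """Read the requested variants and locate each ID's matrix column.

    Assumption: read(..., return_variants=True) returns (matrix, frame),
    with one frame row per retained matrix column, in column order.
    """
    wanted_ids = list(wanted_ids)
    matrix, variants = dataset.read(variants=wanted_ids, return_variants=True)
    # Map variant ID -> column index using the frame's "id" column.
    # If the source repeats an ID, the last matching column wins here.
    column_of = {vid: j for j, vid in enumerate(variants["id"])}
    # IDs absent from the returned frame map to None in this sketch.
    return matrix, [column_of.get(vid) for vid in wanted_ids]
```

How genoio treats requested IDs that are absent from the source is not stated here, so the sketch does not rely on either behavior.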

Raises:

  • genoio.InvalidOptionError: if read options are invalid.
  • genoio.UnsupportedRepresentation: if the requested representation is unavailable for the source.
  • genoio.InvalidSourceError: if the source cannot be decoded.
  • genoio.MissingDataError: if retained missing calls conflict with the requested missing-data policy.
  • genoio.InternalError: if the compiled backend reports an internal invariant failure.
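
The three missing-call policies can be illustrated on a single variant column. This is a pure-Python model of the documented behavior, not genoio's implementation. The function name is hypothetical, None stands in for a missing call, ValueError stands in for genoio.MissingDataError and genoio.InvalidOptionError, and "retained variant mean" is read as the mean of the non-missing calls among retained samples for that variant.

```python
import math


def apply_missing_policy(column, policy="nan"):
    """Apply a missing-call policy to one variant's allele counts.

    `column` is a list of allele counts in which None marks a missing call.
    """
    called = [v for v in column if v is not None]
    if policy == "raise":
        if len(called) != len(column):
            raise ValueError("retained calls are missing")
        return [float(v) for v in column]
    if policy == "nan":
        fill = math.nan
    elif policy == "impute":
        # Assumption of this sketch: a variant with no called genotypes
        # has no mean to impute, so it stays NaN.
        fill = sum(called) / len(called) if called else math.nan
    else:
        raise ValueError(f"unknown missing policy: {policy!r}")
    return [fill if v is None else float(v) for v in column]
```

For example, apply_missing_policy([0, None, 2, 1], "impute") returns [0.0, 1.0, 2.0, 1.0], because the three called values have mean 1.0. Both "nan" and "impute" produce non-integer values, which is why those policies require a floating dtype.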

samples(self, **options: object) -> pl.DataFrame

Return sample metadata as a Polars DataFrame.

Columns are fid, iid, father, mother, sex, and phenotype. Rows are ordered as they appear in the source. Haplotype reads that return sample metadata add source_sample_index and haplotype_index columns to map haplotype rows back to source samples.

Returns:

Polars DataFrame with source sample metadata in source order.
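
The source_sample_index column makes it possible to group haplotype rows by the sample they came from. The sketch below sums haplotype rows per sample on plain Python lists. The function name is hypothetical. It also assumes that each haplotype cell holds a per-haplotype a1 count, so that summing a sample's rows gives a per-sample count. That assumption is not stated in this section.

```python
def haplotypes_to_genotypes(hap_rows, source_sample_index):
    """Sum haplotype rows that share a source sample.

    `hap_rows` is a sequence of rows (one per haplotype, one value per
    variant); `source_sample_index` gives the source sample of each row,
    as in the metadata column of the same name.
    """
    totals = {}
    for row, sample in zip(hap_rows, source_sample_index):
        if sample in totals:
            totals[sample] = [a + b for a, b in zip(totals[sample], row)]
        else:
            totals[sample] = list(row)
    # Keys appear in order of first occurrence, so source order is preserved.
    return totals
```

With a real read, hap_rows could be the matrix converted with .tolist() and source_sample_index the corresponding column of the returned sample frame.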

variants(self, *, stats: object = None, **options: object) -> pl.DataFrame

Return variant metadata as a Polars DataFrame.

Columns are chrom, pos, id, a0, and a1. Rows are ordered as they appear in the source; variant frames returned by matrix reads are ordered to match matrix columns after filtering. The a1 allele is the allele counted by returned genotype values.

The stats argument is reserved for future metadata-stat controls. Passing it currently raises genoio.InvalidOptionError.

Arguments:

  • stats: reserved; must be None.

Returns:

Polars DataFrame with source variant metadata in source order.

genoio.vcf(path: str | Path) -> Dataset

Resolve a VCF/BCF file and return a reusable dataset.

Arguments:

  • path: .vcf, .vcf.gz, or .bcf path.

Returns:

genoio.Dataset backed by the VCF/BCF source.

Raises:

  • genoio.SourceResolutionError: if the path cannot be used as VCF/BCF.

genoio.bfile(path: str | Path) -> Dataset

Resolve a PLINK1 BED/BIM/FAM file set and return a reusable dataset.

path may be the shared prefix or one .bed, .bim, or .fam member.

Arguments:

  • path: PLINK1 prefix or member path.

Returns:

genoio.Dataset backed by the PLINK1 source.

genoio.pfile(path: str | Path) -> Dataset

Resolve a PLINK2 PGEN/PVAR/PSAM file set and return a reusable dataset.

path may be the shared prefix or one .pgen, .pvar, .pvar.zst, or .psam member. If both .pvar and .pvar.zst exist for a prefix, uncompressed .pvar is preferred.

Arguments:

  • path: PLINK2 prefix or member path.

Returns:

genoio.Dataset backed by the PLINK2 source.
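
The preference between .pvar and .pvar.zst described above can be modelled with pathlib. This is an illustrative sketch of the documented rule, not genoio's resolver, and the function name choose_pvar is hypothetical.

```python
from pathlib import Path


def choose_pvar(prefix):
    """Pick the variant file for a PLINK2 prefix.

    Uncompressed .pvar is preferred when both files exist. Returns None
    when neither is present.
    """
    plain = Path(f"{prefix}.pvar")
    compressed = Path(f"{prefix}.pvar.zst")
    if plain.exists():
        return plain
    if compressed.exists():
        return compressed
    return None
```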

genoio.bgen(path: str | Path) -> Dataset

Resolve a BGEN source and return a reusable dataset.

path may be the shared prefix or the .bgen member. If a same-prefix .sample file exists, it is recorded as an optional companion. Concrete region filters look for a same-path bgenix SQLite index beside the BGEN member, for example cohort.bgen.bgi.

Arguments:

  • path: BGEN prefix or .bgen member path.

Returns:

genoio.Dataset backed by the BGEN source.
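
The companion file names described above can be derived with pathlib. This sketch models only the naming convention: it does not check that any file exists, and it is not genoio's resolver. The function name bgen_paths is hypothetical.

```python
from pathlib import Path


def bgen_paths(path):
    """Derive BGEN member, optional .sample, and index paths.

    `path` may be the shared prefix or the .bgen member itself.
    """
    p = Path(path)
    bgen = p if p.suffix == ".bgen" else Path(f"{p}.bgen")
    prefix = bgen.with_suffix("")
    return {
        "bgen": bgen,
        # Optional companion with the same prefix.
        "sample": Path(f"{prefix}.sample"),
        # bgenix index sits beside the member: cohort.bgen -> cohort.bgen.bgi
        "index": Path(f"{bgen}.bgi"),
    }
```

Both bgen_paths("cohort") and bgen_paths("cohort.bgen") give cohort.bgen, cohort.sample, and cohort.bgen.bgi.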