Reading
For worked examples, see GWAS, cis-eQTL, and Filtering.
genoio.Dataset
Resolved genotype dataset with metadata, whole-read, and block-read methods.
Constructed by genoio.vcf, genoio.bfile, genoio.bgen, or
genoio.pfile. The object caches source metadata after the first
metadata-dependent operation, but matrix reads are executed on each call.
Attributes:
source: resolved source format, primary path, companion files, and optional PLINK prefix.
iter_blocks(self, size: int, **read_options: object) -> Iterator[ReadResult]
Yield consecutive variant blocks from this dataset.
Each yielded block has at most size variants and follows the same
return contract as genoio.Dataset.read. Blocks are fixed-width chunks of
retained variants, taken in source variant order after any
filtering. BGEN dosage blocks with a concrete region filter use a
same-path .bgen.bgi index when present. Haplotype blocks follow the
same source-encoded representation rules as genoio.Dataset.read.
Arguments:
size: maximum number of variants per yielded block.
read_options: forwarded to genoio.Dataset.read.
Returns:
Iterator yielding matrices or matrix/metadata tuples.
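The chunking contract described above can be pictured with a small standalone sketch. It does not call genoio: `iter_variant_chunks` is a hypothetical helper that splits an already-filtered list of variant IDs into consecutive chunks of at most `size`, and the `ValueError` for non-positive sizes is an assumption of the sketch, not documented library behavior.

```python
from typing import Iterator, Sequence


def iter_variant_chunks(variant_ids: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` IDs, preserving input order."""
    if size < 1:
        # Assumption of this sketch only; genoio's own validation is not shown here.
        raise ValueError("size must be a positive integer")
    for start in range(0, len(variant_ids), size):
        yield list(variant_ids[start:start + size])


# Five retained variants in source order, read two at a time.
chunks = list(iter_variant_chunks(["rs1", "rs2", "rs3", "rs4", "rs5"], 2))
# The last chunk is shorter because only one variant remains.
```

Here `chunks` holds three lists, and the final one contains the single leftover ID, which matches the "at most size variants" wording above.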
iter_regions(self, regions: Iterable[_Region], **read_options: object) -> Iterator[tuple[_Region, ReadResult]]
Yield one read result per requested region filter.
Each yielded item is (region, result), where region is the original
object from regions and result follows the same return contract as
genoio.Dataset.read. Concrete VCF/BCF and BGEN region filters use
the same indexed pushdown paths as normal reads when an index is
present. Haplotype region reads follow the same source-encoded
representation rules as genoio.Dataset.read.
Arguments:
regions: iterable of region filter expressions.
read_options: forwarded to genoio.Dataset.read, except variants, which is supplied by each region.
Returns:
Iterator yielding (region, matrix_or_tuple) pairs.
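One way to read this contract is as a loop over `read` calls in which each region is passed as the `variants` filter. The sketch below is an illustration under that assumption, not the library's implementation: `StubDataset` and `iter_regions_sketch` are hypothetical names, and the region strings are placeholders rather than real genoio filter expressions.

```python
from typing import Any, Iterable, Iterator


class StubDataset:
    """Hypothetical stand-in that echoes the options passed to read()."""

    def read(self, **options: Any) -> dict[str, Any]:
        # A real dataset would return a matrix; the stub returns its options.
        return dict(options)


def iter_regions_sketch(
    dataset: Any, regions: Iterable[Any], **read_options: Any
) -> Iterator[tuple[Any, Any]]:
    """Yield (region, result) pairs, one read per region."""
    for region in regions:
        # Each region becomes the variants filter for exactly one read call.
        yield region, dataset.read(variants=region, **read_options)


pairs = list(
    iter_regions_sketch(StubDataset(), ["chr1:1-1000", "chr2:5-50"], dtype="float64")
)
```

Each element of `pairs` carries the original region object first, so results can be matched back to the request that produced them.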
read(self, *, kind: str = 'geno', dosage: str = 'hardcall', sparse: bool | str = False, variants: FilterExpr | Iterable[str] | None = None, samples: list[str] | tuple[str, ...] | set[str] | None = None, missing: Literal['nan', 'raise', 'impute'] | None = None, dtype: DTypeLike = 'float32', return_samples: bool = False, return_variants: bool = False) -> ReadResult
Read a genotype or haplotype matrix from this dataset.
Dense reads return a NumPy array with shape (samples, variants).
Sparse reads return SciPy CSC by default or CSR when sparse="csr".
Set return_samples or return_variants to return Polars metadata
frames with the matrix.
Arguments:
kind: Matrix row layout. "geno" returns one row per retained sample, with diploid genotype values in each cell. "haplo" returns one row per source haplotype, so diploid samples contribute two rows. Haplotype reads require phased records in the source.
dosage: Value source for each matrix cell. "hardcall" reads allele counts from called genotypes. "dosage" reads expected allele counts from dosage/probability fields when the source format supports them. genoio does not convert dosages into hard calls.
sparse: Output storage. False returns a dense NumPy array. True and "csc" return a SciPy CSC matrix; "csr" returns a SciPy CSR matrix. Sparse reads require missing="raise" because this release does not store sparse missing-value masks.
variants: Variants to keep. Pass a genoio filter expression to filter by metadata or genotype predicates, or pass an iterable of variant IDs to keep matching IDs. None keeps all variants. Retained columns stay in source order, not request order.
samples: Sample IDs to keep. Pass a list, tuple, or set of sample IDs. None keeps all samples. Retained rows stay in source order; duplicate requested IDs are rejected.
missing: Missing-call policy. None uses "nan" for dense reads and "raise" for sparse reads. "nan" stores missing calls as np.nan, "raise" fails if retained calls are missing, and "impute" fills missing calls with the retained variant mean.
dtype: NumPy dtype for returned matrix values. Missing policies that write np.nan or imputed means require a floating dtype.
return_samples: When True, return a sample metadata frame with the matrix. Haplotype reads include columns that map haplotype rows back to source samples.
return_variants: When True, return a variant metadata frame for the retained matrix columns.
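The "source order, not request order" rule for variant ID lists can be illustrated without the library. In this minimal sketch, `retained_variant_ids` is a hypothetical helper and the IDs are made up; it only models the ordering rule, not genoio's actual filtering code.

```python
from typing import Iterable, Sequence


def retained_variant_ids(source_ids: Sequence[str], requested_ids: Iterable[str]) -> list[str]:
    """Keep requested IDs, ordered as they appear in the source."""
    wanted = set(requested_ids)
    return [variant_id for variant_id in source_ids if variant_id in wanted]


# The request lists rs4 before rs2, but the source lists rs2 first.
kept = retained_variant_ids(["rs1", "rs2", "rs3", "rs4"], ["rs4", "rs2"])
```

Because columns follow the source, code that needs request order has to reorder them afterwards, for example by using the frame returned with return_variants=True.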
Returns:
Matrix alone, or a tuple containing the matrix and requested metadata frames.
Raises:
genoio.InvalidOptionError: if read options are invalid.
genoio.UnsupportedRepresentation: if the requested representation is unavailable for the source.
genoio.InvalidSourceError: if the source cannot be decoded.
genoio.MissingDataError: if retained missing calls conflict with the requested missing-data policy.
genoio.InternalError: if the compiled backend reports an internal invariant failure.
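As a rough picture of the "impute" policy, the sketch below fills each missing call with the mean of the non-missing calls in the same variant column. It uses plain nested lists in (samples, variants) layout instead of NumPy arrays, `impute_variant_mean` is a hypothetical helper, and leaving an all-missing column untouched is an assumption of the sketch; how genoio handles that case is not stated here.

```python
import math


def impute_variant_mean(matrix: list[list[float]]) -> list[list[float]]:
    """Return a copy with NaN cells replaced by their column (variant) mean."""
    filled = [list(row) for row in matrix]
    n_variants = len(matrix[0]) if matrix else 0
    for j in range(n_variants):
        observed = [row[j] for row in matrix if not math.isnan(row[j])]
        if not observed:
            # Assumption: a column with no observed calls is left as NaN.
            continue
        mean = sum(observed) / len(observed)
        for row in filled:
            if math.isnan(row[j]):
                row[j] = mean
    return filled


# Three samples (rows) by two variants (columns), with one missing call each.
calls = [[0.0, 2.0], [math.nan, 1.0], [2.0, math.nan]]
imputed = impute_variant_mean(calls)
```

The imputed values are means, so they are generally not integers, which is consistent with the requirement that this policy use a floating dtype.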
samples(self, **options: object) -> pl.DataFrame
Return sample metadata as a Polars DataFrame.
Columns are fid, iid, father, mother, sex, and phenotype.
Rows are ordered as they appear in the source. Haplotype reads that
return sample metadata add source_sample_index and haplotype_index
columns to map haplotype rows back to source samples.
Returns:
Polars DataFrame with source sample metadata in source order.
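The two haplotype mapping columns can be used to group matrix rows by source sample. The sketch below works on plain lists standing in for the source_sample_index and haplotype_index columns; `rows_by_source_sample` is a hypothetical helper, and the example row layout is illustrative only, since the actual row order of haplotype reads is not specified here.

```python
from collections import defaultdict
from typing import Sequence


def rows_by_source_sample(
    source_sample_index: Sequence[int], haplotype_index: Sequence[int]
) -> dict[int, dict[int, int]]:
    """Map each source sample to {haplotype index: matrix row number}."""
    grouped: dict[int, dict[int, int]] = defaultdict(dict)
    for row, (sample, haplotype) in enumerate(zip(source_sample_index, haplotype_index)):
        grouped[sample][haplotype] = row
    return dict(grouped)


# Illustrative values for two diploid samples (four haplotype rows).
row_map = rows_by_source_sample([0, 0, 1, 1], [0, 1, 0, 1])
```

Looking rows up through the mapping, rather than assuming a fixed interleaving, keeps downstream code independent of how haplotype rows happen to be ordered.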
variants(self, *, stats: object = None, **options: object) -> pl.DataFrame
Return variant metadata as a Polars DataFrame.
Columns are chrom, pos, id, a0, and a1. Rows are ordered as
they appear in the source; variant frames returned by matrix reads are
ordered to match matrix columns after filtering. The a1 allele is the
allele counted by returned genotype values.
The stats argument is reserved for future metadata-stat controls.
Passing it currently raises genoio.InvalidOptionError.
Arguments:
stats: reserved; must be None.
Returns:
Polars DataFrame with source variant metadata in source order.
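To make "the allele counted by returned genotype values" concrete for hard calls, the following standalone sketch counts copies of a1 in a single called genotype. The tuple-of-alleles representation and the `count_a1` helper are hypothetical and are not part of genoio.

```python
from typing import Sequence


def count_a1(call: Sequence[str], a1: str) -> int:
    """Count how many alleles in one called genotype equal a1."""
    return sum(1 for allele in call if allele == a1)


# A variant with a0 = "A" and a1 = "G".
hom_a0 = count_a1(("A", "A"), a1="G")
het = count_a1(("A", "G"), a1="G")
hom_a1 = count_a1(("G", "G"), a1="G")
```

For a diploid call the count is 0, 1, or 2, so swapping a0 and a1 would flip which homozygote is coded as 2.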
genoio.vcf(path: str | Path) -> Dataset
Resolve a VCF/BCF file and return a reusable dataset.
Arguments:
path: .vcf, .vcf.gz, or .bcf path.
Returns:
genoio.Dataset backed by the VCF/BCF source.
Raises:
genoio.SourceResolutionError: if the path cannot be used as VCF/BCF.
genoio.bfile(path: str | Path) -> Dataset
Resolve a PLINK1 BED/BIM/FAM file set and return a reusable dataset.
path may be the shared prefix or one .bed, .bim, or .fam member.
Arguments:
path: PLINK1 prefix or member path.
Returns:
genoio.Dataset backed by the PLINK1 source.
genoio.pfile(path: str | Path) -> Dataset
Resolve a PLINK2 PGEN/PVAR/PSAM file set and return a reusable dataset.
path may be the shared prefix or one .pgen, .pvar, .pvar.zst, or
.psam member. If both .pvar and .pvar.zst exist for a prefix,
uncompressed .pvar is preferred.
Arguments:
path: PLINK2 prefix or member path.
Returns:
genoio.Dataset backed by the PLINK2 source.
genoio.bgen(path: str | Path) -> Dataset
Resolve a BGEN source and return a reusable dataset.
path may be the shared prefix or the .bgen member. If a same-prefix
.sample file exists, it is recorded as an optional companion. Concrete
region filters look for a same-path bgenix SQLite index beside the BGEN
member, for example cohort.bgen.bgi.
Arguments:
path: BGEN prefix or .bgen member path.
Returns:
genoio.Dataset backed by the BGEN source.