Filtering

Use filters to keep the variants you want before they become a matrix. Python builds the expression; Rust evaluates it while reading records.

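import genoio

# Two metadata predicates (region, QUAL) combined with one genotype predicate (MAF).
rare_high_quality = (
    genoio.region("22:20000000-21000000")
    & genoio.qual(min=20)
    & genoio.maf(max=0.05)
)

ds = genoio.vcf("data/chr22_hg38.vcf.gz")
samples = ds.samples()
# load_phenotype_vector and association_scan stand in for your own analysis code.
y = load_phenotype_vector(samples["iid"])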

for X, variants in ds.iter_blocks(
    5_000,
    variants=rare_high_quality,
    return_variants=True,
):
    association_scan(X, y, variants=variants)

Filters compose with Python operators:

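genoio.chrom("22") & genoio.snp()                        # AND: SNPs on chromosome 22
genoio.maf(max=0.01) | genoio.id_in(["rs123", "rs456"])  # OR: rare variants, plus the listed IDs
~genoio.missing_rate(max=0.1)                            # NOT: missing rate above 0.1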

Cheap filters and genotype filters

Some predicates use metadata already stored on the variant record: chromosome, position, ID, REF/ALT structure, and QUAL. These can remove records before genotypes are decoded.

Other predicates need the genotype values. maf(...), mac(...), missing_rate(...), and polymorphic() run after candidate genotypes have been read.

Put cheap filters first when you can

region("22:20000000-21000000") & qual(min=20) & maf(max=0.05) lets the reader narrow the candidate set before computing MAF. You don't have to order expressions perfectly, but filters based on position, ID, allele structure, and quality are the ones that can avoid genotype work.
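To make the benefit concrete, here is a minimal plain-Python sketch of two-stage filtering. It does not use genoio and does not describe its Rust internals: `Record`, `CountingDecoder`, and `two_stage_filter` are hypothetical names for illustration, and genotypes are assumed to be diploid alt-allele counts with no missing calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for a variant record (not a genoio type)."""
    chrom: str
    pos: int
    qual: float
    encoded: List[int]  # pretend these genotypes are still encoded on disk


class CountingDecoder:
    """Stands in for the expensive genotype decode and counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def decode(self, record: Record) -> List[int]:
        self.calls += 1
        return list(record.encoded)


def minor_allele_frequency(genotypes: List[int]) -> float:
    """MAF, assuming diploid alt-allele counts (0, 1, 2) and no missing calls."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def two_stage_filter(
    records: Iterable[Record],
    metadata_ok: Callable[[Record], bool],
    genotypes_ok: Callable[[List[int]], bool],
    decoder: CountingDecoder,
) -> Iterator[Tuple[Record, List[int]]]:
    for record in records:
        if not metadata_ok(record):
            continue  # rejected on metadata alone; genotypes are never decoded
        genotypes = decoder.decode(record)
        if genotypes_ok(genotypes):
            yield record, genotypes


records = [
    Record("22", 20_500_000, 35.0, [1] + [0] * 19),       # passes both stages
    Record("22", 20_600_000, 10.0, [1] + [0] * 19),       # fails QUAL
    Record("21", 20_500_000, 50.0, [1] + [0] * 19),       # fails region
    Record("22", 20_700_000, 40.0, [1] * 10 + [0] * 10),  # fails MAF
]

decoder = CountingDecoder()
kept = list(
    two_stage_filter(
        records,
        metadata_ok=lambda r: (
            r.chrom == "22" and 20_000_000 <= r.pos <= 21_000_000 and r.qual >= 20
        ),
        genotypes_ok=lambda g: minor_allele_frequency(g) <= 0.05,
        decoder=decoder,
    )
)
print(decoder.calls, [record.pos for record, _ in kept])
```

In this toy data set only two of the four records reach the decode step, because the other two are rejected on chromosome, position, or QUAL alone.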


Region filters

Use a concrete region string with 1-based inclusive coordinates:

region = genoio.region("22:20000000-21000000")
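As an illustration of what 1-based inclusive coordinates mean, the sketch below parses a region string and tests positions against it. genoio parses region strings itself; `Region`, `parse_region`, `contains`, and `to_zero_based_half_open` are hypothetical helpers written for this example, not part of its API.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse 'CHROM:START-END' into a Region with 1-based inclusive bounds."""
    # Split on the last colon so contig names that contain ':' still work.
    chrom, colon, span = text.rpartition(":")
    start_text, dash, end_text = span.partition("-")
    if not colon or not chrom or not dash:
        raise ValueError(f"expected CHROM:START-END, got {text!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based inclusive interval in {text!r}")
    return Region(chrom, start, end)


def contains(region: Region, chrom: str, pos: int) -> bool:
    """True if a 1-based position falls inside the region; both ends count."""
    return chrom == region.chrom and region.start <= pos <= region.end


def to_zero_based_half_open(region: Region) -> Tuple[str, int, int]:
    """Convert to the 0-based, half-open convention used by BED files."""
    return region.chrom, region.start - 1, region.end


region = parse_region("22:20000000-21000000")
print(contains(region, "22", 20_000_000))  # first base of the interval
print(contains(region, "22", 21_000_000))  # last base of the interval
print(contains(region, "22", 21_000_001))  # one past the end
print(to_zero_based_half_open(region))
```

Both boundary positions are inside the interval, so the region above spans 1,000,001 bases rather than 1,000,000.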

Indexed regions

Compressed VCF/BCF region reads require a .tbi or .csi index. genoio rejects unindexed compressed region reads instead of silently scanning the whole file.

BGEN dosage reads use a bgenix SQLite index stored next to the data file when one exists. For cohort.bgen, the expected path is cohort.bgen.bgi. Without that index, a BGEN region read falls back to a sequential scan.

bgen_ds = genoio.bgen("cohort.bgen")
region = genoio.region("22:20000000-21000000")

for X, variants in bgen_ds.iter_blocks(
    1_000,
    dosage="dosage",
    variants=region,
    return_variants=True,
):
    analyze_region(X, variants)
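Because a missing index either raises (compressed VCF/BCF) or quietly costs a full scan (BGEN), it can be worth checking for the sidecar file before starting a long job. The helper below is a hedged pre-flight sketch using only the standard library. `find_sidecar_index` is a hypothetical name, and the lookup simply appends the suffixes named above to the data file name; it is not a description of how genoio locates indexes.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def find_sidecar_index(data_path: Union[str, Path]) -> Optional[Path]:
    """Return the first existing sidecar index next to data_path, or None.

    Assumes the index sits beside the data file with the suffix appended:
    .bgi for BGEN, .tbi or .csi for compressed VCF/BCF.
    """
    path = Path(data_path)
    suffixes = (".bgi",) if path.name.endswith(".bgen") else (".tbi", ".csi")
    for suffix in suffixes:
        candidate = path.with_name(path.name + suffix)
        if candidate.exists():
            return candidate
    return None


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bgen = Path(tmp) / "cohort.bgen"
    bgen.touch()
    bgen_index_before = find_sidecar_index(bgen)  # no index yet

    (Path(tmp) / "cohort.bgen.bgi").touch()
    found = find_sidecar_index(bgen)
    bgen_index_name = found.name if found else None

    vcf = Path(tmp) / "chr22_hg38.vcf.gz"
    vcf.touch()
    (Path(tmp) / "chr22_hg38.vcf.gz.csi").touch()
    found = find_sidecar_index(vcf)
    vcf_index_name = found.name if found else None

print(bgen_index_before, bgen_index_name, vcf_index_name)
```

A caller could log a warning when the BGEN lookup returns None, since the region read would still succeed but scan the file sequentially.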

Common predicates

genoio.chrom("22")
genoio.region("22:20000000-21000000")
genoio.snp()
genoio.biallelic()
genoio.qual(min=20)
genoio.maf(max=0.05)
genoio.mac(min=10)
genoio.missing_rate(max=0.1)
genoio.polymorphic()
genoio.id_in(["rs123", "rs456"])

Threshold predicates are inclusive. maf(max=0.05) keeps variants with minor allele frequency less than or equal to 0.05.
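The following sketch shows the quantities behind the genotype-based predicates and what an inclusive bound means at the boundary. It is plain Python, not genoio code. It assumes diploid alt-allele counts (0, 1, 2) with None for missing calls, and it assumes frequencies are computed over called genotypes only; genoio's exact handling of missing data may differ.

```python
from typing import Dict, List, Optional


def genotype_stats(genotypes: List[Optional[int]]) -> Dict[str, float]:
    """Compute MAC, MAF, and missing rate for one variant.

    Assumes diploid alt-allele counts with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    total_alleles = 2 * len(called)
    alt_count = sum(called)
    # The minor allele is whichever of REF/ALT is rarer among called alleles.
    mac = min(alt_count, total_alleles - alt_count)
    n_missing = len(genotypes) - len(called)
    return {
        "mac": mac,
        "maf": mac / total_alleles if total_alleles else 0.0,
        "missing_rate": n_missing / len(genotypes) if genotypes else 0.0,
    }


# Ten called samples with a single heterozygote, plus two missing calls.
genotypes = [1] + [0] * 9 + [None, None]
stats = genotype_stats(genotypes)

keeps_inclusive = stats["maf"] <= 0.05  # inclusive bound keeps the variant
keeps_exclusive = stats["maf"] < 0.05   # a strict bound would drop it
print(stats, keeps_inclusive, keeps_exclusive)
```

Here one minor allele out of 20 called alleles gives a MAF of exactly 0.05, so the variant sits on the threshold and is kept under an inclusive comparison.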

When a filter matches no variants

A filter that matches no variants is still valid. Whole reads return a matrix of shape (n_samples, 0), the returned variant metadata has the standard columns with zero rows, and block reads yield no blocks.
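Downstream code should therefore tolerate a loop body that never runs. The sketch below mimics block iteration over a list of selected variant indices in plain Python; `iter_index_blocks` and `scan` are hypothetical helpers, not genoio functions.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_index_blocks(selected: Sequence[int], block_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of selected indices; yields nothing if empty."""
    for start in range(0, len(selected), block_size):
        yield list(selected[start:start + block_size])


def scan(selected: Sequence[int], block_size: int) -> Tuple[int, int]:
    """Count blocks and variants without assuming at least one block exists."""
    n_blocks = 0
    n_variants = 0
    for block in iter_index_blocks(selected, block_size):
        n_blocks += 1
        n_variants += len(block)
    return n_blocks, n_variants


empty_result = scan([], 5_000)
small_result = scan(list(range(12)), 5)
print(empty_result, small_result)
```

Accumulators initialized before the loop keep working when no block arrives. Steps that need at least one block, such as concatenating per-block arrays, should be guarded by a check on the block count.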

Advanced: filter IR

Filter expressions are converted to a small intermediate representation before they cross into Rust:

expr = genoio.chrom("22") & genoio.maf(max=0.05)
expr.to_ir()

The IR records predicate names, validated parameters, and boolean structure. It does not contain Python callbacks. That keeps filtering portable across whole reads, block reads, and Rust reader implementations.
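To illustrate the idea, here is a hypothetical sketch of how Python operators can build an expression tree that serializes to plain data. The class names, the dictionary layout, and the validation rules are assumptions made for this example; they are not genoio's actual IR, which should be inspected with to_ir().

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple


class Expr:
    """Base class: Python operators build tree nodes instead of evaluating."""

    def __and__(self, other: "Expr") -> "Expr":
        return BoolOp("and", (self, other))

    def __or__(self, other: "Expr") -> "Expr":
        return BoolOp("or", (self, other))

    def __invert__(self) -> "Expr":
        return BoolOp("not", (self,))

    def to_ir(self) -> Dict[str, Any]:
        raise NotImplementedError


@dataclass(frozen=True)
class Predicate(Expr):
    name: str
    params: Tuple[Tuple[str, Any], ...]  # validated (key, value) pairs

    def to_ir(self) -> Dict[str, Any]:
        return {"op": self.name, "params": dict(self.params)}


@dataclass(frozen=True)
class BoolOp(Expr):
    op: str
    args: Tuple[Expr, ...]

    def to_ir(self) -> Dict[str, Any]:
        return {"op": self.op, "args": [arg.to_ir() for arg in self.args]}


def chrom(name: str) -> Predicate:
    if not isinstance(name, str) or not name:
        raise ValueError("chromosome name must be a non-empty string")
    return Predicate("chrom", (("name", name),))


def maf(*, min: Any = None, max: Any = None) -> Predicate:
    # Validate at construction time so only checked values reach the IR.
    for bound in (min, max):
        if bound is not None and not 0.0 <= bound <= 0.5:
            raise ValueError("MAF bounds must lie in [0, 0.5]")
    pairs = (("min", min), ("max", max))
    return Predicate("maf", tuple((k, v) for k, v in pairs if v is not None))


expr = chrom("22") & maf(max=0.05)
ir = expr.to_ir()
print(json.dumps(ir, indent=2))
```

Because the tree holds only names, numbers, and strings, it serializes to JSON without loss, which is the property that lets a representation like this cross a language boundary.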

Rust uses the IR to separate cheap metadata decisions from data-dependent genotype decisions. Metadata predicates can be evaluated before matrix construction, and concrete VCF/BCF or BGEN region predicates can use an index when one is available. Before reading, Rust also normalizes simple boolean expressions: overlapping conjoined regions are reduced to their intersection, repeated threshold predicates are tightened, conjoined id_in predicates are intersected, and contradictory predicates become an empty result without scanning variant records.
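The normalization rules for conjunctions can be sketched in a few lines of Python. This is an illustration of the rules as described above, not a port of the Rust code. The helper names are hypothetical, and treating regions on different chromosomes or disjoint intervals as a contradiction is an assumption of this sketch.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple

Region = Tuple[str, int, int]                       # chrom, start, end (1-based inclusive)
Bounds = Tuple[Optional[float], Optional[float]]    # (min, max); None means unbounded


def intersect_regions(a: Region, b: Region) -> Optional[Region]:
    """Intersection of two conjoined regions, or None if they cannot both hold."""
    if a[0] != b[0]:
        return None
    start, end = max(a[1], b[1]), min(a[2], b[2])
    return (a[0], start, end) if start <= end else None


def tighten_bounds(bounds: List[Bounds]) -> Optional[Bounds]:
    """Combine repeated thresholds: largest min, smallest max, None if empty."""
    lows = [low for low, _ in bounds if low is not None]
    highs = [high for _, high in bounds if high is not None]
    low = max(lows) if lows else None
    high = min(highs) if highs else None
    if low is not None and high is not None and low > high:
        return None  # contradictory: no value satisfies both bounds
    return (low, high)


def intersect_ids(id_lists: List[Iterable[str]]) -> Optional[FrozenSet[str]]:
    """IDs present in every conjoined id_in list, or None if none remain."""
    common = set(id_lists[0])
    for ids in id_lists[1:]:
        common &= set(ids)
    return frozenset(common) if common else None


region = intersect_regions(("22", 20_000_000, 21_000_000), ("22", 20_500_000, 22_000_000))
disjoint = intersect_regions(("22", 1, 100), ("22", 200, 300))
tightened = tighten_bounds([(None, 0.05), (None, 0.01)])
contradiction = tighten_bounds([(0.1, None), (None, 0.05)])
ids = intersect_ids([["rs123", "rs456"], ["rs456", "rs789"]])
print(region, disjoint, tightened, contradiction, ids)
```

A None result from any helper corresponds to the contradictory case, where the reader can return an empty result without touching variant records.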

Treat to_ir() as an inspection aid rather than a stable wire format. Build filters with the Python constructors so validation stays consistent.

See Filter API for signatures and validation rules.