Skip to content

cis-eQTLยค

cis-eQTL scans use genomic windows as the iteration unit. iter_regions(...) keeps each region result separate while still using indexed region reads when the source format supports them.

This example assumes the expression matrix has one row per sample and one column per gene. The genes table supplies the gene ID and its cis window:

gene_id cis_region
ENSG000001 22:20000000-21000000
ENSG000002 22:22000000-23000000

The unusual part is the region list. A region(...) is also a filter object. To scan only common biallelic SNPs in each cis window, build one combined filter per gene: region & maf & snp & biallelic. That lets genoio use region pushdown first, then apply the remaining predicates while reading candidate variants.

import polars as pl
import genoio

ds = genoio.bgen("data/chr22_hg38.bgen")

# `samples` defines the row order of X_region for every returned region.
samples = ds.samples()

expression = pl.read_csv("expression.tsv", separator="\t")
covariates = pl.read_csv("covariates.tsv", separator="\t")
genes = pl.read_csv("cis_windows.tsv", separator="\t")

design = (
    # Start from genotype sample order so expression, covariates, and genotypes align.
    samples.select("iid")
    .join(expression, on="iid", how="left")
    .join(covariates.select("iid", "age", "sex", "PC1", "PC2"), on="iid", how="left")
)
C = design.select("age", "sex", "PC1", "PC2").to_numpy()

# This shared filter is applied inside every cis window.
variant_filter = genoio.maf(max=0.05) & genoio.snp() & genoio.biallelic()

# Each list element is a complete read filter for one gene's cis window.
# The region predicate can use indexed VCF/BGEN reads when an index exists.
regions = [genoio.region(region) & variant_filter for region in genes["cis_region"]]

for region_index, (region, (X_region, variants)) in enumerate(
    ds.iter_regions(regions, dosage="dosage", return_variants=True)
):
    # Use the region index to pull the matching gene metadata and phenotype.
    gene = genes.row(region_index, named=True)
    y_region = design[gene["gene_id"]].to_numpy()

    # X_region contains only variants retained for this gene's cis window.
    cis_scan(X_region, y_region, C, gene=gene["gene_id"], region=region, variants=variants)

Use iter_regions(...) when each genomic interval has its own phenotype or analysis context. Unlike iter_blocks(...), the iterator boundary is part of the scientific question, not just a matrix chunk size.