Users Manual¶
Initialize Environment¶
SuShiE is a command-line software written in Python. Before installation, we recommend to create a new environment using conda so that it will not affect the software versions of users’ other projects.
SuShiE uses JAX with Just In Time compilation to achieve high-speed computation. However, there are some issues for JAX with Mac M1 chip. To solve this, users need to initiate conda using miniforge, and then install SuShiE using pip
in the desired environment.
Installation¶
Users can download the latest repository and then use pip
:
git clone https://github.com/mancusolab/sushie.git
cd sushie
pip install .
We currently only support Python3.8+.
Data Preparation¶
Fine-mapping using individual-level data¶
To fine-map using individual-level data, SuShiE requires at least phenotype and genotype data specified with the option to specify covariates.
Although we highly recommend users to perform high-quality QC on their own genotype, phenotype, and covariate data, we implement following basic QCs in the software:
Remove subjects with N/A values from either phenotype or covariates data.
Remove SNPs that all subjects have N/A value.
Impute SNPs that partial subjects have N/A value based on two times allele frequencies.
Only keep subjects who have data in all the genotype, phenotype, and covariate data.
Only keep SNPs that are available in all the ancestries.
Adjust genotype data across ancestries based on the same reference alleles. Drop non-biallelic SNPs.
Remove SNPs that have minor allele frequency (MAF) less than 1% within each ancestry (users can change 1% with
--maf
).Users also have an option to keep ambiguous SNPs (i.e., A/T, T/A, C/G, or GC) by specifying
--keep-ambiguous
(Default is NOT to keep them).For single ancestry SuSiE or Mega-SuSiE, users have the option to perform rank inverse normalization transformation on the phenotype data.
See sushie.cli.process_raw()
for these QCs’ source codes.
Fine-mapping using summary-level data (GWAS statistics)¶
To fine-map using summary-level data, SuShiE requires at least GWAS z statistics, sample sizes, and LD data. For LD data, users can provide individual-level genotype in PLINK1.9, VCF, or BGEN format and let SuShiE compute the LD matrix, or provide pre-computed LD matrix in tsv format.
Although we highly recommend users to perform high-quality QC on their own summary-level data, we implement following basic QCs in the software:
Remove SNPs with N/A values in GWAS.
Only keep SNPs that are available in all the ancestries.
Adjust GWAS and genotype data across ancestries based on the same reference alleles. Drop non-biallelic SNPs.
Remove SNPs (for LD computation) that have minor allele frequency (MAF) less than 1% within each ancestry (users can change 1% with
--maf
).Users also have an option to keep ambiguous SNPs (i.e., A/T, T/A, C/G, or GC) by specifying
--keep-ambiguous
(Default is NOT to keep them).
Testing Data¶
We provide example data in ./data/
folder to test out SuShiE. All the data are in three ancestries: 489 European individuals (EUR), 639 African individuals (AFR), and 481 East Asian individuals (EAS).
The genotype is the high-quality HapMap SNPs in some random gene 1M base-pair window, which contains 123, 129, and 113 SNPs for EUR, AFR, and EAS respectively in 1000G project. We provide genotype data in plink 1, vcf, and bgen 1.3 format.
Using ./data/make_example.py
, we simulated phenotype data (2 causal QTLs, cis-SNP heritability: 0.5 and effect size correlation 0.8), random covariate data for each ancestry. The two QTL rsID are rs1886340 and rs10914958. It also outputs all.pheno
file that row-binds simulated phenotype across ancestries, all.ancestry.index
file that specifies ancestry index if using all.pheno
, all.covar
, and .\plink\all
triplets, keep.subject
file that specifies subjects to be included in the inference.
As for the format requirement, see Parameters for detailed explanations.
Examples¶
SuShiE software is very easy to use, for it only has one command finemap
. In this section, we walk through several examples of using SuShiE.
See Output Files for the detailed explanation of output files.
See Parameters for the detailed explanation of parameters.
We make a bash script ./misc/run_sushie.sh
to show a more general working flow of running SuShiE.
If users still have questions, feel free to contact the developers.
Here are some examples for fine-mapping using individual-level data:
1. Two-Ancestry SuShiE¶
In this example, we perform two-ancestry SuShiE with covariates regressed out from both phenotypes and genotypes while updating the prior effect size covariance matrix during the optimizations.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --covar EUR.covar AFR.covar --output ./test_result
2. \(N\)-Ancestry SuShiE¶
In the example below, we perform single-ancestry SuShiE, which is equivalently to the SuSiE model (see Reference).
cd ./data/
sushie finemap --pheno EUR.pheno --vcf vcf/EUR.vcf --covar EUR.covar --output ./test_result
Or three-ancestry setting:
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno EAS.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf vcf/EAS.vcf --covar EUR.covar AFR.covar EAS.covar --output ./test_result
3. Can I use other formats of genotypes?¶
Yes! SuShiE can take either plink 1, vcf, or bgen, but not plink 2.
For plink 1, SuShiE read in the triplet (bed, bim, and fam) prefix.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --plink plink/EUR plink/AFR --output ./test_result
For bgen data, users need to make sure that the latter allele shown up in the allele ids
is the reference allele.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --bgen bgen/EUR.bgen bgen/AFR.bgen --output ./test_result
4. My data contains all the participants and I do not want to separate them¶
No problem! If all the subjects are in single phenotype, genotype, and covariate files. Users just need to use --ancestry-index
command to specify a file that subject ID on the first column, and the ancestry index on the second column. The ancestry index has to start from 1 continuously to the total number of ancestry.
cd ./data/
sushie finemap --pheno all.pheno --plink plink/all --ancestry-index all.ancestry.index --output ./test_result
5. How about mega or meta SuShiE?¶
The software employs the function to run meta SuShiE and mega SuShiE by adding the parameter --meta
or --mega
.
We define the meta SuShiE as running single-ancestry SuShiE followed by meta analysis of the PIPs:
We define the mega SuShiE as running single-ancestry SuShiE on genotype and phenotype data that is row-wise stacked across ancestries.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --meta --output ./test_result
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --mega --output ./test_result
6. Let’s estimate heritability, run CV, and make FUSION files!¶
SuShiE incorporates codes in limix to estimate the narrow-sense cis-heritability (\(h_g^2\)) by specifying --her
.
SuShiE also has a function (--cv
) to perform \(X\)-fold cross-validation (CV; --cv-num X
) on the ancestry-specific prediction weights to compute the out-of-sample \(r^2\) between predicted and measured expressions with its corresponding \(p\)-value.
Specifically, we randomly (--seed [YOUR SEED]
) and equally divide the dataset into X
portions. We regard each portion as validation dataset and the rest four portions as training dataset. Then, we perform SuShiE on the training datasets for X
times, and predict the expressions on the corresponding validation dataset. Last, we row-wise stack all X
predicted expressions and compute the \(r^2\) with row-wise stacked and matched validation dataset.
With these two information (\(h_g^2\) and CV), we prepare R codes ./misc/make_fusion.R
to generate FUSION-format prediction weights, thus can be used in TWAS.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --cv --her --output ./test_result
Rscript ./misc/make_FUSION.R ./test_result ~
7. I don’t want to scale my phenotype by its standard deviation¶
Fine-mapping inference sometimes can be sensitive to whether scaling the phenotypes and genotypes. SuShiE by default scales the phenotypes and genotypes by their respective standard deviations. However, if users want to disable it, simply add --no-scale
to the command.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --no-scale --output ./test_result
8. I have my own initial values for the hyperparameters¶
SuShiE has three hyperparameters (Model Description): the residual variance (\(\sigma^2_e\)) prior, the QTL effect size variance (\(\sigma^2_{i,b}\)) prior, and the ancestral effect size correlation (\(\rho\)) prior. SuShiE by default initializes them as 0.001
, 0.001
, and 0.8
. If users have their own initial values, simply specify them with --resid-var
, --effect-var
, and --rho
. Make sure the ancestry order has to match the phenotype file order.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --resid-var 2.2 2.2 --effect-var 1.2 3.4 --rho 0.2 --output ./test_result
By default, SuShiE will update \(\sigma^2_{i,b}\) and \(\rho\) during the optimization. If users want to disable it, add --no-update
to the command line.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --resid-var 2.2 2.2 --effect-var 1.2 3.4 --rho 0.2 --no-update --output ./test_result
In addition, with --no-update
, if users only specify --effect-var
but not for --rho
, --effect-var
will be fixed during the optimizations while --rho
will get updated, vice versa. In other words, if users want to fix both priors, they have to specify both at the same time or specify neither of them (in the latter case, fixing the default values 0.001 and 0.2 as the priors).
9. What if I assume no correlation across ancestries?¶
SuShiE features that it accounts for ancestral quantitative trait loci (QTL) effect size correlation (\(\rho\) in Model Description) in the inference, which is different from other SuSiE-extended multi-ancestry fine-mapping frameworks assuming no ancestral correlation (Joint SuShiE). However, it has the functions to make inference assuming no correlation across ancestries by simply specifying --no-update
on the effect size covariance matrix and fixing the rho equal to zero --rho 0
. With this, the effect size variance (\(\sigma^2_{i,b}\) in Model Description) will get updated while rho will not.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --no-update --rho 0 --output ./test_result
10. I want to improvise in post-hoc analysis¶
We understand Output Files output by SuShiE may not serve all users’ post-hoc analysis. Therefore, we add the option to save all the inference results in *.npy
file by specifying --numpy
.
The *.npy
files include SNP information, prior estimators, posterior estimators, credible set, PIPs, and sample size.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --numpy --output ./test_result
11. I seek to use GPU or TPU to make inference faster¶
SuShiE software uses JAX with Just In Time compilation to achieve high-speed computation. Jax can be run on GPU or TPU.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --platform gpu --output ./test_result
12. I want to use 32-bit precision¶
SuShiE uses 64-bit precision to assure an accurate inference. However, if users want to use 32-bit precision, they can specify it by having --precision 32
.
Unless necessarily needed, we do not recommend to use 32-bit precision as it may cause non-positive semi-definite effect size covariance prior or decreasing ELBO, thus concluding the inference earlier.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --precision 32 --output ./test_result
13. I want to run fine-mapping on certain subjects¶
Users can use --keep
command to specify a file that contains the subject IDs. As a result, the following fine-mapping inference only performs on the subjects listed in the file.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --keep keep.subject --output ./test_result
14. I want to assign the prior weights for each SNP¶
Users can use --pi
command to specify a tsv file that contains the SNP ID and their prior weights. The weights will be normalized to sum to 1 before inference.
cd ./data/
sushie finemap --pheno EUR.pheno AFR.pheno --vcf vcf/EUR.vcf vcf/AFR.vcf --pi prior_weights --output ./test_result
Here are some examples for fine-mapping using individual-level data:
I want to perform fine-mapping on summary-level data and I provide individual-level reference panels for LD.
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --vcf vcf/EUR.vcf AFR.vcf --sample-size 489 639 --output ./test_result
I want to perform fine-mapping on summary-level data and I provide pre-computed LD matrix.
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --ld EUR.ld AFR.ld --sample-size 489 639 --output ./test_result
I want to only focus on SNPs with GWAS P values less than 5e-8 across
all
ancestries.
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --vcf vcf/EUR.vcf AFR.vcf --sample-size 489 639 --gwas-sig 5e-8 --gwas-sig-type all --output ./test_result
I want to only focus on SNPs between 1bp and 1Mbp on chromsome 6
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --vcf vcf/EUR.vcf AFR.vcf --sample-size 489 639 --chrom 6 --start 1 --end 1000000 --output ./test_result
My GWAS data has different column names.
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --vcf vcf/EUR.vcf AFR.vcf --sample-size 489 639 --gwas-header CHR SNP BP A1 A2 STAT --output ./test_result
I want to add small number to diagonal of my LD matrix to make it positive definite.
cd ./data/
sushie finemap --summary --gwas EUR.gwas AFR.gwas --ld EUR.ld AFR.ld --sample-size 489 639 --ld-adjust 1e-3 --output ./test_result
Parameters¶
Parameter |
Type |
Default |
Example |
Notes |
---|---|---|---|---|
|
Boolean |
False |
|
Indicator whether to run fine-mapping on summary statistics. Default is False. If True, the software will need GWAS files as input data by specifying –gwas and need LD matrix by specifying either –ld or one of the –plink, –vcf, or –bgen. If False, the software will need phenotype data by specifying –pheno and genotype data by specifying either –plink, –vcf, or –bgen. |
|
String |
Required, no default |
|
Phenotype data. It has to be a tsv file that contains at least two columns where the first column is subject ID and the second column is the continuous phenotypic value. It can be a compressed file (e.g., tsv.gz). It is okay to have additional columns, but only the first two columns will be used. No headers. Use |
|
String |
None |
|
Genotype data in plink 1 format. The plink triplet (bed, bim, and fam) should be in the same folder with the same prefix. Use |
|
String |
None |
|
Genotype data in vcf format. Use |
|
String |
None |
|
Genotype data in bgen 1.3 format. Use |
|
String |
None |
|
Single file that contains subject ID and their ancestry index. Default is None. It has to be a tsv file that contains at least two columns where the first column is the subject ID and the second column is the ancestry index starting from 1 (e.g., 1, 2, 3 etc.). It can be a compressed file (e.g., tsv.gz). Only the first two columns will be used. No headers. If this file is specified, it assumes that all the phenotypes across ancestries are in one single file, and same thing for genotypes and covariates data. It will produce errors if multiple phenotype, genotype, and covariates are specified. |
|
String |
None |
|
Single file that contains subject ID across all ancestries that are used for fine-mapping. It has to be a tsv file that contains at least one columns where the first column is the subject ID. It can be a compressed file (e.g., tsv.gz). No headers. If this file is specified, all phenotype, genotype, and covariates data will be filtered down to the subjects listed in it. |
|
String |
None |
|
Covariates that will be accounted in the fine-mapping. It has to be a tsv file that contains at least two columns where the first column is the subject ID. It can be a compressed file (e.g., tsv.gz). No headers. All the columns will be counted. Use |
|
String |
None |
|
LD files that will be used in the fine-mapping. Default is None. Keep the same ancestry order as GWAS files. It has to be a tsv or comparessed file (e.g., tsv.gz). The header has to be the SNP name matching the GWAS data in –gwas. It can have less or more SNPs than the GWAS data, and the software will find the overlap SNPs. Users must ensure that the LD and GWAS z statistics are computed using the same counting alleles. |
|
Integer |
None |
|
Chromsome number to subset GWAS SNPs in the fine-mapping. Default is None. Value has to be an integer number between 1 and 22. |
|
Integer |
None |
|
Base-pair start position to subset GWAS SNPs in the fine-mapping. Default is None. Value has to be a positive integer number. |
|
Integer |
None |
|
Base-pair end position to subset GWAS SNPs in the fine-mapping. Default is None. Value has to be a positive integer number. |
|
Integer |
None |
|
GWAS sample size of each ancestry. Default is None. Values have to be positive integer. Use ‘space’ to separate ancestries if more than two. The order has to be the same as the GWAS data in –gwas. |
|
String |
chrom snp pos a1 a2 z |
|
GWAS file header names. Default is [‘chrom’, ‘snp’, ‘pos’, ‘a1’, ‘a0’, ‘z’]. Users can specify the header names for the GWAS data in this order. |
|
Float |
None |
|
The significance threshold for SNPs to be included in the fine-mapping. Default is 1.0. Only SNPs with P value less than this threshold will be included. It has to be a float number between 0 and 1. |
|
String |
at-least |
|
The cases how to include significant SNPs in the fine-mapping across ancestries. If it is ‘at-least’, the software will include SNPs that are significant in at least one ancestry. If it is ‘all’, the software will include SNPs that are significant in all ancestries. Default is ‘at-least’. The significant threshold is specified by –gwas-sig. |
|
Integer |
10 |
|
Integer number of shared effects pre-specified. Larger number may cause slow inference. |
|
String |
“uniform” |
|
Prior probability for each SNP to be causal (\(\pi\) in Model Description). Default is uniform (i.e., |
|
Float |
1e-3 |
|
Specify the prior for the residual variance (\(\sigma^2_e\) in Model Description) for ancestries. Values have to be positive. Use |
|
Float |
1e-3 |
``–effect-var 5.21 0.99 `` |
Specify the prior for the causal effect size variance (\(\sigma^2_{i,b}\) in Model Description) for ancestries. Values have to be positive. Use |
|
Float |
0.1 |
|
Specify the prior for the effect correlation (\(\rho\) in Model Description) for ancestries. Default is 0.1 for each pair of ancestries. Use space to separate ancestries if more than two. Each rho has to be a float number between -1 and 1. If there are |
|
Boolean |
False |
|
Indicator to scale the genotype and phenotype data by standard deviation. Default is to scale. Specify |
|
Boolean |
False |
|
Indicator to regress the covariates on each SNP. Default is to regress. Specify |
|
Boolean |
False |
|
Indicator to update effect covariance prior before running single effect regression. Default is to update. Specify |
|
Integer |
500 |
|
Maximum iterations for the optimization. Larger number may slow the inference while smaller may cause different inference. |
|
Float |
1e-3 |
|
Minimum tolerance for the convergence. Smaller number may slow the inference while larger may cause different inference. |
|
Float |
0.95 |
|
Specify the PIP threshold for SNPs to be included in the credible sets. It has to be a float number between 0 and 1. |
|
Float |
0.5 |
|
Specify the purity threshold for credible sets to be output. It has to be a float number between 0 and 1. |
|
String |
“weighted” |
|
Specify the method to compute purity across ancestries. Users choose ‘weighted’, ‘max’, or ‘min’. |
|
Float |
0 |
|
The adjusting number to LD diagonal to ensure the positive definiteness. It has to be positive integer number between 0 and 0.1. Default is 0. |
|
Integer |
250 |
|
The maximum selected number of SNPs to calculate the purity. Default is 250. It has to be positive integer number. A larger number can unnecessarily spend much memory. |
|
Integer |
100 |
|
The minimum number of SNPs to fine-map. Default is 100. It has to be positive integer number. A smaller number may produce weird results. |
|
float |
0.01 |
|
Threshold for minor allele frequency (MAF) to filter out SNPs for each ancestry. It has to be a float between 0 (exclusive) and 0.5 (inclusive). |
|
Boolean |
False |
|
Indicator to perform rank inverse normalization transformation (rint) for each phenotype data. Default is False (do not transform). Specify –rint will store ‘True’ value. We suggest users to do this QC during data preparation. |
|
Boolean |
False |
|
Indicator to re-order single effects based on Frobenius norm of alpha-weighted posterior mean square. Default is False (to re-order). Specify –no-reorder will store ‘True’ value. |
|
Boolean |
False |
|
Indicator to keep ambiguous SNPs (i.e., A/T, T/A, C/G, or G/C) from the genotypes. Recommend to remove these SNPs if each ancestry data is from different studies or plan to use the inference results for downstream analysis with other datasets. Default is False (do not keep). Specify –keep-ambiguous will store ‘True’ value. |
|
Boolean |
False |
|
Indicator to perform single-ancestry SuShiE followed by meta analysis of the results. Specify |
|
Boolean |
False |
|
Indicator to perform mega SuShiE that run single-ancestry SuShiE on genotype and phenotype data that is row-wise stacked across ancestries. Specify |
|
Boolean |
False |
|
Indicator to perform heritability (\(h_g^2\)) analysis using limix. Specify |
|
Boolean |
False |
|
Indicator to perform cross validation (CV) and output CV results (adjusted r-squared and its p-value) for future FUSION pipline. Specify |
|
Integer |
5 |
|
The number of fold cross validation. It has to be a positive integer number. Larger number may cause longer running time. |
|
Integer |
12345 |
|
The seed for randomization. It can be used to cut data sets in cross validation. It can also be used to randomly select SNPs in the credible sets to calculate the purity. Default is 12345. It has to be positive integer number. |
|
Boolean |
False |
|
Indicator to output all the credible set results before pruning for purity including PIPs, \(\alpha\) (in Model Description), whether in cs, across all \(L\). Default is False. Specify –alphas will store ‘True’ value and increase running time. |
|
Boolean |
False |
|
Indicator to output all the results in *.npy file. Specify |
|
String |
“Trait” |
|
Trait, tissue, gene name of the phenotype for better indexing in post-hoc analysis. |
|
Boolean |
False |
|
Indicator to not print message to console. Specify |
|
Boolean |
False |
|
Indicator to include debug information in the log. Specify |
|
Boolean |
False |
|
Indicator to compress all output tsv files in ‘tsv.gz’. Specify |
|
String choices in |
cpu |
|
Indicator for the JAX platform. |
|
Integer choices in |
64 |
|
Indicator for the JAX precision: 64-bit or 32-bit. Choose 32-bit may cause ‘elbo decreases’ warning. |
|
String |
sushie_finemap |
|
Prefix for output files. |