Welcome to PRSKB’s CLI documentation!¶
The PRSKB’s CLI tool is an extension of the PRSKB web application. It is designed to give more flexability and capability in calculating polygenic risk scores for large datasets. Features include:
Searching the database for studies and traits
Printing out available ethnicities to filter by
Learning about required and optional parameters for performing calculations
Calculating polygenic risk scores
Given the required parameters, the tool will calculate risk scores for each individual sample for each study and trait combination in the database.
NOTE: You MUST have bash and jq for bash, and python3 and the PyVCF, filelock, and requests Python modules installed. For GWAS uploading, you must also have the myvariant, biopython, and biothings_client Python modules installed.
Required Parameters Example¶
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.05 -r hg19 -p EUR
The CLI polygenic risk score calculator can be run directly from the command-line or through the CLI’s interactive menu. The menu includes the options to search the database and to learn more about the parameters involved in risk score calculations. To access the menu, run the script without parameters. You will be given the option to see the usage statement or start the interactive menu.
Search the Database¶
Through the interactive menu, users can search the database for traits or studies. This can be helpful when filtering studies to run the calculator on.
After selecting the menu option to search for studies or traits, you will be prompted to specify if you are searching for studies or traits. You can then enter the search term you wish to use.
When the search is complete, the results will be printed to the console, and you will be returned to the menu.
The menu has an option for displaying the available ethnicities to filter studies by.
Learn about Parameters¶
The menu also has an option for learning about the parameters involved in filtering studies and calculating scores.
File Path (-f)¶
The path to the input file. Either a vcf or a txt with lines formatted as rsID:allele1,allele2
Output File (-o)¶
The path to output file. Can be either a tsv or a json.
P-value Cutoff (-c)¶
Creates a threshold for included snps.
Reference Genome (-r)¶
The reference genome of samples in the input file. Available reference genomes: hg17, hg18, hg19, or hg38
Super Population (-p)¶
The super population of samples in the input file. These are the five super populations from the 1000 genomes project
AFR - African
AMR - American
EAS - East Asian
EUR - European
SAS - South Asian
Optional Filtering Parameters¶
To run the PRSKB CLI calculator on a more specific set of studies, optional filtering parameters can be added. When specifying more than one, add the identifying flag in front of every filter specified.
Specific traits to run the calculator on.
-t "Alzheimer's Disease" -t acne
Study Types (-k)¶
Specifies what kinds of studies to run analyses on.
HI (high impact) - studies that have the highest altmetric score within their trait group
LC (largest cohort) - studies that have the largest sample size (initial and replication combined)
O (other) - all other studies that do not fall in the first two categories
-k HI -k LC
Study IDs (-i)¶
Specifies the studyID(s) of studies to run the calculator on.
Filters studies by the ethnicity of their sample population.
-e European -e "East Asian"
Additonal Optional Parameters¶
These additional parameters manage calculation details and may only be specified once per calculation.
Creates a more detailed TSV result file. The verbose output file will include the following for each corresponding sample, study, and trait combination:
polygenic risk score
variants that are present but do not include the risk allele
variants that are in high linkage disequilibrium whose odds ratios are not included in the calculations
If the output file is in TSV format and this parameter is not included, the default TSV result file will include the study ID and the corresponding polygenic risk scores for each sample. If the output file is in JSON format, the results will, by default, be in verbose format.
NOTE: There is no condensed version of JSON output.
Sex dependent associations (-g)¶
Though a rare occurence, some studies have duplicates of the same SNP that differ by which biological sex the p-value and odds ratio is associated with or SNPs that are not duplicated, but are dependent on biological sex. The system default is to exclude sex dependent SNPs from calculations. You can include sex dependent associations by selecting either male (M) or female (F).
Step Number (-s)¶
Breaks running the calculator into steps. Make sure when you are running the CLI in steps that the only parameter that changes between the two steps is the step parameter.
Step 1 deals with downloading data from the server for use in calculations. This step requires internet access.
Step 2 deals with the actual calculation of polygenic risk scores. This step does not require internet access but does need to have access to the files that were downloaded in step 1.
Number Of Processes (-n)¶
This parameter determines the number of subprocesses used by the Python multiprocessing module during the calculations. If no value is given by the user, all available cores will be utilized.
Omit Unused Studies file (-m)¶
This flag will prevent the creation of an additional output file that lists the studies that no risk scores could be calculated for.
User GWAS upload file (-u)¶
This parameter allows the user to upload a GWAS summary statistics file to be used in polygenic risk score calculations instead of GWAS Catalog data stored in our database. The file must be tab separated, use a .tsv or .txt extension (or be a zipped file with one of those extensions), and have the correct columns in order for calculations to occur. (see Uploading GWAS Summary Statistics for more directions on uploading GWAS data)
GWAS reference genome (-a)¶
Specifies the reference genome of the GWAS summary statistics data. If left off when a GWAS file is uploaded, the reference genome is assumed to be the same as the samples reference genome (-r).
Polygenic risk scores can be calculated directly through the command-line or through the interactive menu. Using just the required parameters, the CLI will calculate risk scores for all studies in the database for each individual in the input file. Additional parameters will filter studies to be included in the calculation.
Some trait/study combination have multiple odds ratios and p-values for the same SNP. This could be due to many reasons, including differences in association between males and females. By default, we exclude all SNPs that have multiple odds ratios/p-values from calculations or are reported for only a single sex. Users are given the option to add in SNP associations that are annotated for males or females. Studies where snps have been excluded for these reasons are denoted by a † symbol.
Uploading GWAS Summary Statistics¶
In addition to calculating polygenic risk scores using GWA studies from the GWAS Catalog stored in our database, users have the option to upload their own GWAS summary statistics to use in risk score calculations.
The GWAS summary statistics file to be uploaded must be in the correct format. It should be either a .tsv or a .txt tab separated file, or a zipped .tsv or .txt. The following columns are required and must be included in the file’s header line: Study ID, Trait, RsID, Chromosome, Position, Risk Allele, Odds Ratio, and P-value. Additional optional columns that will be included if present are: Citation and Reported Trait. Column order does not matter and there may be extra columns present in the file. Required and optional header names must be exact.
If more than one odds ratio exists for an rsID in a study, the odds ratio and corresponding risk allele with the most significant p-value will be used.
NOTE: If a GWAS data file is specified, risk scores will only be calculated on that data. No association data from the PRSKB will be used. Additionally, the optional params -t, -k, -i, -e, and -g will be ignored.
Below is a brief overview of the required and optional columns for uploading GWAS summary statistics data.
Study ID - A unique study identifier. In our database, we use GWAS Catalog study identifiers. As long as this is unique for each study, it can be whatever you want.
Trait - The Experimental Factor Ontology (EFO) trait the GWAS deals with.
RsID - The Reference SNP cluster ID (RsID) of the SNP.
Chromosome - The chromosome the SNP resides on.
Position - The position of the SNP in the reference genome.
Risk Allele - The allele that confers risk or protection.
Odds Ratio - Computed in the GWA study, a numerical value of the odds that those in the case group have the allele of interest over the odds that those in the control group have the allele of interest.
P-value - The probability that the risk allele confers the amount of risk stated.
Citation - The citation information for the study.
Reported Trait - Trait description for this study in the authors own words.
Run the calculator on all studies in the database:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS
Run the calculator on all studies about the trait ‘acne’:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS -t acne
Run the calculator on all high impact studies with samples that are of European descent:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p AMR -k HI -e european
Run the calculator on the study with study ID GCST004410:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p EUR -i GCST004410
Run the calculator on all studies in the database in two steps:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS -s 1 ./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS -s 2
Run the calculator on all studies about the trait ‘acne’, filtering studies from the allAssociations_hg19_f.txt working file:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS ./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS -t acne -s 2
Run the calculator using uploaded GWAS summary statistics:
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p AFR -u path/to/GWAS/GWASsummaryStatistics.tsv -a hg17 ./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p AFR -u path/to/GWAS/GWASsummaryStatistics.tsv
More examples can be found in the CLI download README.md
There are two choices for the tsv output results - condensed (default) or full.
This version of the output results contains one row for each study with columns for each sample’s polygenic risk score.
Study ID | Citation | Reported Trait | Trait(s) | Sample1 | Sample2 | Sample3 | ect.
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS
This version of the output results contains one row for each sample/study pair. It also includes columns listing the rsIDs of the snps involved in the risk score calculation.
Sample | Study ID | Citation | Reported Trait | Traits(s) | Risk Score | Protective Variants | Risk Variants | Variants Without Risk Allele | Variants in High LD
./runPrsCLI.sh -f path/to/file/samples.vcf -o path/to/file/output.tsv -c 0.0005 -r hg19 -p SAS -v