Processing 1000 Genomes data for set-based association tests

Jaehyun Joo

14 March, 2021


A set-based association test in the snpsettest package requires a reference data set to infer pairwise linkage disequilibrium (LD) values between a set of variants. This vignette shows you how to use 1000 Genomes data as the reference data for set-based association tests.


PLINK 2.0 is required to process the 1000 Genomes dataset. The 1000 Genomes phase 3 dataset (GRCh37) is available in PLINK 2 binary format from the PLINK 2.0 Resources page. To download the files:

# The links here may change in the future
# "-O" to specify output file name
wget -O all_phase3.psam ""
wget -O all_phase3.pgen.zst ""
wget -O all_phase3.pvar.zst ""

# Decompress pgen.zst to pgen
plink2 --zst-decompress all_phase3.pgen.zst > all_phase3.pgen
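
Before moving on, it can help to confirm that the download and decompression steps produced non-empty files. The following is a minimal sketch rather than part of the original workflow; `check_files` is a hypothetical helper name, and the file names in the usage comment are the ones used in the commands above.

```shell
# Hypothetical helper: report any file that is missing or empty.
# Returns 0 when every listed file exists and has a size greater than zero.
check_files() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (assumes the files from the download step are in the working directory):
# check_files all_phase3.psam all_phase3.pgen all_phase3.pvar.zst
```
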

Choose an appropriate population

Patterns of LD can vary among racial/ethnic groups, so it may be necessary to choose a reference population that matches your study population. For example, if your GWAS is based on individuals of European descent, you may want to keep only the EUR samples, as labeled in the “all_phase3.psam” file.
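
To see which population labels are available before choosing one, the sample file can be tabulated. This is a rough sketch, assuming the file has a single header line and that the super-population label sits in the fifth column (the same column that the sample filter in this vignette tests against "EUR"); `count_superpop` is a hypothetical helper name.

```shell
# Hypothetical helper: count samples per super-population label.
# Assumes one header line and the label in column 5 of the .psam file.
count_superpop() {
  awk 'NR > 1 { n[$5]++ } END { for (p in n) print p, n[p] }' "$1" | sort
}

# Tabulate the labels only if the sample file is present
if [ -f all_phase3.psam ]; then
  count_superpop all_phase3.psam
fi
```

Each output line gives a label followed by its sample count, which makes it easy to check that the chosen label exists and has enough samples.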

# "vzs" modifier to operate directly on pvar.zst
# "--chr 1-22" excludes all variants not on the listed chromosomes
# "--output-chr 26" uses numeric chromosome codes
# "--max-alleles 2": PLINK 1 binary does not allow multi-allelic variants
# "--rm-dup" removes duplicate-ID variants
# "--set-missing-var-ids" replaces missing IDs with a pattern
plink2 --pfile all_phase3 vzs \
       --chr 1-22 \
       --output-chr 26 \
       --max-alleles 2 \
       --rm-dup exclude-mismatch \
       --set-missing-var-ids '@_#_$1_$2' \
       --make-pgen \
       --out all_phase3_autosomes

# Prepare sub-population filter file (header line plus the IDs of EUR samples)
awk 'NR == 1 || $5 == "EUR" {print $1}' all_phase3.psam > EUR_1kg_samples.txt

# Generate sub-population fileset
plink2 --pfile all_phase3_autosomes \
       --keep EUR_1kg_samples.txt \
       --make-pgen \
       --out EUR_phase3_autosomes
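
As a quick sanity check, the number of samples requested in the filter file can be compared with the number retained in the new fileset. The sketch below assumes that header lines start with "#", as in the files produced above; `count_samples` is a hypothetical helper name.

```shell
# Hypothetical helper: count sample lines, skipping "#" header lines and blank lines.
count_samples() {
  awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Compare requested vs. retained samples only if both files are present
if [ -f EUR_1kg_samples.txt ] && [ -f EUR_phase3_autosomes.psam ]; then
  requested=$(count_samples EUR_1kg_samples.txt)
  kept=$(count_samples EUR_phase3_autosomes.psam)
  echo "requested: $requested, kept: $kept"
fi
```

If the two counts differ, the filter file or the column used to build it is worth rechecking.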