Usage

Genotype

It is as simple as:

>>> from pandas_plink import read_plink1_bin
>>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 779)>
dask.array<transpose, shape=(14, 779), dtype=float32, chunksize=(14, 779), chunktype=numpy.ndarray>
Coordinates: (12/14)
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) <U10 'variant0' 'variant1' ... 'variant777' 'variant778'
    fid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    iid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    father   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    mother   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    ...       ...
    chrom    (variant) object '11' '11' '11' '11' '11' ... '11' '11' '11' '11'
    snp      (variant) object '316849996' '316874359' ... '345698259'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int32 157439 181802 248969 ... 28937375 28961091 29005702
    a0       (variant) object 'C' 'G' 'G' 'C' 'C' 'T' ... 'A' 'C' 'A' 'A' 'T'
    a1       (variant) object 'T' 'C' 'C' 'T' 'T' 'A' ... 'G' 'T' 'G' 'C' 'C'

The matrix G is a special matrix: xarray.DataArray. It provides labes for its dimensions ("sample" for rows and "variant" for columns) and additional metadata for those dimensions. Lets print the genotype value of sample "B003" and variant "variant5":

>>> variant = "variant5"
>>> print(G.sel(sample="B003", variant=variant).values)
0.0
>>> print(G.a0.sel(variant=variant).values)
T

It means that sample "B003" has two alleles T at the variant "variant5". Likewise, sample "B003" has two alleles C at the variant "variant135":

>>> variant = "variant135"
>>> print(G.sel(sample="B003", variant=variant).values)
2.0
>>> print(G.a1.sel(variant=variant).values)
C

Now lets print a summary of the genotype values:

>>> print(G.values)
[[0.00 0.00 2.00 ... 0.00 0.00 0.00]
 [0.00 1.00 2.00 ... 0.00 0.00  nan]
 [0.00 0.00 2.00 ... 0.00 0.00 0.00]
 ...
 [2.00 2.00 0.00 ... 2.00 2.00 2.00]
 [2.00 1.00 0.00 ... 2.00 2.00 1.00]
 [0.00 0.00 2.00 ... 0.00 0.00  nan]]

The genotype values can be either 0, 1, 2, or math.nan:

  • 0 Homozygous having the first allele (given by coordinate a0)

  • 1 Heterozygous

  • 2 Homozygous having the second allele (given by coordinate a1)

  • math.nan Missing genotype

Kinship matrix

Pandas-plink supports relationship/covariance matrix encoded in PLINK and GCTA file formats since version 2.0.0.

>>> from pandas_plink import read_rel
>>> K = read_rel("plink2.rel.bin")
>>> print(K)
<xarray.DataArray (sample_0: 10, sample_1: 10)>
array([[ 0.89,  0.23, -0.19, -0.01, -0.14,  0.29,  0.27, -0.23, -0.10,
        -0.21],
       [ 0.23,  1.08, -0.45,  0.19, -0.19,  0.17,  0.41, -0.01, -0.13,
        -0.13],
       [-0.19, -0.45,  1.18, -0.04, -0.15, -0.20, -0.31, -0.04,  0.30,
        -0.01],
       [-0.01,  0.19, -0.04,  0.90, -0.07,  0.01,  0.06, -0.19, -0.09,
         0.17],
       [-0.14, -0.19, -0.15, -0.07,  1.18,  0.09, -0.03,  0.10,  0.22,
         0.17],
       [ 0.29,  0.17, -0.20,  0.01,  0.09,  0.96,  0.07, -0.04, -0.09,
        -0.23],
       [ 0.27,  0.41, -0.31,  0.06, -0.03,  0.07,  0.71, -0.10, -0.09,
        -0.06],
       [-0.23, -0.01, -0.04, -0.19,  0.10, -0.04, -0.10,  1.42, -0.30,
        -0.07],
       [-0.10, -0.13,  0.30, -0.09,  0.22, -0.09, -0.09, -0.30,  0.91,
        -0.02],
       [-0.21, -0.13, -0.01,  0.17,  0.17, -0.23, -0.06, -0.07, -0.02,
         0.91]])
Coordinates:
  * sample_0  (sample_0) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
  * sample_1  (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    fid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    iid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
>>> print(K.values)
[[ 0.89  0.23 -0.19 -0.01 -0.14  0.29  0.27 -0.23 -0.10 -0.21]
 [ 0.23  1.08 -0.45  0.19 -0.19  0.17  0.41 -0.01 -0.13 -0.13]
 [-0.19 -0.45  1.18 -0.04 -0.15 -0.20 -0.31 -0.04  0.30 -0.01]
 [-0.01  0.19 -0.04  0.90 -0.07  0.01  0.06 -0.19 -0.09  0.17]
 [-0.14 -0.19 -0.15 -0.07  1.18  0.09 -0.03  0.10  0.22  0.17]
 [ 0.29  0.17 -0.20  0.01  0.09  0.96  0.07 -0.04 -0.09 -0.23]
 [ 0.27  0.41 -0.31  0.06 -0.03  0.07  0.71 -0.10 -0.09 -0.06]
 [-0.23 -0.01 -0.04 -0.19  0.10 -0.04 -0.10  1.42 -0.30 -0.07]
 [-0.10 -0.13  0.30 -0.09  0.22 -0.09 -0.09 -0.30  0.91 -0.02]
 [-0.21 -0.13 -0.01  0.17  0.17 -0.23 -0.06 -0.07 -0.02  0.91]]

Please, refer to the functions pandas_plink.read_rel() and pandas_plink.read_grm() for more details.