Usage

Genotype

Reading genotypes from PLINK binary files (BED, BIM, and FAM) is as simple as:

>>> from pandas_plink import read_plink1_bin
>>> G = read_plink1_bin("chr11.bed", "chr11.bim", "chr11.fam", verbose=False)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 779)>
dask.array<transpose, shape=(14, 779), dtype=float64, chunksize=(14, 779), chunktype=numpy.ndarray>
Coordinates:
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) object '11_316849996' '11_316874359' ... '11_345698259'
    fid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    iid      (sample) <U4 'B001' 'B002' 'B003' 'B004' ... 'B012' 'B013' 'B014'
    father   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    mother   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    gender   (sample) <U1 '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    trait    (sample) float64 -9.0 -9.0 -9.0 -9.0 -9.0 ... -9.0 -9.0 -9.0 -9.0
    chrom    (variant) <U2 '11' '11' '11' '11' '11' ... '11' '11' '11' '11' '11'
    snp      (variant) <U9 '316849996' '316874359' ... '345653648' '345698259'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int64 157439 181802 248969 ... 28937375 28961091 29005702
    a0       (variant) <U1 'C' 'G' 'G' 'C' 'C' 'T' ... 'T' 'A' 'C' 'A' 'A' 'T'
    a1       (variant) <U1 'T' 'C' 'C' 'T' 'T' 'A' ... 'C' 'G' 'T' 'G' 'C' 'C'

The matrix G is a special matrix: an xarray.DataArray. It provides labels for its dimensions (sample for rows and variant for columns) and additional metadata for those dimensions. Let's print the genotype value of sample B003 at variant 11_316874359:

>>> print(G.sel(sample="B003", variant="11_316874359").values)
0.0
>>> print(G.a0.sel(variant="11_316874359").values)
G

It means that sample B003 has two G alleles at variant 11_316874359. Likewise, sample B003 has two C alleles at variant 11_316941526:

>>> print(G.sel(sample="B003", variant="11_316941526").values)
2.0
>>> print(G.a1.sel(variant="11_316941526").values)
C

Now let's print a summary of the genotype values:

>>> print(G.values)
[[0.00 0.00 2.00 ... 0.00 0.00 0.00]
 [0.00 1.00 2.00 ... 0.00 0.00  nan]
 [0.00 0.00 2.00 ... 0.00 0.00 0.00]
 ...
 [2.00 2.00 0.00 ... 2.00 2.00 2.00]
 [2.00 1.00 0.00 ... 2.00 2.00 1.00]
 [0.00 0.00 2.00 ... 0.00 0.00  nan]]

The genotype values can be either 0, 1, 2, or NaN:

  • 0: homozygous for the first allele (given by coordinate a0)

  • 1: heterozygous

  • 2: homozygous for the second allele (given by coordinate a1)

  • NaN: missing genotype
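
The encoding above can be illustrated with a small, dependency-free sketch. The helpers decode_genotype and a1_frequency are hypothetical names introduced here for illustration and are not part of pandas-plink. They work on plain Python floats, so values taken from G.values would have to be passed in explicitly.

```python
import math


def decode_genotype(value, a0, a1):
    """Translate a 0/1/2/NaN genotype value into a pair of alleles.

    Hypothetical helper, not part of pandas-plink.
    """
    if math.isnan(value):
        return None  # missing genotype
    pairs = {0: (a0, a0), 1: (a0, a1), 2: (a1, a1)}
    if value not in pairs:
        raise ValueError(f"unexpected genotype value: {value!r}")
    return pairs[value]


def a1_frequency(values):
    """Frequency of the a1 allele among the non-missing genotype values."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        return math.nan
    # Each sample carries two alleles, and the value counts a1 copies.
    return sum(observed) / (2 * len(observed))


# Variant 11_316874359 has a0 = 'G' and a1 = 'C' in the output above.
print(decode_genotype(0.0, "G", "C"))  # ('G', 'G')
print(decode_genotype(2.0, "G", "C"))  # ('C', 'C')
print(a1_frequency([0.0, 1.0, 2.0, math.nan]))  # 0.5
```

Missing genotypes are skipped rather than counted as zero, so they do not bias the allele frequency.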

Kinship matrix

Since version 2.0.0, pandas-plink supports relationship/covariance matrices encoded in the PLINK and GCTA file formats.

>>> from pandas_plink import read_rel
>>> K = read_rel("plink2.rel.bin")
>>> print(K)
<xarray.DataArray (sample_0: 10, sample_1: 10)>
array([[ 0.89,  0.23, -0.19, -0.01, -0.14,  0.29,  0.27, -0.23, -0.10,
        -0.21],
       [ 0.23,  1.08, -0.45,  0.19, -0.19,  0.17,  0.41, -0.01, -0.13,
        -0.13],
       [-0.19, -0.45,  1.18, -0.04, -0.15, -0.20, -0.31, -0.04,  0.30,
        -0.01],
       [-0.01,  0.19, -0.04,  0.90, -0.07,  0.01,  0.06, -0.19, -0.09,
         0.17],
       [-0.14, -0.19, -0.15, -0.07,  1.18,  0.09, -0.03,  0.10,  0.22,
         0.17],
       [ 0.29,  0.17, -0.20,  0.01,  0.09,  0.96,  0.07, -0.04, -0.09,
        -0.23],
       [ 0.27,  0.41, -0.31,  0.06, -0.03,  0.07,  0.71, -0.10, -0.09,
        -0.06],
       [-0.23, -0.01, -0.04, -0.19,  0.10, -0.04, -0.10,  1.42, -0.30,
        -0.07],
       [-0.10, -0.13,  0.30, -0.09,  0.22, -0.09, -0.09, -0.30,  0.91,
        -0.02],
       [-0.21, -0.13, -0.01,  0.17,  0.17, -0.23, -0.06, -0.07, -0.02,
         0.91]])
Coordinates:
  * sample_0  (sample_0) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
  * sample_1  (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    fid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
    iid       (sample_1) object 'HG00419' 'HG00650' ... 'NA20508' 'NA20753'
>>> print(K.values)
[[ 0.89  0.23 -0.19 -0.01 -0.14  0.29  0.27 -0.23 -0.10 -0.21]
 [ 0.23  1.08 -0.45  0.19 -0.19  0.17  0.41 -0.01 -0.13 -0.13]
 [-0.19 -0.45  1.18 -0.04 -0.15 -0.20 -0.31 -0.04  0.30 -0.01]
 [-0.01  0.19 -0.04  0.90 -0.07  0.01  0.06 -0.19 -0.09  0.17]
 [-0.14 -0.19 -0.15 -0.07  1.18  0.09 -0.03  0.10  0.22  0.17]
 [ 0.29  0.17 -0.20  0.01  0.09  0.96  0.07 -0.04 -0.09 -0.23]
 [ 0.27  0.41 -0.31  0.06 -0.03  0.07  0.71 -0.10 -0.09 -0.06]
 [-0.23 -0.01 -0.04 -0.19  0.10 -0.04 -0.10  1.42 -0.30 -0.07]
 [-0.10 -0.13  0.30 -0.09  0.22 -0.09 -0.09 -0.30  0.91 -0.02]
 [-0.21 -0.13 -0.01  0.17  0.17 -0.23 -0.06 -0.07 -0.02  0.91]]

Please refer to the functions pandas_plink.read_rel() and pandas_plink.read_grm() for more details.
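
A relationship matrix is symmetric, which gives a quick sanity check after loading one. The sketch below uses a hypothetical helper, is_symmetric, and the top-left 3x3 block of the matrix printed above, copied in as plain Python lists. It is only an illustration and does not call pandas-plink.

```python
def is_symmetric(matrix, tol=1e-8):
    """Return True if a square list-of-lists matrix equals its transpose.

    Hypothetical helper, not part of pandas-plink.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # not square
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Top-left 3x3 block of the relationship matrix shown above.
K_block = [
    [0.89, 0.23, -0.19],
    [0.23, 1.08, -0.45],
    [-0.19, -0.45, 1.18],
]

print(is_symmetric(K_block))  # True
```

The tolerance argument allows for small floating-point differences between mirrored entries.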