pandas_plink.read_plink1_bin

pandas_plink.read_plink1_bin(bed, bim=None, fam=None, verbose=True, ref='a1', chunk=Chunk(nsamples=1024, nvariants=1024))

Read PLINK 1 binary files [a] into a data array.

A PLINK 1 binary file set consists of three files:

BED: containing the genotype.
BIM: containing variant information.
FAM: containing sample information.
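
To make the BED bullet concrete, here is a minimal sketch of how one variant's genotypes are packed in a PLINK 1 BED file, following the public PLINK 1 format description (three header bytes, then two bits per sample, four samples per byte). The helper name decode_variant, the label strings, and the in-memory bytes are illustrative assumptions; this is not how pandas_plink itself reads the file.

```python
MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 BED header, variant-major mode

# 2-bit code -> label, per the PLINK 1 BED format description.
# The label strings are made up for this sketch.
CODES = {
    0b00: "hom_first",   # homozygous for the first allele listed in the BIM file
    0b01: "missing",
    0b10: "het",
    0b11: "hom_second",  # homozygous for the second allele listed in the BIM file
}


def decode_variant(block: bytes, n_samples: int) -> list:
    """Hypothetical helper: decode one variant's packed genotypes."""
    labels = []
    for i in range(n_samples):
        byte = block[i // 4]                    # four samples per byte
        code = (byte >> (2 * (i % 4))) & 0b11   # low-order bits hold the first sample
        labels.append(CODES[code])
    return labels


# Synthetic in-memory "file": header plus one variant for five samples.
raw = MAGIC + bytes([0b11100100, 0b00000010])
assert raw[:3] == MAGIC
decoded = decode_variant(raw[3:], n_samples=5)
print(decoded)
```

Each variant occupies ceil(n_samples / 4) bytes, so the unused high bits of the last byte are padding.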

The user may provide a single path to a BED file, from which this function will try to infer the paths of the other two files. The function also accepts paths to multiple BED and BIM files, since it is common to have a data set split into multiple files, one per chromosome.
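
As a rough illustration of what path inference can look like, the sketch below derives BIM and FAM paths from a single BED path by swapping the file extension. The helper infer_companions and the example path are hypothetical, and the rule it applies (same stem, different suffix) is an assumption; the library's actual inference logic may differ, especially for wildcard patterns.

```python
from pathlib import PurePosixPath
from typing import Tuple


def infer_companions(bed_path: str) -> Tuple[str, str]:
    """Hypothetical helper: guess BIM and FAM paths by swapping the extension."""
    bed = PurePosixPath(bed_path)
    return str(bed.with_suffix(".bim")), str(bed.with_suffix(".fam"))


# Made-up path, used only to show the naming convention.
bim_path, fam_path = infer_companions("data/chr11.bed")
print(bim_path)
print(fam_path)
```

When the files do not follow this naming convention, passing bim and fam explicitly avoids relying on inference.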

This function returns a samples-by-variants matrix. This is a special kind of matrix whose rows and columns each carry multiple coordinates. Those coordinates hold the metadata from the FAM file (for samples) and the BIM files (for variants).

Examples

The following example reads two BED files and two BIM files corresponding to chromosomes 11 and 12, and reads a single FAM file whose filename is inferred from the BED filenames.

>>> from os.path import join
>>> from pandas_plink import read_plink1_bin
>>> from pandas_plink import get_data_folder
>>> G = read_plink1_bin(join(get_data_folder(), "chr*.bed"), verbose=False)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 1252)>
dask.array<concatenate, shape=(14, 1252), dtype=float32, chunksize=(14, 779), chunktype=numpy.ndarray>
Coordinates: (12/14)
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) <U11 'variant0' 'variant1' ... 'variant1251'
    fid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    iid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    father   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    mother   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    ...       ...
    chrom    (variant) object '11' '11' '11' '11' '11' ... '12' '12' '12' '12'
    snp      (variant) object '316849996' '316874359' ... '373081507'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int32 157439 181802 248969 ... 27163741 27205125 27367844
    a0       (variant) object 'C' 'G' 'G' 'C' 'C' 'T' ... 'A' 'G' 'A' 'T' 'G'
    a1       (variant) object 'T' 'C' 'C' 'T' 'T' 'A' ... 'G' 'A' 'T' 'C' 'A'
>>> print(G.shape)
(14, 1252)

Suppose we want the genotypes of chromosome 11 only:

>>> G = G.where(G.chrom == "11", drop=True)
>>> print(G)
<xarray.DataArray 'genotype' (sample: 14, variant: 779)>
dask.array<where, shape=(14, 779), dtype=float32, chunksize=(14, 779), chunktype=numpy.ndarray>
Coordinates: (12/14)
  * sample   (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
  * variant  (variant) <U11 'variant0' 'variant1' ... 'variant777' 'variant778'
    fid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    iid      (sample) object 'B001' 'B002' 'B003' ... 'B012' 'B013' 'B014'
    father   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    mother   (sample) object '0' '0' '0' '0' '0' '0' ... '0' '0' '0' '0' '0' '0'
    ...       ...
    chrom    (variant) object '11' '11' '11' '11' '11' ... '11' '11' '11' '11'
    snp      (variant) object '316849996' '316874359' ... '345698259'
    cm       (variant) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
    pos      (variant) int32 157439 181802 248969 ... 28937375 28961091 29005702
    a0       (variant) object 'C' 'G' 'G' 'C' 'C' 'T' ... 'A' 'C' 'A' 'A' 'T'
    a1       (variant) object 'T' 'C' 'C' 'T' 'T' 'A' ... 'G' 'T' 'G' 'C' 'C'
>>> print(G.shape)
(14, 779)

Let's now print the genotype value of sample B003 for variant variant5:

>>> print(G.sel(sample="B003", variant="variant5").values)
0.0

The special matrix we return is of type xarray.DataArray. More information about it can be found in the xarray documentation.
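
To experiment with the same selection idioms without any PLINK files, one option is to build a small xarray.DataArray by hand. The sketch below uses synthetic values, sample names, and positions (all made up) and attaches a few coordinates in the same spirit as the output above; it only mimics the layout and is not produced by read_plink1_bin.

```python
import numpy as np
import xarray as xr

# Synthetic 3-samples-by-4-variants matrix; values are invented for illustration.
data = np.array(
    [
        [0.0, 1.0, 2.0, 0.0],
        [1.0, 1.0, 0.0, 2.0],
        [2.0, 0.0, np.nan, 1.0],
    ],
    dtype=np.float32,
)

G = xr.DataArray(
    data,
    dims=("sample", "variant"),
    coords={
        # Index coordinates (marked with * in the printed representation).
        "sample": ["S1", "S2", "S3"],
        "variant": ["variant0", "variant1", "variant2", "variant3"],
        # Extra coordinates attached to one dimension each.
        "iid": ("sample", ["S1", "S2", "S3"]),
        "chrom": ("variant", ["11", "11", "12", "12"]),
        "pos": ("variant", [100, 200, 150, 300]),
    },
    name="genotype",
)

# Keep only the variants on chromosome 11.
chr11 = G.where(G.chrom == "11", drop=True)
print(chr11.shape)

# Look up a single value by sample and variant label.
value = float(G.sel(sample="S2", variant="variant3").values)
print(value)
```

Because chrom is attached to the variant dimension, the boolean mask G.chrom == "11" filters columns only and leaves the sample coordinates untouched.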

Parameters

bed (str) – Path to a BED file. It can contain shell-style wildcards to indicate multiple BED files.
bim (Optional[str]) – Path to a BIM file. It can contain shell-style wildcards to indicate multiple BIM files. It defaults to None, in which case the function tries to infer it.
fam (Optional[str]) – Path to a FAM file. It defaults to None, in which case the function tries to infer it.
verbose (bool) – True for progress information; False otherwise.
ref (str) – Reference allele. Specifies which allele the dosage matrix will count. It can be either "a1" (default) or "a0".
chunk (Chunk) – Data chunk specification. Useful to adjust the trade-off between computational overhead and IO usage. See pandas_plink.Chunk.
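
To illustrate what switching the counted allele means, the sketch below converts counts of one allele into counts of the other. It assumes diploid, biallelic genotypes (so the two counts sum to 2) and assumes missing calls are represented as NaN; the helper flip_reference and the input values are hypothetical, and this is not a description of the library's internals.

```python
import math


def flip_reference(dosage):
    """Hypothetical helper: turn counts of one allele into counts of the other."""
    # For a diploid, biallelic genotype the two allele counts sum to 2.
    # Missing values (NaN, by assumption) are passed through unchanged.
    return [d if math.isnan(d) else 2.0 - d for d in dosage]


a1_counts = [0.0, 1.0, 2.0, float("nan")]  # made-up dosages for four samples
a0_counts = flip_reference(a1_counts)
print(a0_counts)
```

Heterozygous genotypes are unaffected by the choice of counted allele; only the two homozygous values swap.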

Returns

Genotype with metadata.

Return type

xarray.DataArray

References

[a] PLINK 1 binary. https://www.cog-genomics.org/plink/2.0/input#bed