Sequencing the human genome took 13 years to complete and cost $2.7B [1]. It has been almost two decades since the Human Genome Project (HGP) was completed and, while genomics is gradually becoming a greater part of our lives, its promised transformational impact on healthcare and our way of life is yet to come. The cost of DNA sequencing has been declining steadily; however, high-coverage (15X-30X) whole genome sequencing (WGS), the gold standard in genomics, still costs around $1000/sample for human samples.
This cost has significantly slowed the progress of genomics and delayed its entrance as a key player in health and medicine for humans and animals. While high-coverage WGS is not the only available method for obtaining genetic information from organisms, it is the most comprehensive and data-rich approach. Alternative methods can deliver genetic information at a lower cost but suffer from drawbacks such as limited throughput and scope, high operational complexity, and high initial capitalization.
This article discusses the potential of low-pass WGS (<1X genome coverage), in combination with imputation analysis, to address the high unmet need for an affordable substitute for high-coverage WGS in genomics. We present a case study of using low-pass WGS and imputation to solve a previously unsolved problem in feline breed analysis and discuss other potential applications of the method.
Genotyping microarrays reign over the direct-to-consumer genetics market and are also an essential part of clinical diagnostics. They have been around for decades and offer some indisputable advantages over high-coverage WGS. These include:
However, DNA microarrays also suffer from numerous limitations:
Sequencing only selected genomic regions through hybridization-based target enrichment (hybridization capture) can be extremely useful in genomics. The method can be performed on pre-made sample libraries already prepared for Next Generation Sequencing (NGS).
The technology relies on molecularly tagged, specifically designed probes (complementary to the genomic regions of interest) hybridizing to the sample DNA. Through the pull-down of the tagged probes already hybridized to the sample DNA, genomic regions of interest can be separated from the rest of the DNA and sequenced. The method provides:
Despite these positives, hybridization-based target enrichment sequencing has multiple drawbacks:
With dropping sequencing costs, low-pass WGS (typically defined as <1X coverage of the genome) presents an attractive substitute for DNA microarrays. For comparison, 0.4X coverage translates to around one read covering each of ~30 million genetic variants of the human genome, while microarrays provide information on orders of magnitude fewer variants [2].
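As a rough illustration of what a coverage figure means, the sketch below computes mean coverage from read count, read length, and genome size, and then estimates the fraction of sites hit by at least one read. It assumes reads land uniformly at random (a Poisson model), which real libraries only approximate; the run parameters (8 million reads of 150 bp, a 3.1 Gb genome) are hypothetical values chosen for the example.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def fraction_covered(coverage: float, min_reads: int = 1) -> float:
    """Probability that a site is covered by at least `min_reads` reads,
    assuming read depth per site follows a Poisson distribution (an assumption)."""
    p_fewer = sum(
        math.exp(-coverage) * coverage**k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Hypothetical low-pass run: 8 million 150 bp reads on a ~3.1 Gb genome
cov = mean_coverage(8_000_000, 150, 3_100_000_000)
print(f"mean coverage: {cov:.2f}X")
print(f"sites with >=1 read at 0.4X: {fraction_covered(0.4):.1%}")
print(f"sites with >=1 read at 30X:  {fraction_covered(30):.1%}")
```

Under this simplified model, a large share of individual sites receives no read at all at low-pass depths, which is why the method depends on combining evidence across neighboring variants rather than on any single site.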
Low-pass sequencing is particularly useful when combined with imputation analysis, which allows us to fill in sequencing data gaps and impute missing data from known gene variant co-inheritance patterns. Obtaining information on a few different variants in a block of DNA allows imputing the remaining known variants within the same block. A 0.4X genome coverage, combined with imputation analysis, was found to be 98.2% concordant with a DNA microarray-based analysis, while 1X coverage showed 99.2% concordance with microarray-based results [2]. Therefore, low-pass sequencing, in combination with imputation analysis, can provide accuracy comparable to that of DNA microarrays.
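The idea of imputing a block from a few observed variants can be sketched with a toy example. The code below is a deliberately simplified stand-in, not the method used in the cited study: it treats each sample as a single haplotype, picks the reference haplotype with the fewest mismatches at the observed sites, and copies that haplotype's alleles into the gaps. Production imputation tools use probabilistic models over many reference haplotypes instead. The reference panel and the observed calls are hypothetical.

```python
from typing import Optional

# Hypothetical reference haplotypes for one block (0 = reference allele, 1 = alternate)
REFERENCE_HAPLOTYPES = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
]


def impute_block(observed: list[Optional[int]], panel: list[list[int]]) -> list[int]:
    """Fill unobserved sites (None) from the panel haplotype that best
    matches the sites that were observed."""
    def mismatches(hap: list[int]) -> int:
        return sum(1 for o, h in zip(observed, hap) if o is not None and o != h)

    best = min(panel, key=mismatches)
    return [h if o is None else o for o, h in zip(observed, best)]


def concordance(calls_a: list[int], calls_b: list[int]) -> float:
    """Fraction of sites at which two call sets agree."""
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)


# Low-pass reads happened to cover only the first and fourth variant
observed = [1, None, None, 1, None, None]
imputed = impute_block(observed, REFERENCE_HAPLOTYPES)
print(imputed)  # [1, 0, 0, 1, 0, 1]

# Hypothetical microarray calls for the same sample, used as the comparison set
array_calls = [1, 0, 0, 1, 1, 1]
print(f"concordance: {concordance(imputed, array_calls):.3f}")
```

The concordance function mirrors the kind of comparison quoted above: imputed low-pass calls are scored site by site against calls from another platform.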
A potential drawback of using low-pass WGS + imputation is that there has to be prior knowledge available on gene variant co-inheritance patterns in the species of interest, i.e. a robust multi-generational haplotype map has to already exist or be built prior to analysis (discussed in more detail later in this text). If this prerequisite is in place, low-pass WGS can be an operationally straightforward and cost-effective way to obtain large amounts of genomic information when following a scale-optimized laboratory process. In addition, with the accumulation of a large low-pass WGS sample database, the potential for the discovery of novel variants increases.
Table 1. Comparison of genotyping methods.
Unlike dog breeds, cat breeds are extremely difficult to identify with precision. The reasons for this are associated with the domestic cat’s short history of selective breeding and its under-representation in genomics research.
Cat domestication started around 10,000 years ago and was related to the emergence of agriculture, since cats provided the perfect solution for rodent pest control [3,4,5]. During the gradual cat domestication process, very limited selective breeding occurred because freely breeding cats were still remarkably capable pest controllers [6,7,8]. Selective cat breeding only appeared in modern times, more specifically, over the past 150 years [8]. In evolutionary terms, this is an extremely short period of time for robust, genetically distinct sub-populations to form within any species.
In addition, selective cat breeding has historically been focused on aesthetic features (coat color, coat texture, and other typically monogenic traits) rather than genetically complex body structure or functional/behavioral traits. This has resulted in cat breeds often being defined by a single gene variant while sharing the majority of variants associated with life history and geographic origin. Conversely, it also happens that cats with diverse genotypes are classified as the same breed due to similar phenotypic presentation.
These factors make cats an unusual case of domesticated animals, especially when compared to dogs, where domestication started ~14,000 years ago and followed a rigid set of selective breeding rules focusing on traits defined by complex gene interactions [9]. The vast differences between cats’ and dogs’ evolutionary histories mean that breed analyses based on genotype will yield different conclusions for the two species.
There is a substantial disparity in the resources and research effort dedicated to feline genomics compared to canine genomics. There is also a stark contrast in the genome sequencing goals set for the two fields. While researchers from the 99 Lives cat genome project celebrated when they sequenced the genomes of 200 domestic cats (double their initial goal) [10], the Dog10K Consortium is aiming to sequence the genomes of 10,000 dogs and wild canids [11], and to sequence dog breeds at high depth so that different breeds can have their own high-quality genome assemblies. Until the 99 Lives project, there had been very little systematic effort to understand genome-wide differences between cat breeds.
The evolution of cat breeds is inextricably linked to the species’ ancestral and geographic history [8]. Therefore, cat breed analysis bears a high degree of similarity to human ethnic ancestry analysis. Both types of analysis assess the genomic similarity of the sample of interest to chunks of DNA (haplotype blocks), rather than to the small individual units comprising the genome (nucleotides). Gene variants (alleles) are usually inherited together in discrete haplotype blocks, which undergo very little ‘genetic shuffling’ across generations [12].
Because every species has its own haplotype inheritance pattern (multi-generational haplotype map, also known as linkage disequilibrium map), haplotype blocks can be used to assess a cat’s similarity to a particular breed using a limited amount of data (imputation).
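To make the notion of alleles being "inherited together" concrete, the sketch below computes r², a standard measure of linkage disequilibrium between two variant sites, from a set of phased haplotypes. Pairs of sites with r² near 1 carry nearly the same information, which is what makes imputation within a block possible. The four haplotypes are hypothetical, and the function is an illustration rather than the procedure used to build the haplotype map described here.

```python
def ld_r_squared(haplotypes: list[list[int]], i: int, j: int) -> float:
    """Squared correlation (r^2) between the alleles at sites i and j,
    computed from phased haplotypes coded 0 (reference) / 1 (alternate)."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # alt allele frequency at site i
    p_b = sum(h[j] for h in haplotypes) / n          # alt allele frequency at site j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of alt-alt haplotypes
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0


# Hypothetical phased haplotypes over three variant sites
haps = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]

print(ld_r_squared(haps, 0, 1))  # sites 0 and 1 always travel together
print(ld_r_squared(haps, 0, 2))  # site 2 is independent of site 0
```

In this toy panel, observing site 0 is enough to know site 1, while site 2 would have to be observed directly or inferred from other linked sites.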
As Figure 1 shows, different breeds have a characteristic combination of alleles inherited together within each haplotype block. A comprehensive breed analysis has to take into account the sample’s genetic similarity to all known feline haplotype blocks before judging the cat’s overall genetic proximity to a particular breed. Once a high-quality multi-generational haplotype map is available, low-pass WGS in conjunction with bioinformatic imputation can be used for cat breed analysis. The better the quality of the feline haplotype map, the more accurate the imputation-based breed analysis.
Building a thorough high-resolution haplotype map relies on having a reference panel composed of genome sequencing data from thousands of cats representing different breeds and geographic locations. If the reference panel has a small sample size or an obvious bias in population sampling, the allele frequency and allele co-segregation estimates on which the haplotype map is based will be inaccurate.
In addition, as mentioned previously, genetic differences between feline breeds are minor and difficult to detect unless a large cat genome repository exists. Given the already discussed limitations of the feline genomics effort, cats are disadvantaged when it comes to having sufficient publicly available genomic data for the creation of a highly detailed haplotype map (and therefore breed identification).
We first aimed to build the largest available reference panel of WGS data from purebred and mixed breed cats from across the world. Our reference panel is continuously enriched and updated with quality-controlled new cat DNA samples. This process of updating the reference panel takes full advantage of the screening potential of low-pass WGS.
We start by performing low-pass sequencing on candidate purebred cats. We then perform a population stratification analysis to see how well these samples cluster with existing high-coverage samples (>15X coverage). This inexpensive and computationally lean approach allows us to select the best candidates for subsequent high-coverage sequencing. To avoid biasing our computationally defined populations towards the founder samples, we also supplement the analysis with occasional non-screened purebred samples.
Using our sample reference panel, we are able to: (1) perform a Principal Component Analysis (PCA) to observe breed clusters based on genetic similarity (Eastern, Western, Exotic, Persian, and Polycat breed groups); (2) generate a high-resolution multi-generational haplotype map and utilize it for our imputation and downstream machine learning classification pipeline for breed analysis.
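The PCA step can be illustrated with a small dependency-free sketch. It takes a genotype matrix (rows are samples, columns are variants, entries are alternate allele counts 0/1/2), centers each variant, and finds the first principal component by power iteration. This is only a toy under stated assumptions: the six samples and four variants are hypothetical, and analyses at real scale would normally rely on a numerical library or dedicated population genetics software.

```python
import math


def pc1_scores(genotypes: list[list[float]], iterations: int = 200) -> list[float]:
    """Score of each sample on the first principal component,
    found by power iteration on the centered genotype matrix."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Arbitrary non-uniform start vector, so it is unlikely to be
    # orthogonal to the leading component
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]              # X v
        v_new = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0:
            break
        v = [x / norm for x in v_new]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Hypothetical genotypes: first three samples from one group, last three from another
genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 2, 2],
]

scores = pc1_scores(genotypes)
for label, s in zip(["A1", "A2", "A3", "B1", "B2", "B3"], scores):
    print(f"{label}: {s:+.2f}")
```

Samples from the same hypothetical group end up on the same side of the first component, which is the kind of clustering used to check whether a candidate sample sits with its expected breed group. The overall sign of a principal component is arbitrary, so only the relative positions matter.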
Every Basepaws customer sample undergoes low-pass WGS (average coverage of 0.44X) and the sequencing reads are mapped to the latest version of the domestic cat’s genome assembly (Felis_catus_9.0). Variant calling is then performed, followed by an imputation analysis of un-genotyped alleles with the help of our high-depth reference panel and our multi-generational haplotype map.
Next, we use our haplotype map to segment the cat sample’s genome into haplotype blocks, which are then compared against the haplotype blocks in our reference panel using a machine learning classification algorithm.
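The block-by-block comparison can be sketched as follows. The text does not specify which classification algorithm is used, so this example substitutes a very simple stand-in: each block of the sample is assigned to the breed group whose reference haplotype differs at the fewest sites, and the block-level labels are then summarized as genome-wide proportions. The reference haplotypes, the one-haplotype-per-group simplification, and the sample are all hypothetical; only the breed group names come from the article.

```python
from collections import Counter

# Hypothetical reference: for each haplotype block, one representative
# haplotype per breed group (0 = reference allele, 1 = alternate)
REFERENCE_BLOCKS = [
    {"Eastern": [0, 1, 1, 0], "Western": [1, 0, 0, 1], "Persian": [1, 1, 0, 0]},
    {"Eastern": [1, 1, 0], "Western": [0, 0, 1], "Persian": [0, 1, 1]},
    {"Eastern": [0, 0, 0, 1], "Western": [1, 1, 1, 0], "Persian": [0, 1, 0, 1]},
]


def hamming(a: list[int], b: list[int]) -> int:
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_blocks(sample_blocks, reference_blocks) -> list[str]:
    """Label each sample block with the closest breed group.
    Ties go to the group listed first in the reference dictionary."""
    labels = []
    for sample, refs in zip(sample_blocks, reference_blocks):
        labels.append(min(refs, key=lambda group: hamming(sample, refs[group])))
    return labels


def group_proportions(labels: list[str]) -> dict[str, float]:
    """Share of blocks assigned to each breed group."""
    counts = Counter(labels)
    return {group: count / len(labels) for group, count in counts.items()}


# Hypothetical imputed sample, already segmented into the same three blocks
sample_blocks = [[0, 1, 1, 0], [0, 0, 1], [0, 0, 0, 1]]

labels = classify_blocks(sample_blocks, REFERENCE_BLOCKS)
print(labels)
print(group_proportions(labels))
```

Working at the level of blocks rather than single nucleotides means a sample can be described as a mixture, with different parts of its genome resembling different breed groups.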
We use this analysis to deliver two types of insights regarding a cat’s breed:
Using low-pass WGS, combined with bioinformatic imputation, has allowed us to perform a high-accuracy cat breed analysis for a fraction of the cost (and time) that would have been required if an alternative approach, such as a DNA microarray, were used.
Low-pass WGS’s potential is becoming widely recognized and the technique is already being applied in multiple novel and diverse contexts. It is well known that geneticists currently see the world through a Caucasian-centric lens, because the majority of the available genomic data comes from Caucasian populations.
While performing high-coverage WGS on thousands of people from different populations is a costly endeavor, our understanding of the human genome is advanced enough to allow the use of low-pass WGS + imputation on under-represented populations to quickly (and cheaply) diversify our genetic outlook. Low-pass WGS has already been used to identify two new loci associated with major depressive disorder in a cohort of 10,640 Chinese women [13].
Another example is the Broad Institute/Harvard partnership on the Neuropsychiatric Genetics in African Populations (NeuroGAP) initiative, which aims to study psychiatric genetics in 35,000 African people [14,15]. Low-pass WGS is among the key methods being considered for this study.
Low-pass WGS also has a high potential for advancing our understanding of cancer and cancer care. One valuable use of this technique in cancer care is quality control of tumor biopsy samples prior to high-coverage sequencing and more in-depth genomic sample characterization. Low-pass WGS can be used to (1) confirm that isolated blood cells are indeed cancerous cells; (2) confirm that sequencing libraries generated from single cells are a uniform representation of the genome; (3) confirm that sequencing libraries from cell-free DNA (cfDNA) contain tumor DNA [15].
Apart from being used as a supplementary method for quality control, low-pass WGS can also be used as the primary method in cancer research and diagnostics. Researchers have used low-pass WGS of cfDNA, combined with imputation, to identify somatic copy number variations (SCNVs) in tumor biopsies of patients suffering from metastatic breast or prostate cancer [15]. Alternative methods for assessing SCNVs, such as microarrays, would typically require higher starting DNA amounts and provide lower resolution.
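One common way to look for copy number changes in shallow sequencing data is to count reads in fixed genomic bins and compare a sample against a control. The sketch below illustrates that general read-depth idea and is not the method of the cited work: it normalizes each bin by the library's total read count, takes the log2 ratio of sample to control, and flags bins beyond a threshold. The bin counts and the 0.3 threshold are hypothetical, and the code assumes every control bin has a nonzero count.

```python
import math


def log2_ratios(sample_counts: list[int], control_counts: list[int]) -> list[float]:
    """Per-bin log2 ratio of depth-normalized read counts (sample vs control).
    Assumes all counts are greater than zero."""
    s_total, c_total = sum(sample_counts), sum(control_counts)
    return [
        math.log2((s / s_total) / (c / c_total))
        for s, c in zip(sample_counts, control_counts)
    ]


def call_bins(ratios: list[float], threshold: float = 0.3) -> list[str]:
    """Label each bin as gain, loss, or neutral using a symmetric threshold
    (the threshold value is an illustrative assumption)."""
    return [
        "gain" if r > threshold else "loss" if r < -threshold else "neutral"
        for r in ratios
    ]


# Hypothetical read counts in five equally sized genomic bins
control = [100, 100, 100, 100, 100]
sample = [100, 100, 200, 100, 50]

ratios = log2_ratios(sample, control)
for r, call in zip(ratios, call_bins(ratios)):
    print(f"{r:+.2f}  {call}")
```

Because the signal comes from read counts aggregated over many kilobases rather than from individual bases, this kind of analysis remains informative even when most single sites are covered by few or no reads.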
Low-pass WGS, combined with imputation analysis, is being adopted across a growing range of genomics applications due to its low cost, operational convenience, and information density. As NGS prices continue to decrease and bioinformatic methods continue to evolve, this technology will become more commonly used.
References
1) genome.gov/human-genome-project/Completion-FAQ
2) Wasik, K., Berisa, T., Pickrell, J.K., Li, J.H., Fraser, D.J., King, K. and Cox, C., 2019. Comparing low-pass sequencing and genotyping for trait mapping in pharmacogenetics. bioRxiv, p.632141.
3) Vigne, J.D., Guilaine, J., Debue, K., Haye, L. and Gérard, P., 2004. Early taming of the cat in Cyprus. Science, 304(5668), pp.259-259.
4) Gupta, A.K., 2004. Origin of agriculture and domestication of plants and animals linked to early Holocene climate amelioration. Current Science, 87, pp.54-59.
5) Zohary, D. and Hopf, M., 2000. Domestication of plants in the Old World: the origin and spread of cultivated plants in West Asia, Europe and the Nile Valley (No. Ed. 3). Oxford University Press.
6) Dobney, K. and Larson, G., 2006. Genetics and animal domestication: new windows on an elusive process. Journal of Zoology, 269(2), pp.261-271.
7) Randi, E., Pierpaoli, M., Beaumont, M., Ragni, B. and Sforzi, A., 2001. Genetic identification of wild and domestic cats (Felis silvestris) and their hybrids using Bayesian clustering methods. Molecular Biology and Evolution, 18(9), pp.1679-1693.
8) Lipinski, M.J., Froenicke, L., Baysac, K.C., Billings, N.C., Leutenegger, C.M., Levy, A.M., Longeri, M., Niini, T., Ozpinar, H., Slater, M.R. and Pedersen, N.C., 2008. The ascent of cat breeds: genetic evaluations of breeds and worldwide random-bred populations. Genomics, 91(1), pp.12-21.
9) Adams, J., 2008. Genetics of Dog Breeding. Nature Education 1(1):144
10) missouri.edu/99lives
11) Wang, G.D., Larson, G., Kidd, J.M., vonHoldt, B.M., Ostrander, E.A. and Zhang, Y.P., 2019. Dog10K: The International Consortium of Canine Genome Sequencing. National Science Review.
12) broadinstitute.org/international-haplotype-map-project/haplotype-map
13) Cai, N., Bigdeli, T.B., Kretzschmar, W., Li, Y., Liang, J., Song, L., Hu, J., Li, Q., Jin, W., Hu, Z. and Wang, G., 2015. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature, 523(7562), p.588.
14) broadinstitute.org/stanley-center-psychiatric-research
15) cancer.gov/about-nci/organization/ccg/blog/2019/low-coverage-seq