Breeding and Genetics Symposium II: Challenges and Opportunities with Big Data in Animal Breeding
Using whole-genome sequence to improve genomic prediction relative to high-density SNP arrays has fallen well below expectations, despite some overoptimistic computer simulations. Why is this so? First, NGS data are massive and noisy, and their bioinformatics analysis is computationally expensive at the scale needed in animal breeding. SNP calling is a delicate procedure that is especially sensitive to low sequencing depth. This makes NGS data far more expensive than array genotyping. Second, rare variants are the most frequent class of variants: population genetics theory dictates that the number of SNPs at a given frequency f is inversely proportional to f. For prediction purposes, rare variants are of little use, because they are unlikely to segregate in both the training and testing subpopulations. Third, sequence contains highly redundant information: the number of new SNPs discovered decreases quickly as new samples are added and, further, the low effective population sizes of domestic animals produce extensive linkage disequilibrium. What can we do about it? First, high-density genotypes can be imputed up to sequence; this has a mild, and limited, effect on improving accuracy. Second, many animals can be sequenced at very low depth. This is an extremely risky option that I discourage, owing to strong biases in calling heterozygous genotypes. Third, predictions can be constructed using some sort of prior information (e.g., known causative genes or GWAS results) together with high-density, perhaps custom-designed, arrays. I believe this is the most promising approach.
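The claim that the number of SNPs at frequency f is inversely proportional to f follows from the neutral infinite-sites model, under which the expected number of segregating sites with derived-allele count i in a sample of n chromosomes is theta/i. A minimal sketch (the sample size and the 5% rarity threshold are illustrative assumptions, not from the abstract):

```python
# Expected neutral site-frequency spectrum: E[#sites with derived count i] = theta / i.
# This illustrates why rare variants dominate sequence data.

def expected_sfs(n, theta=1.0):
    """Expected site counts for derived-allele counts i = 1..n-1 (neutral model)."""
    return {i: theta / i for i in range(1, n)}

n = 100                      # chromosomes sampled (hypothetical)
sfs = expected_sfs(n)
total = sum(sfs.values())
rare = sum(v for i, v in sfs.items() if i / n < 0.05)  # derived frequency below 5%
print(f"fraction of SNPs with derived-allele frequency < 5%: {rare / total:.2f}")
# -> about 0.40: singletons through i=4 alone account for ~40% of all SNPs
```

Since the expected counts fall off as 1/i, the lowest frequency classes always dominate, regardless of theta.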
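Why rare variants are unlikely to segregate in both subpopulations can be quantified with a simple sampling argument. Assuming random sampling of unrelated diploids, a variant with population allele frequency f is entirely absent from a sample of N individuals with probability (1 - f)^(2N); if it is absent from either the training or the testing set, it contributes nothing to prediction. The subset sizes below are hypothetical:

```python
# P(variant unobserved) in a random sample of N diploids, allele frequency f,
# assuming independent sampling of 2N allele copies.

def p_absent(f, n_diploids):
    """Probability that a variant of frequency f is missing from the sample."""
    return (1.0 - f) ** (2 * n_diploids)

# Illustrative subset sizes: 1000 training animals, 200 testing animals.
for f in (0.01, 0.001, 0.0001):
    p_train = p_absent(f, 1000)
    p_test = p_absent(f, 200)
    p_missing_somewhere = 1 - (1 - p_train) * (1 - p_test)
    print(f"f={f}: P(absent from training or testing) = {p_missing_somewhere:.3f}")
```

For f = 0.001 the variant is missing from one of the two subsets most of the time, so its effect can never be both estimated and used.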
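The bias against heterozygotes at very low depth also has a simple quantitative core. Assuming each read samples one of a heterozygote's two alleles independently with probability 1/2 (ignoring sequencing error), a naive caller that reports only the alleles observed will miscall the site as homozygous whenever all d reads come from the same allele, which happens with probability 2·(1/2)^d:

```python
# P(true heterozygote shows reads from only one allele | depth d reads),
# i.e. the chance a naive caller reports a homozygote. Assumes error-free
# reads drawn independently from the two alleles.

def p_het_miscalled_hom(depth):
    """P(all reads from one allele | true heterozygote, given read depth)."""
    return 2 * 0.5 ** depth

for d in (1, 2, 4, 8):
    print(f"depth {d}: P(het called hom) = {p_het_miscalled_hom(d):.3f}")
# depth 1: 1.000 -- a single read can never reveal heterozygosity
# depth 2: 0.500
# depth 4: 0.125
# depth 8: 0.008
```

At 1-2x coverage, half or more of heterozygous genotypes are undercalled, which is the bias behind discouraging very-low-depth sequencing of many animals.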