Abstract: The data science ecosystem encompasses data fairness, scalable and powerful statistical and ML/AI methods and tools, and trustworthy science that accounts for uncertainty and improves interpretability. In this talk, I will discuss the challenges and opportunities as we navigate the crossroads of statistics and AI to empower genomic health data science. Examples include leveraging AI/ML-generated synthetic data to empower statistical analysis of large biobank data in the presence of missing data, trans-ancestry genetic risk prediction by treating ancestry as a continuum, and scalable analysis of the UK Biobank whole genome sequencing data of 500,000 subjects on the UK Biobank cloud platform RAP. This talk aims to ignite proactive and thought-provoking discussions, foster cross-disciplinary collaboration, and cultivate open-minded approaches to advance scientific discovery.
Bio: Xihong Lin is Professor and Former Chair of the Department of Biostatistics and Coordinating Director of the Program in Quantitative Genomics at the Harvard T. H. Chan School of Public Health, Professor in the Department of Statistics in the Faculty of Arts and Sciences at Harvard University, and Associate Member of the Broad Institute of MIT and Harvard.
Dr. Lin’s research interests lie in the development and application of scalable statistical and machine learning methods for the analysis of massive and complex genetic, genomic, epidemiological, and health data. Examples of her current research include analytic methods and applications for large-scale whole genome sequencing studies, biobanks, and electronic health records; techniques and tools for whole genome variant functional annotation; analysis of the interplay of genes and environment; multiple phenotype analysis; and polygenic risk prediction and heritability estimation. Additional examples include integrative analysis of different types of data, Mendelian randomization, causal mediation analysis and causal inference, federated and transfer learning, single-cell genomics, analysis of epidemiological and complex observational studies, and analysis of COVID-19 epidemic data. Dr. Lin’s theoretical and computational statistical research includes statistical methods for testing a large number of complex hypotheses, causal inference, statistical and ML methods for large matrices, prediction models using high-dimensional data, federated and transfer learning, cloud-based statistical computing, mixed models, nonparametric and semiparametric regression, and statistical methods for epidemiological studies.