Biomedical Data Science
Since 2016 I have been developing and teaching a 10-credit (20-hour) course on the analysis of biomedical data using the R statistical software, as part of the MSc in Operational Research (with Data Science) and the MSc in Statistics.
The course covers the following topics over 5 lectures (10 hours in total):
Introduction to biomedical data
- Typical research questions: association, causation, discovery and prediction
- Types of biomedical data: routine data (consented and unconsented), phenotypic biomarkers, genetic data, derived data
- Identifying problems in real-world data
- Data cleaning, alignment, imputation and exploration
- Mechanisms of missing data
Discovering associations
- Covariance and correlation
- Statistical inference and linear regression
- Solving the least squares problem
- Linear algebra considerations and collinearity
- Hypothesis testing
- Power considerations
- Assessing the fit of the model
Logistic regression and predictive models
- Case-control studies
- Generalised linear models
- Logistic regression
- Odds ratio and interpretation of results
- Likelihood and model comparison
- Measures of discrimination and calibration performance
- Predictive models and cross-validation
Biomarker discovery and high-dimensional datasets
- High-throughput data (proteomics, metabolomics, lipidomics, glycomics)
- Biomarkers and biomarker discovery
- Dimensionality reduction: clustering and PCA
- Multiple testing
- Subset selection approaches
- Penalised regression: LASSO, ridge regression, elastic nets
Prediction from genetic data
- Causality, confounding and stratification
- Introduction to genetic data
- Genetic variation
- Genome-wide association studies
- GWAS meta-analysis
- Approaches for genotypic prediction and genetic risk scores
The course is accompanied by self-guided lab material for learning and practising how to perform analyses in R (10 hours in total):
Lab 1: Introduction to R
- Interactive terminal and workspaces
- Object types and data structures
- Basic functions and operators
Lab 2: Data preparation and linear regression
- Merging and simple imputations
- Statistical summaries and plots
- Writing functions and loops
- Fitting linear regression models
Lab 3: Logistic regression and predictive models
- Using R packages
- Fitting logistic regression models
- Making predictions on held-out data
Lab 4: High-dimensional datasets
- Correlation plots and PCA
- Subset selection in R
- Regularisation approaches
Lab 5: Prediction from genetic data
- Performing genome-wide association studies
- Computing genetic risk scores
- Prediction from genetic scores
- Performing a GWAS meta-analysis