The analysis of gene expression data has become routine for many large-scale studies in human disease. Much attention has focused on the discovery of small gene signatures from high-dimensional cross-sectional data that map to a specific function and/or disease. Decreasing costs of whole-genome microarray experiments, due to the fast expansion of molecular technologies, have made this type of experiment feasible for longitudinal population-based studies. As such, many groups are returning to study gene expression at multiple time points, thus effectively mapping disease progression, and not just gene-disease association. Such studies, however, are fraught with missing data at both the patient and time point level. Patient dropout and insufficient molecular material create large ragged arrays of gene expression data.
The focus of this research is (1) the identification of time-related gene expression patterns in the presence of missing time point data, and (2) the clustering of time-specific functional genes where missing data prevent conventional longitudinal analyses. We first simulate a large gene expression matrix with multiple time points (2-6), random effects (clinician, hospital site), and confounders such as disease status, biopsy site and inflammation status at the biopsy site. We then introduce a random component of missing data, such that a small proportion of simulated participants have missing time points, creating a ragged array. We examine the problem of detection above background (DABG), and compare standard medoid clustering with a more complex Gaussian process latent variable model (GP-LVM).
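The simulation step described above can be illustrated with a minimal sketch. It is not the study's simulation code: the function name `simulate_ragged_expression`, all parameter values, the linear time trend, and the restriction to one random effect (hospital site) and one confounder (disease status) are assumptions chosen only to show how dropping time points for a subset of participants produces a ragged array.

```python
import numpy as np

def simulate_ragged_expression(n_patients=40, n_genes=100, n_timepoints=6,
                               n_sites=4, frac_incomplete=0.2, seed=0):
    """Hypothetical sketch: simulate longitudinal expression, then drop visits."""
    rng = np.random.default_rng(seed)
    time = np.arange(n_timepoints)

    # Patient-level covariates: hospital site (random effect), disease status (confounder).
    site = rng.integers(0, n_sites, size=n_patients)
    disease = rng.integers(0, 2, size=n_patients)
    site_effect = rng.normal(0.0, 0.5, size=(n_sites, n_genes))
    disease_effect = rng.normal(0.0, 0.5, size=n_genes)

    # Assumption: the first 10% of genes follow a linear time trend; the rest are flat.
    n_dynamic = max(1, n_genes // 10)
    slope = np.zeros(n_genes)
    slope[:n_dynamic] = rng.normal(0.0, 1.0, size=n_dynamic)

    # Dense array of shape (patients, time points, genes).
    baseline = rng.normal(8.0, 1.0, size=n_genes)
    expr = (baseline[None, None, :]
            + time[None, :, None] * slope[None, None, :]
            + site_effect[site][:, None, :]
            + disease[:, None, None] * disease_effect[None, None, :]
            + rng.normal(0.0, 0.3, size=(n_patients, n_timepoints, n_genes)))

    # Drop time points for a subset of patients, keeping at least two visits each.
    observed = np.ones((n_patients, n_timepoints), dtype=bool)
    n_incomplete = int(round(frac_incomplete * n_patients))
    incomplete = rng.choice(n_patients, size=n_incomplete, replace=False)
    for p in incomplete:
        n_drop = rng.integers(1, n_timepoints - 1)  # between 1 and n_timepoints - 2
        dropped = rng.choice(n_timepoints, size=n_drop, replace=False)
        observed[p, dropped] = False

    # Ragged representation: one (n_observed_visits, n_genes) array per patient.
    ragged = [expr[p][observed[p]] for p in range(n_patients)]
    return ragged, observed, site, disease


ragged, observed, site, disease = simulate_ragged_expression()
visits_per_patient = [r.shape[0] for r in ragged]
```

Here `observed` records which visits survive for each participant, so downstream methods can either work on the ragged list directly or on the dense array with a mask.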
We demonstrate the identification of time-dependent gene signatures in simulated gene expression data. We show that DABG analyses may lead to falsely discarding time-dependent genes. Lastly, we show the benefits of using GP-LVM as compared with standard hierarchical clustering in the presence of longitudinal non-linear gene-disease associations.