Clinical Data Analysis

Rätsch Research Group: Clinical Data Analysis


An electronic health record (EHR) is a digital collection of patient health information. Ideally, every instance of patient care is recorded as a time-stamped entry in the EHR, with clinical data in a variety of formats such as clinical text notes, pathology images and genomic data. To date, the electronic record system of Memorial Sloan Kettering Cancer Center (MSKCC) holds entries for over 1.3 million unique patients. This makes for a complex and growing dataset, and an exciting opportunity to develop novel algorithms for use in the biomedical field. We are developing advanced methods for feature extraction from this growing body of clinical data, and using these features to develop tools that improve patient care, explore multi-scale phenotype learning and help physicians streamline their work.


Topic Modeling for Mining Clinical Notes

A major challenge in mining electronic health records is that relevant features are often embedded in free-text clinical notes written by domain experts for human readers. This text includes details such as patient history, symptoms and care plans that cannot be found elsewhere in a patient’s EHR. By employing generative topic models tailored to this rich source of patient data, we can create a digitized representation of otherwise computationally unwieldy text notes for use in further data analysis. As a proof of concept, we applied this strategy to the clinical notes of 5000 patients [1]. These patients were chosen because they had undergone a common genetic screening panel. By analyzing correlations between patients’ clinical text topics and their genetic testing results, we independently re-identified several notable correlations between patient symptoms and their cancer mutations. Given a wider set of patient notes and a less studied set of genetic tests, this type of analysis could reveal new and unexpected patterns.
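
To make the idea of a topic representation concrete, the following is a minimal sketch of a generic topic model (latent Dirichlet allocation fitted by collapsed Gibbs sampling) applied to a handful of invented toy notes. It is not the tailored model used in [1]; the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the example notes are all assumptions made for illustration.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not the model of [1]).

    docs is a list of token lists. Returns (theta, phi, vocab), where
    theta[d][k] is the share of topic k in document d and phi[k][v] is the
    probability of vocabulary word v under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return theta, phi, vocab


# Invented toy notes, not real patient data.
notes = [
    "cough fever chest pain cough",
    "fever cough fatigue fever",
    "rash itching skin rash",
    "skin rash itching swelling",
]
docs = [note.split() for note in notes]
theta, phi, vocab = fit_lda(docs, n_topics=2)

for note, shares in zip(notes, theta):
    print([round(s, 2) for s in shares], note)
```

Each row of `theta` is a fixed-length numeric summary of one note. In a setting like the one described above, such rows could then be compared against per-patient genetic test results, for example by correlating each topic's share with the presence of a given mutation.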

Learning the Dynamics of Topic Evolution in Clinical Text Time Series

We are interested in detecting dynamic structure in time series of clinical text in order to infer variables that describe patients’ health trajectories. In a first study, we learned a Markov model over the topic representations of sequences of patient reports, which yields transition probabilities and typical health trajectories. We combined this temporal model with patient survival information to discover correlations between clinical note topics and mortality [2]. We are also investigating richer generative models that incorporate the temporal structure into a novel dynamic latent variable model. The aim is to obtain a (naturally) hierarchical representation of the clinical notes and their dynamics, and to combine the prediction of health trajectories with the existing survival data.
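
As a rough illustration of how transition probabilities can be estimated from sequences of notes, the sketch below fits a first-order Markov chain by counting transitions. It makes a simplifying assumption that is not taken from [2]: each note is reduced to the index of its single dominant topic, so that a patient becomes a sequence of discrete states. The function name `estimate_transitions`, the `pseudocount` parameter, and the toy sequences are hypothetical.

```python
def estimate_transitions(sequences, n_states, pseudocount=0.0):
    """Maximum-likelihood transition matrix of a first-order Markov chain.

    sequences is a list of state-index lists, one per patient, ordered in
    time. Entry [i][j] of the result estimates P(next = j | current = i).
    A positive pseudocount applies additive smoothing.
    """
    counts = [[float(pseudocount)] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Count every pair of consecutive states.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1.0

    transitions = []
    for row in counts:
        total = sum(row)
        if total > 0:
            transitions.append([c / total for c in row])
        else:
            # State never observed as a source: fall back to uniform.
            transitions.append([1.0 / n_states] * n_states)
    return transitions


# Invented dominant-topic sequences for three hypothetical patients.
sequences = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [1, 2, 2],
]
P = estimate_transitions(sequences, n_states=3)

for row in P:
    print([round(p, 3) for p in row])
```

Rows of `P` with most of their mass on the diagonal correspond to topics that tend to persist from one note to the next, while large off-diagonal entries indicate common moves between topics. Relating such trajectories to survival would need an additional model that is not shown here.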



  1. K R Chan, X Lou, T Karaletsos, C Crosbie, S Gardos, D Artz, G Rätsch. An Empirical Analysis of Topic Modeling for Mining Cancer Notes. ICDM Biological Data Mining and its Applications in Healthcare (ICDM-BioDM). 2013.
  2. T Karaletsos, X Lou, K R Chan, C Crosbie, G Rätsch. Towards an integrated dynamic model of temporal structure of clinical text notes and interactions with genetic profiles. Extended abstract, NIPS Workshop on Machine Learning for Clinical Data Analysis in Healthcare. 2013.