Medical language representation & Information Extraction from Clinical Reports

Medical language representation

Much important information is contained in clinical text notes written by doctors, nurses and other clinicians. These notes can contain recommended treatments, assessments of patient wellbeing, prognoses and notable developments: information which may not appear elsewhere in the patient’s EHR. However, integrating this text information requires a representation of medical language. We have worked on this problem at two levels. First, we developed a method to learn word representations specifically for medical use, integrating prior knowledge. This addresses two facts: medical English differs from generic English in assumed word meaning (e.g., ‘patient’), and medical text corpora are small relative to generic corpora [1]. Second, we have worked with others to represent entire clinical text notes (text summarisation) in order to perform mortality prediction in the ICU [2].

Information Extraction from Clinical Reports

Clinical reports written by medical staff record key information about patient symptoms, diseases, and treatments. Being able to extract such information automatically is of great practical relevance for many aspects of clinical practice and for research involving patient data. However, these reports very often contain little or no structured information and are composed as free text within document templates. Moreover, the medical language in these reports is often highly specialized and only understandable in context. Without proper analysis tools, study nurses or physicians must annotate the reports manually to extract the relevant information in a structured form. Annotating such reports by hand is labor-intensive and hence very expensive, and the absence of automatic data review and extraction tools often prevents the efficient design of larger studies.

In this project, we aim to transfer technologies from Natural Language Processing and Machine Learning into the clinical IT environment. We aim to develop and evaluate a machine learning-based pipeline for automatic information extraction from medical reports. The main technical goal is a hybrid approach: rule-based dictionary matching generates the training data, and advanced language modeling techniques based on Deep Learning (BERT-based models) perform the actual medical concept identification task. In addition, a deployable prototype analysis tool will extract a large number of medical concepts from clinical reports. This prototype can then be developed further within the hospital environment and integrated into the hospital IT systems to systematically extract information from existing and new clinical reports.

A past technical contribution relevant to this project is the de-identification of medical reports.
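
To make the dictionary-matching half of the hybrid approach described above more concrete, the following is a minimal sketch of how a lexicon could turn free text into token-level training labels. It is an illustration under stated assumptions, not the project's actual pipeline: the lexicon entries, the concept types, the BIO tagging scheme, and the function names `tokenize` and `weak_bio_labels` are all hypothetical.

```python
import re

# Hypothetical toy lexicon mapping surface forms to concept types.
# A real system would load a much larger medical dictionary.
LEXICON = {
    "diabetes mellitus": "DISEASE",
    "chest pain": "SYMPTOM",
    "metformin": "DRUG",
}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def weak_bio_labels(text, lexicon=LEXICON):
    """Assign BIO labels to tokens by case-insensitive dictionary matching."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)

    # Match longer entries first so multi-word terms take precedence.
    entries = sorted(
        ((tokenize(term.lower()), label) for term, label in lexicon.items()),
        key=lambda entry: -len(entry[0]),
    )
    for term_tokens, label in entries:
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            span_is_free = all(tag == "O" for tag in labels[i:i + n])
            if lowered[i:i + n] == term_tokens and span_is_free:
                labels[i] = "B-" + label
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + label
    return tokens, labels


tokens, labels = weak_bio_labels(
    "Patient with diabetes mellitus reports chest pain; started Metformin."
)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Token and label pairs of this kind are the usual input format for fine-tuning a token-classification model. Because the labels come from exact matching, they are noisy (no handling of negation, abbreviations, or misspellings), which is the gap the learned model is meant to close.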

In a previous collaboration between the Data Service group at University Hospital Zurich and the Biomedical Informatics lab, we developed a rule-based pipeline for de-identifying clinical reports as they are stored within the hospital’s KISIM system (also used within the Tumor Profiler Study). The extracted JSON files are loaded into the system and converted to the GATE format. Next, tokens that contain identifying information, such as names, locations, ages, dates, organizations, and occupations, are annotated. These entities are recognized using different lexica and JAPE rules (e.g., a word such as "Dr." typically precedes a name). Each annotated entity is then replaced with an equivalent token from the same category. For example, dates are shifted by a random number of days so that the original dates cannot be recovered.
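
The annotate-then-substitute idea can be illustrated with a short sketch. The pipeline described above is built on GATE and JAPE; the Python below is not that implementation but a simplified stand-in that uses regular expressions. The patterns, the surrogate name list, the `deidentify` function, and the choice of a single offset per document (so that intervals between dates are preserved) are assumptions made for illustration.

```python
import random
import re
from datetime import datetime, timedelta

# Hypothetical patterns: dates written as DD.MM.YYYY and names introduced by "Dr.".
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")
NAME_RE = re.compile(r"\b(Dr\.)\s+([A-Z]\w+)")

# Hypothetical surrogate names used as same-category replacements.
SURROGATE_NAMES = ["Muster", "Beispiel", "Example"]


def deidentify(text, rng):
    """Shift all dates by one random offset and replace names after 'Dr.'."""
    # Assumption: one offset per document keeps time intervals intact.
    offset = timedelta(days=rng.randint(1, 30))

    def shift_date(match):
        try:
            date = datetime.strptime(match.group(0), "%d.%m.%Y")
        except ValueError:
            # Not a valid calendar date (e.g. 99.99.2020): leave unchanged.
            return match.group(0)
        return (date + offset).strftime("%d.%m.%Y")

    def replace_name(match):
        return f"{match.group(1)} {rng.choice(SURROGATE_NAMES)}"

    text = DATE_RE.sub(shift_date, text)
    text = NAME_RE.sub(replace_name, text)
    return text


report = "Seen by Dr. Keller on 03.02.2020, follow-up 10.02.2020."
print(deidentify(report, random.Random(0)))
```

Passing in the random number generator makes the output reproducible for testing. A production system would draw the offset from a secure source and cover the remaining entity categories (locations, ages, organizations, occupations) with their own lexica and rules.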

Involved group members: Rita Kuznetsova, Stephanie Hyland (alumna), Gunnar Rätsch

References
[1] Hyland, S. L., Karaletsos, T., & Rätsch, G. (2016). Knowledge transfer with medical language embeddings. arXiv preprint arXiv:1602.03551.
[2] Grnarova, P., Schmidt, F., Hyland, S. L., & Eickhoff, C. (2016). Neural document embeddings for intensive care patient mortality prediction. arXiv preprint arXiv:1612.00467.