Lead Institution: University of Illinois at Urbana-Champaign

Project Leader: Dan Roth

Research Progress

  • Abstract

    Algorithms for detecting sensitive data need to be robust and precise. To address this need, we developed global inference strategies for Information Extraction (IE) that reduce the inconsistency errors an IE system typically makes. We integrated several medical dictionaries into our system to provide valuable domain knowledge. We targeted two tasks: detecting entities and the relations between them, and coreference resolution. We have made our implementation of the coreference resolution algorithm publicly available. We have also released several lists that were designed after careful examination of EHRs from several hospitals.

  • Focus of the research/Market need for this project

    An increasing number of health care institutions are storing patients’ clinical observations electronically because of recent US government initiatives that promote the use of electronic health records (EHRs). Information in hospital systems can often be seen by many people, which poses a serious privacy concern. Hence, there is a strong market need for high-accuracy algorithms that automatically detect sensitive data. In addition to addressing this market need, MedIE would also be instrumental in using free-text information to drive Computerized Clinical Decision Support (CDS), which aims to aid the decision making of health care providers by providing easily accessible health-related information at the point and time it is needed.

  • Project Aims/Goals

    Our primary goal in this project is to develop techniques that automatically detect sensitive information in clinical narratives (which are part of EHRs).
    We targeted the following types (or categories) of sensitive data:

    • Mental health; abuse in the family
    • Drug abuse; hospitalization related to it
    • HIV data
    • Genomic data; indication of genetic information in EHRs
    • Sexually transmitted diseases

  • Key Conclusions/Significant Findings/Milestones reached/Deliverables

    Building on our earlier work on identifying concepts in medical documents, we developed and released a software package that finds possible instances of drug abuse in a clinical narrative. This software is based on SNOMED CT and maintains a list of medical concepts related to drug abuse in the SNOMED CT hierarchy. These concepts, in turn, were identified using a set-expansion system that we developed earlier under this grant. The details of this system can be found in the technical report listed below.
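
    As a rough illustration of the list-based detection described above, the following minimal sketch matches a small concept list against a clinical narrative. It is not the released package: the concept IDs, terms, and function names are hypothetical placeholders (not real SNOMED CT entries), and simple case-insensitive phrase matching is assumed in place of whatever matching the actual software performs.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical concept list: placeholder IDs and terms for illustration only.
# These are NOT real SNOMED CT identifiers or the list shipped with the package.
DRUG_ABUSE_CONCEPTS: Dict[str, List[str]] = {
    "C001": ["cocaine abuse", "cocaine dependence"],
    "C002": ["heroin overdose"],
    "C003": ["intravenous drug use", "iv drug use"],
}


def build_pattern(concepts: Dict[str, List[str]]):
    """Compile one case-insensitive regex covering every term in the list."""
    term_to_id = {
        term.lower(): concept_id
        for concept_id, terms in concepts.items()
        for term in terms
    }
    # Try longer terms first so a longer phrase wins over a shorter one.
    ordered = sorted(term_to_id, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, term_to_id


def find_mentions(
    text: str, concepts: Dict[str, List[str]]
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched_text, concept_id) for each mention found."""
    pattern, term_to_id = build_pattern(concepts)
    return [
        (m.start(), m.end(), m.group(0), term_to_id[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


note = "Patient admitted after heroin overdose. History of IV drug use."
mentions = find_mentions(note, DRUG_ABUSE_CONCEPTS)
for start, end, surface, concept_id in mentions:
    print(f"{concept_id}: '{surface}' at [{start}, {end})")
```

    In this sketch the concept list plays the role of the concepts gathered by set expansion; a real system would also need to handle negation, abbreviations, and spelling variants, which plain phrase matching does not cover.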

    We continued to improve our technologies for natural language analysis of medical documents, released improved versions of our packages, and published several papers on this technology at leading conferences and in journals.

  • Materials Available for Other Investigators/interested parties
  • Market entry strategies

    US government initiatives that promote the adoption of Electronic Health Records have resulted in a large number of vendors for both EHRs and CDS. Top EHR vendors include ABELSoft Corporation, AdvancedMD, Allmeds Inc., Cerner Corporation, and McKesson. Top vendors for CDS include Zynx Health, ESAGIL, and PEMSoft. These vendors provide solutions for small, medium, and large practices. The EHR and CDS systems they provide need to be very robust because they involve the security of patient data and also affect clinical workflow. Designing such robust solutions is quite expensive, and the tools we have developed would significantly help in such design. Vendors would also benefit from our conference and journal papers and from the domain knowledge that we will release.

Extraction of Events and Temporal Expressions from Clinical Narratives
Prateek Jindal and Dan Roth
Journal of Biomedical Informatics (JBI), Volume 46, pp S13-S19, December 2013

Using Soft Constraints in Joint Inference for Clinical Concept Recognition
Prateek Jindal and Dan Roth
Proceedings of International Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1808-1814, Seattle, USA, 2013

Detecting Privacy-Sensitive Events in Medical Text
Prateek Jindal, Dan Roth and Carl A. Gunter
UIUC CS Technical Report, 2013

End-to-End Coreference Resolution for Clinical Narratives
Prateek Jindal and Dan Roth
Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp 2106-2112, Beijing, China, 2013

Using Domain Knowledge and Domain-Inspired Discourse Model for Coreference Resolution for Clinical Narratives
Prateek Jindal and Dan Roth
Journal of the American Medical Informatics Association, July 2012