Audit: DATA

Lead Institution: Northwestern Memorial Hospital and University of Illinois at Urbana-Champaign

Project Leader: David Liebovitz and Carl Gunter

Research Progress

  • Abstract

    To make meaningful progress on audit studies, including EBAM, it is essential to develop data sets of audit logs that can be used for validation. This project concerned the development of such data sets and the associated problem of how to protect privacy in releasing data of this kind.

  • Focus of the research/Market need for this project

    It is impossible to make sound claims about vendor solutions for audit log analytics without data sets on which algorithms can be tested. Assembling such data sets is difficult, but each increment of progress improves the ability to choose between alternative techniques based on objective measures.

  • Project Aims/Goals

    This project has three primary goals. The first is to collect and curate realistic data sets that can be used for a broad spectrum of validations. The second is to develop technologies that can be used to share rich data sets of these kinds with adequate privacy assurances. The third is to develop a system for sharing data with researchers within the SHARPS team.

  • Key Conclusions/Significant Findings/Milestones Reached

    The primary achievements on this project were in developing a data set for validation of audit log analytics based on audit logs from Northwestern Memorial Hospital (NMH). We did this in two major stages. The first consisted of the Cerner audit logs from four months of accesses to patient records at the hospital. The core information in these logs tells when a chart user (such as a doctor or nurse) accesses the record of a patient and the reason given for this access. Names of chart users and patients were replaced by identifiers, and the mapping between names and identifiers was kept by NMH technical staff. Additional information in the records included the locations of patients, the services in which chart users were employed, and the Cerner positions of the chart users. Each access was associated with a given encounter (hospital stay) for a patient.
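
    The replacement of names by identifiers described above can be sketched as follows. This is a minimal illustration rather than the procedure NMH actually used: the field names (user, patient, reason), the sample records, and the Pseudonymizer class are all hypothetical. The point it shows is that the same name always maps to the same identifier, so accesses remain linkable, while the name-to-identifier mapping stays in a separate object that only the data custodian would hold.

```python
import secrets


class Pseudonymizer:
    """Replace names with opaque identifiers, keeping the mapping separate."""

    def __init__(self, prefix):
        self.prefix = prefix
        # name -> identifier; in practice held only by the data custodian
        self.mapping = {}

    def identifier_for(self, name):
        # Reuse the identifier for a known name so accesses stay linkable.
        if name not in self.mapping:
            self.mapping[name] = f"{self.prefix}-{secrets.token_hex(8)}"
        return self.mapping[name]


def deidentify(records, users, patients):
    """Return copies of the records with the name fields replaced."""
    released = []
    for rec in records:
        clean = dict(rec)
        clean["user"] = users.identifier_for(rec["user"])
        clean["patient"] = patients.identifier_for(rec["patient"])
        released.append(clean)
    return released


# Hypothetical records; field names and values are invented for illustration.
records = [
    {"user": "Dr. A", "patient": "Patient X", "reason": "Chart review"},
    {"user": "Nurse B", "patient": "Patient X", "reason": "Medication order"},
    {"user": "Dr. A", "patient": "Patient Y", "reason": "Chart review"},
]
users = Pseudonymizer("U")
patients = Pseudonymizer("P")
released = deidentify(records, users, patients)
```

    Random identifiers are used here instead of hashes of the names so that an identifier cannot be recomputed from a guessed name; that is a design choice of this sketch, not something stated in the report.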

    The following table summarizes statistics for the audit logs:

    The following table shows summary statistics for the patient records:

    The following illustrates the type of entries in the logs:

    Our studies on this type of data are described in more detail in the PATHWAYS and SIMILAR components. Typically these involved efforts such as examining sequences of reasons for accessing patient records and using learning techniques to indicate whether they looked unusual (potentially because of an access violation). The studies in SIMILAR went beyond this and used a second-generation data set that we developed. This one was expanded in two key ways. First, we increased the duration, covering a full year of chart accesses rather than just four months. Second, we added information derived from structured parts of patient records to give a data set that included diagnoses, problem lists, procedures, and medications. We ensured that the four-month data set was a subset of the year-long one so that the smaller data set could also be expanded (as shown in the table above).
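
    To make the idea of flagging unusual sequences of access reasons concrete, here is a small sketch. It is not the method used in PATHWAYS or SIMILAR; it swaps in a simple first-order Markov (bigram) model with add-one smoothing, and the reason labels and training sequences are invented. A sequence is scored by its average negative log-probability per step, so higher scores mean the sequence is less like the training data.

```python
import math
from collections import Counter, defaultdict

START = "<start>"  # marker for the beginning of a sequence


def train_bigram_model(sequences):
    """Count how often each access reason follows the previous one."""
    counts = defaultdict(Counter)
    vocab = set()
    for seq in sequences:
        prev = START
        for reason in seq:
            counts[prev][reason] += 1
            vocab.add(reason)
            prev = reason
    return counts, vocab


def surprise(seq, counts, vocab):
    """Average negative log-probability per step, with add-one smoothing."""
    if not seq:
        return 0.0
    slots = len(vocab) + 1  # one extra slot for reasons unseen in training
    total = 0.0
    prev = START
    for reason in seq:
        row = counts.get(prev, Counter())
        prob = (row[reason] + 1) / (sum(row.values()) + slots)
        total -= math.log(prob)
        prev = reason
    return total / len(seq)


# Hypothetical training data: two common orderings of access reasons.
training = (
    [["admit", "order", "review", "discharge"]] * 20
    + [["admit", "review", "order", "discharge"]] * 10
)
counts, vocab = train_bigram_model(training)

typical = ["admit", "order", "review", "discharge"]
odd = ["discharge", "admit", "discharge", "review"]
typical_score = surprise(typical, counts, vocab)
odd_score = surprise(odd, counts, vocab)
```

    In this toy setting odd_score comes out higher than typical_score. A real analysis would still need a threshold and human review before treating a high score as a possible access violation.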

    Another key direction of DATA was the development of techniques for privacy protection of heterogeneous graphical and sequence data. Most real-world networks are heterogeneous: their nodes and relations are of different types. For example, in a healthcare network, such as a hospital workflow network, nodes can be patients, doctors, nurses, medical tests, diseases, medicines, treatments, and so on. Prior work has shown how privacy can be compromised in homogeneous information networks by the use of specific types of graph patterns. We showed how the extra information derived from heterogeneity can be used to relax these assumptions. To characterize and demonstrate this added threat, we formally defined privacy risk in general anonymized heterogeneous information networks and presented a new re-identification attack that exploits the vulnerability. Our case study, published at EDBT, was based on a de-identified online social network.

    Workflows of an enterprise offer a unique perspective on the functioning of an organization. In particular, in a hospital setting, the care workflow of an in-patient will reveal typical patterns of care associated with particular demographics, diseases, procedures, and patients with existing problems. This workflow data can be used to identify anomalous or inefficient care patterns, or to compare patient care across different organizations for similar problems. While releasing such patient care data has many advantages, releasing the care patterns of even anonymized patients is fraught with privacy risks. Existing research has concentrated extensively on techniques to disclose tabular data (such as patient age, sex, and diagnosis) in a privacy-preserving manner, but little research is available on disclosing workflow patterns.

    We found that techniques from natural language processing, namely probabilistic grammars, are uniquely suited to compactly represent patient care workflows. This compactness is a result of regular patterns of care in a hospital for similar problems. The generative properties of these grammars can be used to efficiently generate synthetic workflow patterns that are realistic representations of patient care in an enterprise, without compromising the privacy of the individual patients.
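
    The generative use of a probabilistic grammar can be illustrated with a small sketch. The grammar below is a hypothetical toy: its symbols, rules, and probabilities are invented and are not taken from the NMH data or from our paper. The sketch covers only the sampling step, in which a synthetic workflow is produced by repeatedly expanding nonterminals according to rule probabilities. It does not show how a grammar would be learned from real workflows, and sampling alone does not establish any privacy guarantee.

```python
import random

# Hypothetical toy grammar. Keys are nonterminals; every other symbol is a
# terminal care event. Each rule is (probability, right-hand side).
GRAMMAR = {
    "STAY": [(1.0, ["admit", "CARE", "discharge"])],
    "CARE": [(0.6, ["ROUND"]), (0.4, ["ROUND", "CARE"])],
    "ROUND": [
        (0.5, ["exam", "medication"]),
        (0.3, ["exam", "lab", "medication"]),
        (0.2, ["exam", "procedure"]),
    ],
}


def generate(symbol, grammar, rng):
    """Expand a symbol into a list of terminal events by sampling rules."""
    if symbol not in grammar:
        return [symbol]  # terminal: emit the event itself
    weights = [prob for prob, _ in grammar[symbol]]
    expansions = [rhs for _, rhs in grammar[symbol]]
    chosen = rng.choices(expansions, weights=weights, k=1)[0]
    events = []
    for part in chosen:
        events.extend(generate(part, grammar, rng))
    return events


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = [generate("STAY", GRAMMAR, rng) for _ in range(5)]
```

    Each element of synthetic is one made-up patient stay, beginning with admit and ending with discharge. The recursive CARE rule is what lets a handful of rules describe stays of varying length, which is the compactness referred to above.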

    A final aspect of our efforts was the development of agreements to allow sharing of audit log data between the SHARPS team members. Given the issues with de-identification described above, we pursued a dual approach. First, we carried out basic de-identification of HIPAA safe harbor items such as patient name and zip code, and we also de-identified chart users (to protect the medical professionals). Second, given the sensitive nature of the data and its complexity, we developed a Data Use Agreement (DUA) based on that of the NIH eMERGE consortium and used this for sharing within SHARPS. This DUA prohibits re-identification and other uses of the data that could create privacy risks.

  • Available Materials for Other Investigators/Interested parties

    For the reasons given above, we are not able to share the NMH data sets with the public. However, we have written extensively on the results we developed, with details sufficient that the studies could be carried out on other data sets.

  • Market entry strategies

    See the discussion for EBAM.


Generative Grammars for Privacy-Preserving Data Publishing
Ravinder Shankesi, Vincent Bindschaedler, Aston Zhang, Carl A. Gunter, David Liebovitz, and Brad Malin
Under Review

Privacy Risk in Anonymized Heterogeneous Information Networks
Aston Zhang, Xing Xie, Kevin Chen-Chuan Chang, Carl A. Gunter, Jiawei Han, and XiaoFeng Wang
International Conference on Extending Database Technology (EDBT′14), March 2014

Facilitating Patient and Administrator Analyses of Electronic Health Record Accesses
Eric Duffy
Master of Science Thesis, University of Illinois at Urbana-Champaign, August 2013

Requirements and Design for an Extensible Toolkit for Analyzing EMR Audit Logs
Eric Duffy, Steve Nyemba, Carl A. Gunter, David Liebovitz, and Bradley Malin
USENIX Workshop on Health Information Technologies, August 2013

Tragedy of Anticommons in Digital Right Management of Medical Records
Quanyan Zhu, Carl A. Gunter, and Tamer Başar
USENIX Workshop on Health Security and Privacy (HealthSec12), August 2012

Experience-Based Access Management: A Life-Cycle Framework for Identity and Access Management Systems
Carl A. Gunter, David M. Liebovitz, and Bradley Malin
IEEE Security & Privacy, September/October 2011

The following is a list of additional studies from the PATHWAYS and SIMILAR components that used the NMH data set.

Learning a Medical Specialty from a Provider Treatment History
Xun Lu, Aston Zhang, Carl A. Gunter, Daniel Fabbri, David Liebovitz, and Bradley Malin
Under Review

Learning to Discover New Medical Specialties via Patient Treatment Histories
Xun Lu, Aston Zhang, Carl A. Gunter, Daniel Fabbri, David Liebovitz, and Bradley Malin
Under Review

Decide Now or Decide Later? Quantifying the Tradeoff between Prospective and Retrospective Access Decisions
Wen Zhang, You Chen, Ted Cybulski, Carl A. Gunter, Daniel Fabbri, Patrick Lawlor, David Liebovitz and Brad Malin
Under Review

Diagnosis Based Specialist Identification in the Hospital
Xun Lu
Master of Science Thesis, University of Illinois at Urbana-Champaign, May 2014

Modeling and Detecting Anomalous Topic Access in EMR Audit Logs
Siddharth Gupta
Master of Science Thesis, University of Illinois at Urbana-Champaign, May 2013

Mining Deviations from Patient Care Pathways via Electronic Medical Record System Audits
He Zhang, Sanjay Mehrotra, David Liebovitz, Carl A. Gunter, and Bradley Malin
ACM Transactions on Management Information Systems (TMIS), volume 4, number 4, article 17, December 2013

Modeling and Detecting Anomalous Topic Access
Siddharth Gupta, Casey Hanson, Carl A. Gunter, Mario Frank, David Liebovitz, and Bradley Malin
IEEE Intelligence and Security Informatics (ISI 13), June 2013

Evolving Role Definitions through Permission Invocation Patterns
Wen Zhang, You Chen, Carl A. Gunter, David Liebovitz, and Bradley Malin
ACM Symposium on Access Control Models and Technologies, June 2013

Role Prediction using Electronic Medical Record System Audits
Wen Zhang, Carl A. Gunter, David Liebovitz, Jian Tian and Bradley Malin
AMIA 2011 Annual Symposium, Washington, DC, October 2011