Complying with Patient Expectations for Data De-Identification in i2b2
Shawn N Murphy MD, Ph.D.2, Michael E. Mendis1, Susanne Churchill Ph.D.1,
Isaac Kohane MD, Ph.D.3
2Massachusetts General Hospital, Boston, MA; 1Partners Healthcare, Wellesley, MA;
3Children's Hospital, Boston, MA
Data de-identification is surprisingly difficult. The richness of Electronic Health Record
data defeats many earlier approaches that promised to produce unidentifiable data sets.
Patients expect us to proceed with our research without risking their privacy.
Balancing 1) de-identification, 2) restricted data distribution, and 3) hardware security
may be a more realistic way to meet this expectation than relying on a purely
computational solution. The i2b2 platform illustrates this balance: it weighs all three
when providing several levels of data privacy settings targeted to different
classes of trusted users.
Background – Informatics for Integrating Biology and the Bedside (i2b2) is one of the sponsored
initiatives of the NIH Roadmap National Centers for Biomedical Computing
(http://www.bisti.nih.gov/ncbc/). One of the goals of i2b2 is to provide clinical investigators
broadly with the software tools necessary to collect and manage project-related clinical research
data in the genomics age as a cohesive entity—a software suite to construct and manage the
modern clinical research chart. The ability to manage data at various levels of de-identification
and security is a critical feature of this platform. However, the requirement for easy, pervasive,
and feature-rich use competes directly with the requirement to preserve patient
privacy.
Methods – Steps toward research computing that respects patient privacy concerns
must be taken with a full understanding of the intended outcome. These steps should include a
consistent set of options that allow various “middle grounds” to be reached for the data
privacy setting in a research application. All three considerations above (de-identification,
restricted data distribution, and hardware security) should be taken into account in the
application design.
Results – Five patient privacy settings were created in i2b2 under which one can explore patient
data. The “PHI-enabled user” setting allows full disclosure of identifiers to the
researcher and is intended for use under a specific Institutional Review Board (IRB) approval.
The “Quasi-de-identified user” setting presents a limited data set, but with a possible error
rate; it is used mostly for de-identified text reports that may have 2-3%
“missed” PHI still embedded. The “LDS user” setting offers a strict HIPAA-defined
limited data set. The “Aggregate-data-only user” setting presents only aggregate
results from queries, but may not absolutely comply with full HIPAA de-identification in limited
circumstances. Finally, the “Obfuscated-data user” setting offers HIPAA
statistically de-identified data access.
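The five settings listed in the Results can be sketched as an ordered enumeration. This is only an illustration: the names `PrivacySetting` and `can_view` are hypothetical and are not i2b2 identifiers, and the numeric ordering assumes that the order in which the abstract lists the settings reflects decreasing access to identifying detail.

```python
from enum import IntEnum


class PrivacySetting(IntEnum):
    """The five settings named in the abstract (hypothetical identifiers).

    Assumption: a higher value means more identifying detail is visible,
    following the order in which the abstract lists the settings.
    """

    OBFUSCATED_DATA = 1      # HIPAA statistically de-identified data access
    AGGREGATE_DATA_ONLY = 2  # aggregate query results only
    LDS = 3                  # strict HIPAA-defined limited data set
    QUASI_DE_IDENTIFIED = 4  # limited data set with a possible error rate
    PHI_ENABLED = 5          # full identifiers, under specific IRB approval


def can_view(user_setting: PrivacySetting, required: PrivacySetting) -> bool:
    """Return True if the user's setting is at least the one a resource requires."""
    return user_setting >= required
```

Under this assumed ordering, a check such as `can_view(PrivacySetting.AGGREGATE_DATA_ONLY, PrivacySetting.LDS)` is false, so an aggregate-only user would not be shown limited-data-set records.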
Discussion – Each of the settings above corresponds to a separate stratum of patient expectations and
investigator access. Each level carries a separate “contract”: at the PHI-enabled
level, the contract requires full disk encryption, restrictive keys, and very small pools of
specifically IRB-approved investigators, while at the Obfuscated-data level it permits a simple,
unencrypted data repository protected by simple passwords, with cross-institutional
investigator access. These contracts can optimize the balance between access to data for
research and the privacy protections that meet patient expectations.
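The per-level “contracts” in the Discussion could be recorded as a small lookup table. The sketch below is an assumption-laden illustration, not i2b2 configuration: `Contract`, `CONTRACTS`, and `requires_disk_encryption` are hypothetical names, and only the two levels whose terms the abstract actually states are filled in. The other three are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Contract:
    """Safeguards attached to one privacy level (hypothetical structure)."""

    disk_encryption: bool
    credentials: str
    investigator_pool: str


# Only the two contracts described in the abstract are populated.
CONTRACTS: Dict[str, Optional[Contract]] = {
    "PHI-enabled": Contract(
        disk_encryption=True,
        credentials="restrictive keys",
        investigator_pool="very small pool of specifically IRB-approved investigators",
    ),
    "Quasi-de-identified": None,   # terms not stated in the abstract
    "LDS": None,                   # terms not stated in the abstract
    "Aggregate-data-only": None,   # terms not stated in the abstract
    "Obfuscated-data": Contract(
        disk_encryption=False,
        credentials="simple passwords",
        investigator_pool="cross-institutional investigators",
    ),
}


def requires_disk_encryption(level: str) -> bool:
    """Look up whether a level's contract calls for full disk encryption."""
    contract = CONTRACTS[level]
    if contract is None:
        raise ValueError(f"No contract terms recorded for level {level!r}")
    return contract.disk_encryption
```

Raising an error for the unspecified levels, instead of defaulting to either extreme, keeps the table from implying safeguards that the abstract does not describe.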