Development and validation of a pulmonary function test data extraction tool for the US department of veterans affairs electronic health record

Rabin, Alexander S.; Weinstein, Julien B.; Seelye, Sarah M.; Whittington, Taylor N.; Hogan, Cainnear K.; Prescott, Hallie C.

doi:10.1186/s13104-024-06770-3

Research Note
Open access
Published: 23 April 2024

BMC Research Notes volume 17, Article number: 115 (2024)

Abstract

Objective

Pulmonary function test (PFT) results are recorded variably across hospitals in the Department of Veterans Affairs (VA) electronic health record (EHR), using both unstructured and semi-structured notes. We developed and validated a hospital-specific code to extract pre-bronchodilator measures of obstruction (ratio of forced expiratory volume in one second [FEV₁] to forced vital capacity [FVC]) and severity of obstruction (percent predicted of FEV₁).

Results

Among 36 VA facilities with the most PFTs completed between 2018 and 2022 from a parent cohort of veterans receiving long-acting controller inhalers, 12 had a consistent syntactical convention or template for reporting PFT data in the EHR. Of the 42,718 PFTs identified from these 12 facilities, the hospital-specific text processing pipeline yielded 24,860 values for the FEV₁:FVC ratio and 23,729 values for FEV₁. A ratio of FEV₁:FVC less than 0.7 was identified in 17,615 of 24,922 studies (70.7%); 8864 of 24,922 (35.6%) had a severe or very severe reduction in FEV₁ (< 50% of the predicted value). Among 100 randomly selected PFT reports reviewed by two pulmonary physicians, the coding solution correctly identified the presence of obstruction in 99 out of 100 studies and the degree of obstruction in 96 out of 100 studies.

Introduction

Pulmonary function tests (PFTs) are an essential tool for the assessment of lung disease severity and outcomes in the United States military veteran population. However, PFT reporting in the Department of Veterans Affairs’ (VA) electronic health record (EHR) most commonly occurs in an unstructured or semi-structured format that complicates quantitative and qualitative data abstraction and analysis [1]. Prior efforts to extract PFT values (including forced expiratory volume in one second [FEV₁], forced vital capacity [FVC], and the ratio of FEV₁:FVC) from VA EHR data sources using natural language processing techniques [1] or automated tools such as a structured query language (SQL) full-text search [2] have focused on select populations [3] or have been limited to measures of FEV₁ alone [2].

To build on these previously reported methodologies for VA EHR abstraction, we identified VA facilities with a high volume of PFTs performed and applied a site-specific data extraction approach for both quantitative and qualitative reporting of the FEV₁:FVC ratio and FEV₁ severity. We then conducted a validation of the abstraction technique, comparing the programming output to manual PFT classification performed by two pulmonary physician adjudicators.

Methods

Procedure coding and extraction of notes

PFTs were identified by relevant Current Procedural Terminology codes (94010, 94375, 94060, 94726, 94727, 94729, 94150) for procedures occurring among a cohort of veterans receiving long-acting controller inhalers between January 1, 2018 and December 31, 2022 [4]. Inpatient and outpatient clinical notes from days − 1 to + 21 relative to the PFT date of service were extracted from the VA Corporate Data Warehouse (CDW) [5], a central EHR data repository, using Microsoft SQL Server Management Studio via the VA Informatics and Computing Infrastructure for analysis.
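The selection logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's extraction code: the function names and the in-memory representation are hypothetical, and the actual extraction was performed against CDW with SQL. Only the CPT codes and the day − 1 to day + 21 window come from the text.

```python
from datetime import date, timedelta

# CPT codes listed in the Methods for identifying PFT procedures.
PFT_CPT_CODES = {"94010", "94375", "94060", "94726", "94727", "94729", "94150"}


def is_pft_procedure(cpt_code: str) -> bool:
    """Return True if the CPT code is one of the PFT codes named in the text."""
    return cpt_code.strip() in PFT_CPT_CODES


def note_in_window(note_date: date, pft_date: date,
                   days_before: int = 1, days_after: int = 21) -> bool:
    """Return True if a note falls within days -1 to +21 of the PFT date.

    Both endpoints are treated as inclusive, which is an assumption; the
    text does not say how boundary days were handled.
    """
    start = pft_date - timedelta(days=days_before)
    end = pft_date + timedelta(days=days_after)
    return start <= note_date <= end
```

For a PFT performed on 2 March 2020, this window would admit notes dated 1 March through 23 March 2020.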

Identification of semi-structured or unstructured PFT notes containing the FEV₁ variable

We identified VA facilities with the most PFTs performed during the study period. Among the 36 VA facilities with the most PFTs completed, we assessed the proportion of completed PFTs that had a likely PFT report in the EHR, as evidenced by a clinical note containing the term FEV1. Among facilities with > 80% of PFTs completed having an associated FEV1-containing note, we manually reviewed a random sample of up to 100 notes in the Joint Longitudinal Viewer (JLV) to determine whether a consistent approach or template was employed in the semi-structured reporting of PFTs. Unstructured reports (e.g., PFT results reported in physician progress notes) were included only if the notes followed a consistent pattern in the random sample. Reports containing qualitative descriptors of FEV₁ (e.g., “the FEV₁ is normal”) were included.
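The facility screening step can be illustrated with a short Python sketch. The input structure (one tuple per PFT, pairing a facility identifier with the texts of its associated notes) and the function name are hypothetical, and the case-insensitive match on the literal string "FEV1" is an assumption about how the term was searched. The Methods state a threshold of > 80% while the Results state ≥ 80%; the strict inequality is used here.

```python
from collections import defaultdict


def facilities_meeting_threshold(pfts, threshold=0.80):
    """Return {facility: proportion} for facilities above the threshold.

    pfts: iterable of (facility_id, list_of_note_texts), one entry per PFT.
    A PFT counts as having a likely report if any associated note contains
    the term "FEV1" (case-insensitive; an assumption for this sketch).
    """
    totals = defaultdict(int)
    with_fev1 = defaultdict(int)
    for facility, notes in pfts:
        totals[facility] += 1
        if any("fev1" in note.lower() for note in notes):
            with_fev1[facility] += 1
    proportions = {f: with_fev1[f] / totals[f] for f in totals}
    return {f: p for f, p in proportions.items() if p > threshold}
```

Facilities passing this screen would still need the manual template review described above; the proportion alone does not show that reports follow a consistent format.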

Creation of a data extraction tool

After identifying high-volume facilities with a consistent approach to PFT reporting in CDW, we developed facility-specific code to extract select PFT results. Each facility-specific code used the following steps: First, the code identified templated PFT result notes based on standard phrasing delineating the start of PFT results. Second, the code extracted a snippet of text from the note (up to 150 characters before and 1000 characters after the appearance of the standard phrase, as shown in Supplementary Fig. 1). Third, the snippet was processed through regular expression pattern-matching functions to extract the following variables: FEV₁, FEV₁ percent predicted, FEV₁:FVC ratio, and qualitative reporting descriptors of the PFT results. All coding was performed in Python [6].
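The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the published facility-specific code: the anchor phrase, the note layout, and both regular expressions are invented for a hypothetical template, and each real facility would need its own patterns. The sketch also assumes that the 1000-character window is measured from the start of the anchor phrase and that a ratio reported as a percentage (e.g., 60%) should be rescaled to a fraction.

```python
import re

# Hypothetical patterns for an invented template such as:
#   FEV1: 1.85 L (55% predicted)
#   FEV1/FVC: 60%
FEV1_RE = re.compile(
    r"FEV1\s*:\s*(\d+(?:\.\d+)?)\s*L\s*\(\s*(\d{1,3})\s*%\s*predicted\)",
    re.IGNORECASE,
)
RATIO_RE = re.compile(r"FEV1\s*/\s*FVC\s*:\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_snippet(note, anchor, before=150, after=1000):
    """Return up to `before` characters preceding the anchor phrase and
    `after` characters from its start, or None if the anchor is absent."""
    idx = note.find(anchor)
    if idx == -1:
        return None
    return note[max(0, idx - before): idx + after]


def parse_snippet(snippet):
    """Apply the hypothetical patterns and return the extracted values."""
    result = {"fev1_liters": None, "fev1_pct_pred": None, "fev1_fvc_ratio": None}
    match = FEV1_RE.search(snippet)
    if match:
        result["fev1_liters"] = float(match.group(1))
        result["fev1_pct_pred"] = float(match.group(2))
    match = RATIO_RE.search(snippet)
    if match:
        ratio = float(match.group(1))
        # Assumption: values above 1 are percentages and are rescaled.
        result["fev1_fvc_ratio"] = ratio / 100 if ratio > 1 else ratio
    return result
```

Keeping the anchor search separate from the value patterns means a new facility template would mostly require a new anchor phrase and new expressions, while the snippet logic stays unchanged.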

Definitions of spirometric obstruction and FEV₁ impairment

Obstruction (present or absent) was defined by a threshold of FEV₁:FVC ratio < 0.7, as suggested by the 2023 Global Initiative for Chronic Obstructive Lung Disease guidelines [7]. FEV₁ results were mapped to pre-specified percent predicted values for severity as defined by the 2005 American Thoracic Society/European Respiratory Society (ATS/ERS) PFT interpretation guidelines [8]: normal (FEV₁ ≥ 80% predicted), mildly reduced (FEV₁ 70–79% predicted), moderately reduced (FEV₁ 60–69% predicted), moderately-severely reduced (FEV₁ 50–59% predicted), severely reduced (FEV₁ 35–49% predicted), and very severely reduced (FEV₁ < 35% predicted). Only pre-bronchodilator values were included.
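These definitions translate directly into two small functions. The sketch below is illustrative; the function names are hypothetical, and treating the bands as continuous (so that a value such as 79.5% falls in the mildly reduced category) is an assumption, since the text lists integer ranges.

```python
def has_obstruction(fev1_fvc_ratio: float) -> bool:
    """Obstruction is present when the FEV1:FVC ratio is below 0.7."""
    return fev1_fvc_ratio < 0.7


def fev1_severity(pct_predicted: float) -> str:
    """Map FEV1 percent predicted to the severity categories in the text."""
    if pct_predicted >= 80:
        return "normal"
    if pct_predicted >= 70:
        return "mildly reduced"
    if pct_predicted >= 60:
        return "moderately reduced"
    if pct_predicted >= 50:
        return "moderately-severely reduced"
    if pct_predicted >= 35:
        return "severely reduced"
    return "very severely reduced"
```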

Quantitative over qualitative reporting

For the classification of both spirometric obstruction (i.e., FEV₁:FVC ratio) and FEV₁ severity, quantitative values were prioritized over qualitative descriptions. However, if the quantitative value was not available, then the qualitative descriptor (e.g., “mild obstruction”) was used.
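The prioritization rule can be expressed as a simple fallback. In the sketch below, the two regular expressions used to read a qualitative descriptor are hypothetical and deliberately crude; they stand in for whatever facility-specific phrase matching was actually used.

```python
import re
from typing import Optional

# Hypothetical descriptor patterns; real notes would need broader coverage.
NEGATED_RE = re.compile(
    r"\b(no|without)\s+(evidence\s+of\s+)?(airflow\s+)?obstruct", re.IGNORECASE
)
OBSTRUCT_RE = re.compile(r"obstruct", re.IGNORECASE)


def classify_obstruction(ratio: Optional[float],
                         descriptor: Optional[str]) -> Optional[bool]:
    """Prefer the quantitative FEV1:FVC ratio; fall back to the descriptor.

    Returns True or False for obstruction present or absent, or None when
    neither source yields a classification.
    """
    if ratio is not None:
        return ratio < 0.7
    if descriptor:
        if NEGATED_RE.search(descriptor):
            return False
        if OBSTRUCT_RE.search(descriptor):
            return True
    return None
```

With this ordering, a snippet reporting a ratio of 0.82 alongside the phrase "mild obstruction" would be classified as not obstructed, because the numeric value takes precedence.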

Iterative coding process followed by final validation

A random sample of 100 PFT note snippets was selected and reviewed by a pulmonary physician (A.R.) to identify potential coding errors. When the interpretation of the snippet was unclear to the reviewer, an attempt was made to review the original PFT report in the Joint Longitudinal Viewer (JLV), a clinical application allowing read-only access to health data across the VA health system [9]. This iterative process, cross-referencing 100 snippets at a time, was repeated three times to refine the data extraction code.

A random sample of 100 snippets interpreted by the coding solution was used for validation. Two pulmonary physicians (A.R. and H.P.), blinded to the algorithm’s extraction results and to each other’s adjudication decisions, manually recorded the presence or absence of spirometric obstruction and the severity of impairment in FEV₁ using the previously described criteria. Differences in adjudication were discussed, and consensus was determined in all cases. The consensus adjudications were then compared to the programming output to assess accuracy.
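The final comparison amounts to the proportion of snippets on which the algorithm and the physician consensus agree. A minimal sketch, with a hypothetical function name and list-based inputs:

```python
def agreement(algorithm_labels, consensus_labels):
    """Return the fraction of items on which the algorithm's label equals
    the consensus adjudication. Labels may be any comparable values,
    including None for a missing result."""
    if len(algorithm_labels) != len(consensus_labels):
        raise ValueError("label lists must have the same length")
    if not algorithm_labels:
        raise ValueError("at least one labelled item is required")
    matches = sum(a == c for a, c in zip(algorithm_labels, consensus_labels))
    return matches / len(algorithm_labels)
```

Because `None` compares equal to `None`, a snippet where both the algorithm and the reviewers record a missing FEV₁ counts as agreement under this sketch.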

Results

VA facility selection for the extraction of PFT reports

Among 347,578 patients receiving long-acting controller inhalers, a total of 258,903 individual PFT studies were identified from 366 VA facilities (Fig. 1). Of these, 9364 (3.6%) studies containing structured FEV₁ results in CDW Raw were excluded, leaving 249,539 PFTs from 360 facilities for analysis. Of the 36 VA facilities with the most PFT reports, 13 facilities had FEV₁-containing notes for ≥ 80% of PFTs and underwent further manual review to assess the existence of a standard PFT note template (Fig. 2). One facility with non-standardized reporting of FEV₁ was excluded.

Data extraction

Among 42,718 PFT studies from the 12 included facilities, 27,738 PFTs contained an FEV₁-templated note. A total of 24,860 values for FEV₁:FVC ratio and 23,729 values for FEV₁ severity were obtained. The yield of extraction of the FEV₁:FVC ratio and FEV₁ values ranged widely across facilities, from 14% for both variables in Facility AJ to 93% for both variables in Facility AD (Supplementary Table 1). The classification of spirometric obstruction and the degree of obstruction are shown in Table 1.
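The overall yields implied by these counts can be computed directly. The sketch below uses only the counts reported above; the percentages it produces are derived here rather than quoted from the article, and the function and variable names are hypothetical.

```python
def yield_pct(extracted: int, total: int) -> float:
    """Return the extraction yield as a percentage of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * extracted / total


# Counts reported in the text for the 12 included facilities.
TOTAL_PFTS = 42_718
TEMPLATED_NOTES = 27_738
RATIO_VALUES = 24_860
FEV1_VALUES = 23_729

yields = {
    "templated_note": yield_pct(TEMPLATED_NOTES, TOTAL_PFTS),
    "fev1_fvc_ratio": yield_pct(RATIO_VALUES, TOTAL_PFTS),
    "fev1_severity": yield_pct(FEV1_VALUES, TOTAL_PFTS),
}

for name, pct in yields.items():
    print(f"{name}: {pct:.1f}%")
```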

Table 1 Classification of obstruction (reduced FEV1:FVC ratio) and degree of FEV1 impairment among PFTs with extracted results

Validation cohort

Among the 100 PFT reports selected for validation, the algorithm correctly graded the presence of obstruction in 99 out of 100 studies. In the same validation cohort, the algorithm correctly assigned FEV₁ severity, including correctly determining missing FEV₁, in 96 out of 100 studies.

Discussion

Access to high-quality lung function data is of paramount importance as the VA seeks to characterize the burden of chronic respiratory disease [10] and explore the long-term effects of airborne hazard exposure on respiratory health [11]. Here we describe a pattern-matching text processing technique for the extraction of semi-structured or unstructured values of FEV₁:FVC and FEV₁ from EHR data in a general VA population. This approach could be applied more broadly to the extraction of other PFT variables of interest, including measures of diffusion impairment, lung volumes, or bronchodilator response.

Several prior studies have reported automated extraction of PFT variables from unstructured or semi-structured VA EHR data [1,2,3]. Using a two-step text processing approach, Akgün et al. showed a high degree of accuracy in the extraction of FEV₁ values alone from VA progress notes (positive predictive value 99%, 95% confidence interval, 98.2 to 100%) in the Veterans Aging Cohort Study [2]. Another technique using natural language processing to extract FVC values from VA EHR data was accurate, but less applicable to the PFT reporting conventions in CDW beyond the facilities for which it was designed [1].

As the focus of our parent study was on clinical outcomes from inhaler device switching [5], we sought to develop extraction code to best assess for the presence of spirometric obstruction and severity of obstruction while accounting for substantial variability in facility-to-facility PFT reporting conventions. Our approach involved an extensive filtering step that reduced the number of PFT studies supplied to the text mining pipeline; however, the inclusion of longer text snippets and the use of regular expressions in our algorithm allowed for more specific text-matching rules than in previously described methods. The precision afforded by regular expressions to generate customized text processing and extraction procedures at the level of the VA facility resulted in a high degree of accuracy in FEV₁ value identification.

Strengths of our approach include the use of an adaptable extraction code that enables the text processing of diverse PFT templates across hospitals. These features build on prior methodologies by first identifying high-volume facilities and then refining facility-specific code, thus increasing both the yield and accuracy of the data output. The extensive validation process, involving physician review of hundreds of note snippets cross-referenced with primary EHR data, gives added assurance of data output quality.

In conclusion, this iterative, validated text mining approach for the extraction of PFT data may aid researchers aiming to study pulmonary function housed in unstructured or semi-structured VA data sources.

Limitations

Our methodology has a number of limitations. First, the code was able to extract FEV₁ and FEV₁:FVC ratio for only a minority of PFTs completed in the parent cohort. We developed facility-specific code to increase the yield, but were ultimately limited by the low prevalence of templated notes reporting PFT results. Second, the code was not trained to partition PFT results by date, such as when multiple studies were listed sequentially in a single snippet. In the course of validation, though, this appeared to be an infrequent occurrence. Third, the approach was time- and labor-intensive, requiring multiple revisions and validations in order to maximize data extraction yield and accuracy. Application of the open-source code presented herein could accelerate future efforts to successfully extract PFT variables.

Data availability

No datasets were generated or analysed during the current study. Programming code is available at CCMRPulmCritCare/PFTTextMining (github.com).

References

1. Sauer BC, Jones BE, Gary Globe, Leng J, Lu C-C, He T et al. Performance of an NLP Tool to extract PFT reports from structured and semi-structured VA data. EGEMS (Wash, DC). 2016;4:10.
2. Akgün KM, Sigel K, Cheung K-H, Kidwai-Khan F, Bryant AK, Brandt C, et al. Extracting lung function measurements to enhance phenotyping of chronic obstructive pulmonary disease (COPD) in an electronic health record using automated tools. PLoS ONE. 2020;15:e0227730.
3. England BR, Roul P, Yang Y, Hershberger D, Sayles H, Rojas J, et al. Extracting forced vital capacity from the electronic health record through natural language processing in rheumatoid arthritis-associated interstitial lung disease. Pharmacoepidemiol Drug Saf. 2023. https://doi.org/10.1002/pds.5744.
4. Corporate Data Warehouse (CDW). 2023. https://www.hsrd.research.va.gov/for_researchers/vinci/cdw.cfm. Accessed 14 Feb 2024.
5. IIR 23–032– HSR study. 2024. https://www.hsrd.research.va.gov/research/abstracts.cfm?Project_ID=2141709943. Accessed 27 Feb 2024.
6. Van Rossum G. The Python Library Reference, release 3.8.2. Python Software Foundation. 2020.
7. https://goldcopd.org/wp-content/uploads/2023/03/GOLD-2023-ver-1.3-17Feb2023_WMV.pdf. Accessed 18 Feb 2024.
8. Pellegrino R. Interpretative strategies for lung function tests. Eur Respir J. 2005;26:948–68.
9. https://vaww.vhadataportal.med.va.gov/Tools-Applications/JLV. Accessed 18 Feb 2024.
10. Bamonti PM, Robinson SA, Wan ES, Moy ML. Improving physiological, physical, and psychological health outcomes: a narrative review in US veterans with COPD. Int J Chron Obstruct Pulmon Dis. 2022;17:1269–83.
11. https://nap.nationalacademies.org/resource/25837/Gulf. Accessed 18 Feb 2024.

Acknowledgements

None.

Funding

The study was supported by VA Investigator Initiated Research 23–032 “Veterans Affairs Study of a Real-World Inhaler Delivery Device Transition on Climate and Health Outcomes (VA-SWITCH).”

Author information

Alexander S. Rabin M.D. and Julien B. Weinstein M.S. contributed equally to this work.

Authors and Affiliations

Pulmonary Section, Veterans Affairs Ann Arbor Healthcare System, 2215 Fuller Road, 48105, Ann Arbor, MI, USA
Alexander S. Rabin & Hallie C. Prescott
Division of Pulmonary and Critical Care Medicine, University of Michigan, Ann Arbor, MI, USA
Alexander S. Rabin, Julien B. Weinstein & Hallie C. Prescott
Veterans Affairs Center for Clinical Management Research, Ann Arbor, MI, USA
Julien B. Weinstein, Sarah M. Seelye, Taylor N. Whittington, Cainnear K. Hogan & Hallie C. Prescott

Contributions

J.W. wrote the programming code and prepared the figures. A.R. wrote the first draft of the manuscript. T.W. and C.H. assisted in data acquisition. A.R. and H.P. performed the validation. All authors contributed to the study design and reviewed the manuscript.

Corresponding author

Correspondence to Alexander S. Rabin.

Ethics declarations

Ethics approval and consent to participate

The study was reviewed by the VA Ann Arbor Institutional Review Board and was deemed exempt from the need for consent under 45 CFR § 46, category 4.

Consent for publication

Not applicable.

Disclaimer

The views expressed may not represent the position of the U.S. Department of Veterans Affairs or the U.S. government.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Rabin, A.S., Weinstein, J.B., Seelye, S.M. et al. Development and validation of a pulmonary function test data extraction tool for the US department of veterans affairs electronic health record. BMC Res Notes 17, 115 (2024). https://doi.org/10.1186/s13104-024-06770-3

Received: 07 March 2024
Accepted: 10 April 2024
Published: 23 April 2024
DOI: https://doi.org/10.1186/s13104-024-06770-3
