Abstract
Background Deep learning (DL), a subset of artificial intelligence (AI), has been applied to imaging-based pneumothorax diagnosis to support physicians, but no meta-analysis of this application has been performed.
Methods A search of multiple electronic databases through September 2022 was performed to identify studies that applied DL to pneumothorax diagnosis using imaging. A hierarchical-model meta-analysis was performed to calculate the summary area under the curve (AUC) and pooled sensitivity and specificity for both DL and physicians. Risk of bias was assessed using a modified Prediction Model Study Risk of Bias Assessment Tool.
Results In 56 of the 63 primary studies, pneumothorax was identified from chest radiography. The total AUC was 0.97 (95% CI 0.96–0.98) for both DL and physicians. The total pooled sensitivity was 84% (95% CI 79–89%) for DL and 85% (95% CI 73–92%) for physicians, and the pooled specificity was 96% (95% CI 94–98%) for DL and 98% (95% CI 95–99%) for physicians. More than half of the included studies (57%) had a high risk of bias.
Conclusions Our review found that the diagnostic performance of DL models was similar to that of physicians, although the majority of studies had a high risk of bias. Further research on AI for pneumothorax is needed.
Shareable abstract
In this, the first systematic review and meta-analysis of pneumothorax diagnostic AIs, physicians and AI models showed comparable performance in diagnosing pneumothorax from chest radiographs. https://bit.ly/3JZeGN4
Introduction
Pneumothorax is defined as the presence of air in the pleural space, i.e. the space between the lungs and the chest wall [1, 2]. Pneumothorax is a common disease in the population, with an incidence of primary spontaneous pneumothorax of 7.4/100 000 per year in men and 1.2/100 000 per year in women and an incidence of secondary spontaneous pneumothorax of 6.3/100 000 per year in men and 2.0/100 000 per year in women [3]. In contrast to the benign clinical course of primary spontaneous pneumothorax, secondary spontaneous pneumothorax is a potentially life-threatening event [2]. Additionally, the recurrence rate is high: ∼30% in primary spontaneous pneumothorax and ∼40% in secondary spontaneous pneumothorax [4–7]. Pneumothorax is one of the conditions the American College of Radiology recommends should be communicated to the physician within minutes to avoid patient decompensation [8].
Chest radiography is the simplest and most common examination for pneumothorax [9, 10], and the diagnosis is usually made in conjunction with the patient's history and clinical presentation [2]. Although errors or delays in diagnosis can harm the patient, the signs of pneumothorax on chest radiography are subtle, and up to 20% of occult pneumothoraces are missed on examination [11]. One proposed reason is that the reading workload far exceeds radiologist capacity [12, 13]; computer-based approaches have therefore been developed to assist physicians in their daily work and are expected to help prevent missed cases.
Deep learning (DL) is a field of artificial intelligence (AI) that has advanced tremendously in medical imaging [14, 15], and with it the number of certified medical devices available for clinical practice has been increasing [16]. DL is formally defined as a computational model that consists of multiple processing layers and learns representations of data with multiple levels of abstraction [17]. Given raw data, DL develops the representations needed for pattern recognition on its own and does not require domain expertise to design data structures or feature extractors [14, 17]. Because DL learns the features important for classification on its own, rather than being directed by a human, it requires careful bias assessment and the accumulation of original articles for model training and evaluation [18].
This study is a systematic review and meta-analysis of 63 studies applying DL to pneumothorax diagnosis, comparing the diagnostic performance of DL and physicians for each modality. Studies in which physicians' diagnostic performance was supported by DL are examined separately. To date, no meta-analysis of DL-based pneumothorax diagnosis has been published.
Methods
Study registration and guidelines
This systematic review was prospectively registered with PROSPERO (CRD42022351985). Our study followed the guidelines of the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) [19, 20]. Two authors performed all screening, data collection, applicability assessments and bias assessments in duplicate (T.S. and D.U.), and a third independent reviewer was consulted in the event of a disagreement (T.M.).
Search strategy and study selection
The search strategy for identifying studies that developed and/or validated a DL model for pneumothorax diagnosis was developed with an information specialist: original articles that included the words (or variations of) "artificial intelligence", "deep learning" or "neural networks" together with the word "pneumothorax" were included. Peer-reviewed studies in any language from inception to September 2022 were evaluated from the following databases: MEDLINE, Scopus, Web of Science, the Cochrane Central Register of Controlled Trials (CENTRAL) and IEEE Xplore (Institute of Electrical and Electronics Engineers/Institution of Engineering and Technology). Titles and abstracts were screened prior to full-text screening. Studies were included if they were primary research studies of pneumothorax diagnosis in humans that developed and/or validated a DL model; any target population, study setting or comparator group was eligible. Studies were excluded if they were conference abstracts or proceedings, letters to the editor, review articles, or segmentation- or detection-only studies. Excluded studies, including the reason for exclusion, were recorded in a PRISMA flow diagram (figure 1) [20].
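As an illustration only, the core Boolean logic might be rendered in a MEDLINE-style query as follows; this is a schematic reconstruction, not the registered search strategy.

    ("artificial intelligence" OR "deep learning" OR "neural network*")
    AND pneumothorax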
Data extraction
We extracted information including study design, sample size, comparator groups and numerical results into a predefined data sheet. Contingency tables were constructed using the available diagnostic performance information for each model and were used to calculate the summary area under the curve (AUC) and pooled sensitivity and specificity. All available contingency tables were included in the meta-analysis. The datasets involved in the development of a model were defined as the training set (for training the model), the tuning set (for tuning hyperparameters) and the validation test set (for estimating the performance of the model) [21].
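Per-study sensitivity and specificity follow directly from each 2×2 contingency table; a minimal sketch in R with hypothetical counts (the sens_spec helper is ours for illustration, not part of the extraction pipeline):

    # Per-study sensitivity and specificity from one 2x2 contingency table.
    # TP/FP/FN/TN counts are hypothetical, for illustration only.
    sens_spec <- function(TP, FP, FN, TN) {
      c(sensitivity = TP / (TP + FN),  # true-positive rate
        specificity = TN / (TN + FP))  # true-negative rate
    }
    sens_spec(TP = 120, FP = 8, FN = 20, TN = 300)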
Statistical analysis
We estimated the diagnostic performance of both the DL models and physicians by carrying out a random-effects meta-analysis of studies providing internal and external validation contingency tables [22]. These contingency tables were used to construct hierarchical summary receiver operating characteristic (ROC) curves and to calculate pooled sensitivities and specificities, anticipating a high level of heterogeneity [23]. Between-study heterogeneity was represented using the 95% prediction region of the hierarchical summary ROC curves. Statistical significance was defined as a p-value of <0.05. All calculations were performed using R version 4.0.0 with the metafor and meta4diag libraries [24].
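A minimal sketch of such a bivariate random-effects analysis in R, assuming the meta4diag package (which depends on INLA) and hypothetical study counts; this illustrates the approach and is not the authors' analysis script:

    library(meta4diag)  # Bayesian bivariate meta-analysis of diagnostic accuracy (requires INLA)

    # Hypothetical 2x2 counts, one row per contingency table
    tables <- data.frame(
      studynames = c("Study A", "Study B", "Study C"),
      TP = c(120, 45, 210), FP = c(8, 3, 15),
      FN = c(20, 9, 35),    TN = c(300, 150, 700)
    )

    fit <- meta4diag(data = tables)  # bivariate random-effects model
    summary(fit)                     # pooled sensitivity and specificity
    SROC(fit)                        # hierarchical summary ROC curve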
Quality assessment
The Prediction Model Study Risk of Bias Assessment Tool (PROBAST) was used to assess the bias and applicability of the included studies [18]. This tool evaluates bias across four domains (participants, predictors, outcomes and analysis), which are then combined into an overall assessment. Our assessment of bias and applicability in the first domain was based on both the images used to develop the models and the patient population the models were tested on. We did not include domain 2 (predictors) in the assessment of bias or applicability. Details of modifications made to PROBAST are provided in supplementary table S1.
Publication bias
Publication bias was assessed using the effective sample size funnel plot described by Egger et al. [25].
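One way to approximate an effective-sample-size funnel plot and asymmetry test in R with metafor, assuming the log diagnostic odds ratio as the effect size; this is a sketch under those assumptions, not the authors' code (tables is the hypothetical data frame from the meta-analysis sketch above):

    library(metafor)

    # Log diagnostic odds ratio: with ai = TP, bi = FN, ci = FP, di = TN,
    # escalc() returns yi = log((TP * TN) / (FN * FP))
    dat <- escalc(measure = "OR", ai = TP, bi = FN, ci = FP, di = TN, data = tables)

    # Effective sample size: 4 * n1 * n0 / (n1 + n0),
    # where n1 = diseased (TP + FN) and n0 = non-diseased (FP + TN)
    n1 <- tables$TP + tables$FN
    n0 <- tables$FP + tables$TN
    dat$ESS <- 4 * n1 * n0 / (n1 + n0)

    res <- rma(yi, vi, ni = ESS, data = dat)  # random-effects model
    funnel(res)                               # funnel plot
    regtest(res, predictor = "sqrtninv")      # asymmetry test using 1/sqrt(ESS)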
Results
Study selection and characteristics
We identified 532 studies, of which 255 were duplicates. After screening, 63 studies were included in the systematic review and 32 studies were included in the meta-analysis (figure 1 and table 1). Among the 63 studies, 56 studies identified pneumothorax on chest radiography [26–81], four studies on computed tomography [82–85], one study on ECG [86], one study used chest radiography and photography using a smartphone [87], and one study used chest radiography and tabular data [88]. Six studies developed and internally tuned DLs [37, 52, 63, 67, 74, 76], 25 studies also internally tested their DLs [32, 33, 35, 38, 40, 41, 43, 45, 47, 48, 50, 55, 60, 65, 69, 70, 73, 75, 79–83, 85, 86] and 32 studies externally tested the DLs [26–31, 34, 36, 39, 42, 44, 46, 49, 51, 53, 54, 56–59, 61, 62, 64, 66, 68, 71, 72, 77, 78, 84, 87, 88].
Five studies compared the performance of DL with physicians: two studies compared DL with experts (not a resident or technologist) [42, 83], two studies compared DL with both experts and non-experts [43, 68] and one study compared DL with non-experts only [34]. Two studies compared the performance of DL with radiology reports written in daily clinical practice [42, 56]. Two studies included physician performance with and without DL assistance as a comparison group [44, 68]. Detailed physician characteristics are shown in supplementary table S4.
As for model development, to generate a reference standard for image labelling, 18 studies used expert consensus [27–33, 35–38, 49, 53–55, 71, 77, 83], two relied on the opinion of a single expert reader [76, 85], 16 used pre-existing radiological reports or other imaging modalities [34, 41, 43, 45, 46, 52, 60, 61, 67, 75, 78–82, 87], one study defined their reference standard as surgical confirmation (indicated for surgery) [86], 11 studies used mixed methods (any combination of the aforementioned) [40, 47, 48, 50, 51, 62, 63, 65, 69, 70, 73] and two studies did not report how their reference standard was generated [74, 88]. As for model testing, to generate a reference standard for image labelling, 26 studies used expert consensus [26–28, 30–33, 38, 39, 44, 51, 54–57, 61, 64, 66, 68, 71–73, 77, 80, 83, 84], two relied on the opinion of a single expert reader [58, 85], 11 used pre-existing radiological reports or other imaging modalities [35, 40, 41, 48, 50, 60, 79, 81, 82, 87, 88], one study defined their reference standard as surgical confirmation (indicated for surgery) [86], 12 studies used mixed methods (any combination of the aforementioned) [29, 34, 36, 42, 43, 46, 47, 49, 53, 59, 65, 69] and five studies did not report how their reference standard was generated [45, 62, 70, 75, 78].
Study participants
There was large variation in the number of participants represented by each dataset (median (interquartile range (IQR)) 5288 (516–30 805); range 100–538 390) (supplementary table S2). The proportion of participants with pneumothorax in each dataset also ranged widely (median (IQR) 17.2% (10.8–25.0%)). 23 studies did not describe the sex of the study participants [27, 31–33, 36–38, 55, 59, 62, 65, 69–71, 73, 76–78, 81, 82, 86–88] and 24 studies did not include age information [27, 31–33, 36–38, 55, 59, 62, 65, 69–71, 73, 74, 76–78, 81, 82, 86–88]. Detailed dataset characteristics are shown in supplementary table S2.
Model development
The size of the training (median (IQR) 17 265 (8540–86 524)), tuning (median (IQR) 1598 (924–3468)) and test (median (IQR) 1684 (575–3107)) datasets at the patient level varied widely (table 1). Two out of 50 (4%) studies that developed a model did not report the size of each dataset separately [40, 69]. In studies that performed external model validation, the median dataset size was 1137 (range 175–112 120). 17 studies included localisation of pneumothorax in model output to improve end-user interpretability [26–28, 30–33, 36–38, 40, 47, 56, 59, 68, 84, 85]. Detailed DL characteristics are shown in supplementary table S3.
Quality assessment
PROBAST assessment led to an overall rating of high risk of bias for 36 (57%) studies (figure 2). The main contributors to this rating were lack of external validation and internal validation with small sample sizes. Five (8%) studies were judged to be at high risk of bias in the participant domain because of their inclusion and exclusion criteria.
Meta-analysis
We extracted 89 contingency tables from the 32 studies that provided sufficient information to construct contingency tables for pneumothorax classification [27–36, 38, 39, 42–46, 48, 51, 53–56, 58, 61, 68, 70, 71, 75, 78, 80, 87]: 68 tables for reported DL performance and 21 for physician performance. Hierarchical summary ROC curves for the studies evaluating DL or physician performance are shown in figure 3. The total AUC was 0.97 (95% CI 0.96–0.98) for DL and 0.97 (95% CI 0.96–0.98) for physicians. The total pooled sensitivity was 84% (95% CI 79–89%) for DL and 85% (95% CI 73–92%) for physicians, and the pooled specificity was 96% (95% CI 94–98%) for DL and 98% (95% CI 95–99%) for physicians (table 2). Two studies reported physician performance with DL assistance [44, 68]: one showed no significant difference in specificity but a moderate increase in sensitivity and an increase in accuracy, while the other showed no significant difference in sensitivity or specificity but a slight increase in accuracy. With DL assistance, accuracy changed from 92–99% to 97–99%, sensitivity from 67–94% to 85–96% and specificity from 100% to 99–100% (table 3).
Publication bias
We assessed publication bias by regression analysis of the funnel plot to quantify asymmetry (supplementary figure S1) and it suggested a high risk of publication bias (p<0.05).
Discussion
In our meta-analysis of DL for pneumothorax diagnosis, the performance of DL and physicians was comparable. The total AUC was 0.97 (95% CI 0.96–0.98) for both DL and physicians. The total pooled sensitivity was 84% (95% CI 79–89%) for DL and 85% (95% CI 73–92%) for physicians, and the pooled specificity was 96% (95% CI 94–98%) for DL and 98% (95% CI 95–99%) for physicians. To the best of our knowledge, this article is the first systematic review and meta-analysis of DL for pneumothorax diagnosis.
We found data investigating two possible clinical uses of diagnostic DL for pneumothorax during our review: as a triage tool and as a second opinion. Although pneumothorax is regularly diagnosed in patients presenting to the emergency department, it is detected in a relatively small proportion of all radiographs performed. Hence, an easy and accurate screening tool is needed that may help prioritise patients coming to hospital. Indeed, one study reported a reduction in the reporting delay for pneumothorax [30] and two studies reported that reading times were shorter with AI assisting the physician than with the physician alone [26, 53]. Additionally, in intensive care units, chest radiographs are taken frequently and reading them is often labour intensive; AI support is expected to both speed up reading and reduce the total workload [89]. Two papers reported data on the use of DL to complement physicians' decision making [44, 68]. Their small number precludes meta-analysis, although one study showed no significant difference in specificity but a moderate increase in sensitivity and an increase in accuracy, while the other showed no significant difference in sensitivity or specificity (table 3). Although further data on the performance of physicians supported by DL are required, these two works support the results of this meta-analysis. There is one AI report on chest ultrasound, which is considered more sensitive than chest radiography, and this is a promising area for future research [90]. At present, AI models are most useful as a screening tool to determine the presence or absence of pneumothorax. They do not incorporate an individual patient's circumstances and other medical conditions into management decisions, especially regarding treatment and follow-up, as physicians currently must. Whether additional AIs to measure detailed features of the pneumothorax (e.g. size or evidence of tension) are warranted, and how they could best integrate patients' clinical details, will be subjects for future studies.
Confounding factors in images can create bias in diagnostic imaging DL. About 10% (six out of 63) of the articles included in this study mentioned chest tubes as a confounding factor, indicating that DL can recognise tubes and that this can introduce a strong bias. In other words, a DL model may recognise a chest drain as evidence of treatment for pneumothorax and diagnose pneumothorax on that basis. Although physicians may also use such medical device information to suspect the presence of pneumothorax, a DL model with such biases may perform poorly when diagnosing pneumothorax prior to intervention. Whatever the magnitude of these confounding effects, they should be taken into account when evaluating a model for clinical use to prevent harm to patients. One study reported that the influence of such confounders can be avoided during training by annotating the pneumothorax cavity [73]. External devices (e.g. chest tubes, central lines or indwelling pleural catheters) and patient features (e.g. skin folds and thickened pleura) may affect results; these aspects need to be investigated in future research.
In this study, we found that the majority of the included articles were at high risk of bias according to PROBAST [18]. One reason is that medical DL research sits at the intersection of medicine and engineering, each with its own conventions; our included articles comprised both medical and engineering papers. PROBAST is only one method of evaluating bias, from a medical perspective. For medical researchers, medical DL must first adhere to the "do no harm" principle for patients [91]; it is therefore important to evaluate medical DL in various validation settings, and this should be an important factor in preventing bias. On the other hand, DL, which benefits not only medicine but many other fields, is a product of the accumulated knowledge of engineering researchers. For engineering researchers, a key consideration is that the DL must outperform prior work, and the ingenuity required to achieve this can constitute the novelty. From this perspective, it is reasonable to develop and validate a DL model on the same open dataset as prior studies to demonstrate improved performance. Open datasets have contributed greatly to the development of DL: they are easy to use, results are highly reproducible and comparisons with previous studies are straightforward, even when the data are biased in a clinical sense. In addition, patient privacy issues may make it difficult to access each hospital's data. To make better clinical use of DL created through engineering research, medical researchers must verify biases from various perspectives, understand the characteristics of DL, and conduct research that benefits patients and reduces the daily clinical burden on physicians; medical and engineering researchers should cooperate and share roles to advance medical care. About half (32 out of 63) of the included papers were externally validated, which is among the most important factors in the evaluation of AI. The overall high risk of bias was largely driven by the analysis domain, in which 35 of the 63 papers were rated at risk of bias. A more refined analysis design would likely allow low risk of bias studies and a better understanding of pneumothorax diagnosis AI models.
The present study has several limitations. More than half (57%) of the included studies were classified as high risk of bias by PROBAST, limiting the conclusions that can be drawn from the meta-analysis. In addition, some papers lacked training and validation details, which contributed to the high risk of bias. Also, while it is reasonable for multiple studies to use the same large open database for training and validation when comparing model performance, actual clinical practice involves a variety of cohorts, which limits the applicability of our conclusions to substantially different populations. Furthermore, publication bias also affected the results of this study.
To provide better medical care to patients and reduce the burden on physicians, pneumothorax diagnosis DL and physicians may complement each other to improve the accuracy of pneumothorax diagnosis in clinical practice. DL will be used in various medical fields in the future; it is therefore important to build evidence by integrating individual original research and capturing overall characteristics through systematic review and meta-analysis.
Points for clinical practice
Use of AI as an adjunct to physicians’ diagnosis of pneumothorax may have potential benefits and deserves further exploration.
Questions for future research
How much does AI assistance improve physicians' performance in diagnosing pneumothorax?
To what extent do confounding factors inherent in chest radiographs impact pneumothorax diagnostic AIs?
Supplementary material
Please note: supplementary material is not edited by the Editorial Office, and is uploaded as it has been supplied by the author.
Supplementary material ERR-0259-2022.SUPPLEMENT
Footnotes
Provenance: Submitted article, peer reviewed.
Data availability: Study protocol and metadata are available from the corresponding author.
Conflict of interest: The authors have nothing to disclose.
- Received December 23, 2022.
- Accepted March 16, 2023.
- Copyright ©The authors 2023
This version is distributed under the terms of the Creative Commons Attribution Non-Commercial Licence 4.0. For commercial reproduction rights and permissions contact permissions@ersnet.org