First author [ref.] | Population/type of study | Source of data | Main findings | Reference standard/ground truth | Algorithm type | Datasets | Type of internal validation/availability of external validation |
Solid pulmonary nodules/masses | |||||||
Ardila [38] | CT chest of lung cancer screening patients/retrospective | NLST | Prediction of cancer risk based on CT findings. For the testing dataset, AUC of 94.4% (95% CI 91.1–97.3%). For the validation set, AUC of 95.5% (95% CI 88.0–98.4%) | Histology; follow-up | CNN | 42 290 CT images from 14 851 patients | Not reported/yes |
Baldwin [39] | CT chest of lung cancer screening patients/retrospective | The IDEAL study (Artificial Intelligence and Big Data for Early Lung Cancer Diagnosis) | The AUC for CNN was 89.6% (95% CI 87.6–91.5%), compared with 86.8% (95% CI 84.3–89.1%) for the Brock model (p≤0.005) | Histology; follow-up | CNN | 1397 nodules in 1187 patients | Not reported/yes |
Massion [40] | CT chest of lung cancer screening patients/retrospective | NLST for model derivation and internal validation/externally tested on cohorts from two academic institutions | The AUC for CNN was 83.5% (95% CI 75.4–90.7%) and 91.9% (95% CI 88.7–94.7%) on two different cohorts (Vanderbilt and Oxford University) | Histology; follow-up | CNN | 14 761 benign nodules from 5972 patients, and 932 malignant nodules from 575 patients | Not reported/yes |
Ciompi [41] | CT chest of lung cancer screening patients/retrospective | Training dataset from the Multicentric Italian Lung Detection trial and validation dataset from the Danish Lung Cancer Screening Trial | CNN can achieve performance at classifying nodule type within the interobserver variability among human experts (Cohen κ-statistics ranging from 0.58 to 0.65) | Expert consensus | CNN | 1352 nodules for training set and 453 nodules for validation set | Random split sample validation/yes |
Nam [42] | Chest radiographs to detect malignant nodules/retrospective | Analysis of data collected from Seoul National University Hospital, Boramae Hospital and National Cancer Center, University of California San Francisco Medical Center | Performance of the deep learning-based automatic detection algorithm for chest radiograph classification and nodule detection ranged from 0.92 to 0.99 (AUROC) | Expert consensus | CNN | 43 292 chest radiographs from 34 676 patients | Random split sample validation/yes |
Wang [43] | Mediastinal lymph node metastasis of NSCLC from 18F-FDG PET/CT images/retrospective | Data collected at the Affiliated Tumor Hospital of Harbin Medical University | The performance of the CNN was not significantly different from that of classic machine-learning methods and expert radiologists | Expert consensus | CNN | 1397 lymph node stations from 168 patients | Resampling method/no |
Wang [44] | Solitary pulmonary nodule ≤3 cm, histologically confirmed adenocarcinoma/retrospective | Analysis of data collected from Fudan University Shanghai Cancer Center | Algorithm showed an AUROC of 0.892, which was higher than that of three expert radiologists, in classifying invasive adenocarcinoma from pre-invasive lesions | Histology | CNN | CT scans from 1545 patients | Random split sample validation/no |
Zhao [45] | Thin-slice chest CT scan before surgical treatment; nodule diameter ≤10 mm/retrospective | Secondary analysis of data from Huadong Hospital affiliated to Fudan University | Based on classification of tumour invasiveness, deep-learning algorithm achieved better classification performance than the radiologists (63.3% versus 56.6%) | Histology | CNN | Pre-operative thin-slice CT; 523 nodules for training/128 nodules for testing | Not reported/no |
Fibrotic lung diseases | |||||||
Walsh [46] | HRCT showing diffuse fibrotic lung disease confirmed by at least two thoracic radiologists/retrospective | Secondary analysis of data from La Fondazione Policlinico Universitario and University of Parma (Italy) | Interobserver agreement between the algorithm and the radiologists’ majority opinion (n=91) was good (κw=0.69) | Expert consensus | CNN | HRCT; 929 scans for training/89 scans for validation | Not reported/yes |
Christe [47] | HRCT showing NSIP or UIP confirmed by two thoracic radiologists/retrospective | HRCT dataset from the Lung Tissue Research Consortium | Interobserver agreements between the algorithm and the radiologists’ opinion were fair to moderate (κw=0.33 and 0.47) | Expert consensus | CNN | HRCT; 105 patients (54 with NSIP and 51 with UIP) | Not reported/no |
Raghu [48] | Whole-transcriptome RNA sequencing data from transbronchial biopsy samples/prospective | Bronchial Sample Collection for a Novel Genomic Test (BRAVE) study in 29 US and European sites | The molecular signatures had high specificity (88%) and sensitivity (70%) against diagnostic reference pathology (ROC-AUC 0.87, 95% CI 0.76–0.98) | Histology | ML; type not reported | 94 patients in clinical utility analysis | Not reported/no |
PH | |||||||
Sweatt [49] | Peripheral blood biobank/prospective | Peripheral blood biobanked at Stanford University, USA and University of Sheffield, UK | Four distinct immunological clusters were identified. Cluster I had unique sets of upregulated proteins (TRAIL, CCL5, CCL7, CCL4, MIF) and the least favourable 5-year transplant-free survival rate (47.6%, 95% CI 35.4–64.1%) | N/A | Unsupervised ML | Blood biobanked; 281 patients for discovery cohort/104 patients for validation cohort | Resampling method/yes |
Leha [50] | Echocardiographic parameters/retrospective | King's College Hospital (UK); University Medical Center Göttingen and University of Regensburg (Germany) | Among five ML algorithms, random forest of regression trees was the best-performing method for identifying PH patients (AUC 0.87, 95% CI 0.78–0.96), with an accuracy of 0.83 | Right heart catheterisation | Five ML algorithms (random forest of classification trees, random forest of regression trees, lasso-penalised logistic regression, boosted classification trees, SVM) | 90 patients with invasively determined PAP with corresponding echocardiographic estimations of PAP | Resampling method/no |
Asthma | |||||||
Wu [51] | 100 clinical, physiological, inflammatory and demographic variables/prospective | Severe Asthma Research Program (SARP) cohort from the National Heart, Lung, and Blood Institute (USA) | Four asthma clusters with differing CS responses were identified. Patients in the CS-responsive cluster were older and had more nasal polyps and high blood eosinophils. This group showed the greatest increase in lung function after CS | N/A | Unsupervised ML; MML-MKKC | 346 adult asthmatics with paired (before and after CS) sputum data | Random split sample validation/no |
Pleural effusion | |||||||
Khemasuwan [52] | 19 candidate clinical variables from patients with pleural infection/retrospective | A tertiary care, university-affiliated hospital, Utah, USA | Candidate predictors of tPA/DNase failure were the presence of pleural thickening (48% relative importance) and the presence of an abscess/necrotising pneumonia (24%) | N/A | Supervised ML (extreme gradient boosting coupled with decision trees) | 84 patients with pleural infection who received intrapleural tPA/DNase | Random split sample validation/no |
PFT interpretation and clinical diagnosis | |||||||
Topalovic [53] | PFTs and clinical diagnosis/prospective | University Hospital Leuven (Belgium) | Pulmonologists’ interpretation of PFT patterns matched the guideline in 74.4±5.9% of cases and their diagnosis was correct in 44.6±8.7%, whereas the AI algorithm matched the guideline PFT pattern interpretation in 100% of cases and assigned the correct diagnosis in 82% (p<0.0001) | ATS/ERS guideline and expert panel | ML; type not reported | Dataset based on 1430 historical cases/50 cases in prospective analysis | Not reported/yes |
SARS-CoV-2 pandemic | |||||||
Wang [54] | CT chest of patients with atypical pneumonia/retrospective | Xi'an Jiaotong University First Affiliated Hospital, Nanchang University First Hospital and Xi'an No.8 Hospital of Xi'an Medical College (China) | Internal validation achieved a total accuracy of 82.9%, with specificity of 80.5% and sensitivity of 84%. The external testing dataset showed a total accuracy of 73.1%, with specificity of 67% and sensitivity of 74% | Confirmed nucleic acid testing of SARS-CoV-2 | CNN | CT images from 99 patients, of which 44 were confirmed cases of SARS-CoV-2 | Random split sample validation/no |
Li [55] | CT chest of patients with atypical pneumonia/retrospective | Six medical centres, China | The AUC for COVID-19 was 0.96 (95% CI 0.94–0.99), with sensitivity of 90% (95% CI 83–94%) and specificity of 96% (95% CI 93–98%) | Confirmed nucleic acid testing of SARS-CoV-2 | CNN | 4356 chest CT examinations from 3322 patients | Not reported/yes |
PH: pulmonary hypertension; PFT: pulmonary function test; SARS-CoV-2: severe acute respiratory syndrome-coronavirus-2; CT: computed tomography; NLST: National Lung Screening Trial; ROC: receiver operating characteristic; AUC: area under the curve; CNN: convolutional neural network; AUROC: area under the ROC curve; NSCLC: nonsmall cell lung cancer; 18F-FDG PET: fluorine-18 2-fluoro-2-deoxy-d-glucose positron emission tomography; HRCT: high-resolution computed tomography; κw: weighted κ-coefficient; NSIP: nonspecific interstitial pneumonia; UIP: usual interstitial pneumonia; ML: machine learning; TRAIL: tumour necrosis factor-related apoptosis-inducing ligand; CCL: C-C motif chemokine ligand; MIF: macrophage migration inhibitory factor; N/A: not applicable; SVM: support vector machine; PAP: pulmonary arterial pressure; CS: corticosteroids; MML-MKKC: multiview learning-multiple kernel k-means clustering; tPA: tissue plasminogen activator; DNase: deoxyribonuclease; ATS: American Thoracic Society; ERS: European Respiratory Society; COVID-19: coronavirus disease 2019.