Introduction
Acute respiratory failure (ARF) is one of the most common postoperative complications in pediatric cardiac surgery. The pathophysiology of ARF after cardiac surgery is multifactorial and includes adverse effects of general anesthesia, including iatrogenic atelectasis due to induced apnea at the main stage of the operation, cardiopulmonary bypass (CPB), surgical trauma, hypothermia [1, 2, 3], as well as pre-existing pulmonary damage.
General anesthesia reduces the tone of the diaphragm and intercostal muscles and decreases the functional residual capacity of the lungs, which potentially contributes to postoperative atelectasis and an increase in shunt fraction to 15% or higher [4]. CPB induces a systemic inflammatory response with activation of leukocytes, platelets, complement components, the coagulation and fibrinolytic systems, and cytokines, promoting capillary leak, inflammation, parenchymal microatelectasis, and interstitial edema, which lead to post-perfusion lung injury [2, 3, 5]. During revision and pericardiotomy, as well as during the use of a coagulator, there is a risk of phrenic nerve injury, which can cause paralysis or paresis of the diaphragmatic dome and promote the development of atelectasis in the lower lobes of the lungs [6, 7]. Additionally, we hypothesize that pre-existing pulmonary damage and the type of congenital heart defect (CHD) significantly contribute to the development of postoperative ARF, and that CHD may predispose to interstitial edema and atelectasis. In turn, postoperative ARF increases the risk of adverse outcomes, including renal, neurological, and infectious complications, prolonged mechanical ventilation (MV), prolonged intensive care unit (ICU) and hospital stays, and mortality [8].
Classical diagnostic methods for ARF have limitations, particularly in critically ill patients and when transportation is restricted. Over the past decades, an ultrasonography-based method for assessing the respiratory system has been developed and successfully applied in adult practice, and it is being actively implemented in pediatrics. Lung ultrasound allows timely detection of pulmonary pathologies in the ICU. The method demonstrates high sensitivity for various pulmonary pathologies, including alveolar consolidation [9], interstitial syndrome [10, 11], pleural effusion [12, 13], and pneumothorax [9, 14, 15].
Data indicate that lung ultrasound (LUS) can predict postoperative complications in pediatric cardiac surgery and can be used to guide intensive care [16]. For example, in the study by Ghotra GS et al., multivariate analysis showed that a LUS scoring system is an independent predictor of pulmonary complications, duration of MV, and ICU stay [17]. Similar results were obtained by Girona-Alarcón M et al. [18]. In the study by Cantinotti M et al., LUS evaluation was identified as an independent predictor of ICU length of stay and successful tracheal extubation [19]. According to the literature, diaphragmatic ultrasound also allows prediction of successful extubation [20].
Predicting the development of ARF in the postoperative period enables evidence-based decisions on treatment strategies and prevention of complications.
The present pilot study evaluates the technical feasibility of a proposed approach to predict ARF in this patient population, aiding in treatment decisions and minimizing complications. The results provide foundational data for planning larger-scale studies, identifying limitations of the current approach, and proposing solutions. Thus, pilot study results may form the basis for developing and implementing a predictive model in clinical practice.
Objective
To develop a predictive model for postoperative ARF in infants (1–12 months) undergoing radical cardiac surgery using lung ultrasound.
Materials and methods
A prospective, randomized, controlled study was conducted at the intensive care unit (ICU) of the Federal Center for Cardiovascular Surgery in Krasnoyarsk. Sixty infants with CHD aged 1–12 months, admitted between October 2023 and December 2024, were included via total coverage sampling if they met inclusion criteria. This approach ensures representativeness of the sample and minimizes potential biases. The sample size of 60 patients is typical for pediatric cardiac surgery studies, reflecting the rarity of such interventions. Given the limited sample size, a rigorous methodology was applied: model training on a training dataset, freezing, calibration on an independent calibration dataset, and Monte Carlo cross-validation with 1,000 iterations on an independent test dataset.
The study protocols and informed consent forms were approved by the local ethics committee of the Voino-Yasenetsky Krasnoyarsk State Medical University under the Ministry of Health of Russia (Protocol No. 122/2023 dated November 29, 2023). Informed consent was signed by legal guardians of all involved patients upon admission to the cardiac surgery department, prior to study initiation.
Inclusion criteria:
- Age 1–12 months during the study.
- Scheduled primary radical correction of CHD under CPB.
- Informed consent signed by the child’s legal guardian.
Non-inclusion criteria:
- Neonates or children older than 1 year.
- Severe comorbidities in history that could significantly impact the early postoperative period.
- Cyanotic CHD.
- Palliative correction of CHD.
- CHD correction without CPB.
- Absence of signed informed consent.
Exclusion criteria:
- Withdrawal of informed consent.
- In-hospital mortality (1 case).
The study comprised three stages. Stage 1 involved the data collection for model development during preoperative and postoperative assessment periods. Stage 2 consisted of an assessment of acute respiratory failure (ARF) presence/absence during the evaluation period prior to transfer to the specialized department (primary endpoint). Stage 3 included the model creation, testing, statistical analysis, and result formulation (fig. 1).
Fig. 1. Flowchart of the research design. Stage 1 — preoperative and postoperative assessment periods. Stage 2 — assessment period prior to transfer to the specialized department
Lung Ultrasound Protocol
Lung ultrasound was performed using linear probes (5–12 MHz, Philips CX50, Singapore) and (8–18 MHz, GE Logiq E, USA) to assess interstitial syndrome, alveolar consolidation syndrome, air bronchograms, local B+ syndrome, pleural effusion, and pneumothorax. The abdominal mode was used with a scanning depth of 3.0–4.5 cm. For the assessment of diaphragmatic motion characteristics, a convex probe (1–5 MHz, Philips CX50, Singapore) and a microconvex probe (4.2–11 MHz, GE Logiq E, USA) were utilized. The abdominal mode was also used, with scanning depth individually adjusted based on the patient's anatomical features.
The lung ultrasound protocol was conducted by a single researcher during each stage of the study. The following ultrasound syndromes were evaluated:
- Interstitial syndrome
- Alveolar consolidation syndrome
- Pleural effusion (hydrothorax)
- Pneumothorax
- Characteristics of diaphragmatic motion
- Local B+ syndrome
In most studies evaluating the predictive capabilities of lung ultrasound, ultrasound syndromes were assessed using a 36-point scale derived from scoring 12 lung zones [17, 21], or alternative scales [18]. Different ultrasound syndromes were accounted for within a single unified scoring system (e.g., isolated interstitial syndrome scored 1–2 points, while alveolar consolidation in the examined zone scored 3 points [17, 21]). Based on the hypothesis that all ultrasound syndromes contribute differently to the probability of postoperative acute respiratory failure (ARF), it was decided to evaluate each syndrome independently and introduce an original scoring system to quantify the severity of each syndrome separately within the same ultrasound protocol.
The thoracic cavity was divided into two halves — left and right — using the sternum as the anterior landmark and the vertebral column as the posterior landmark. Each half was further subdivided into three regions: anterior, lateral, and posterior. The anterior region extended from the parasternal line to the anterior axillary line. The lateral region extended from the anterior axillary line to the posterior axillary line. The posterior region extended from the posterior axillary line to the vertebral column. Each region was then split into upper and lower zones, approximately equally, due to the lack of standardized landmarks for children aged 1 month to 1 year. In total, six zones were identified in each half of the thoracic cavity, resulting in 12 zones overall. Scanning was performed with the linear probe positioned longitudinally (parallel to the intercostal spaces) from right to left and top to bottom.
Interstitial Syndrome.
Assessment of interstitial syndrome was conducted according to established ultrasound criteria [9, 11, 22–25]. The severity of the interstitial syndrome was evaluated semi-quantitatively based on the density of B-lines. In each segment, the intercostal space with the highest number of B-lines visible in one ultrasound slice was selected.
The severity of the interstitial syndrome was based on the proportion of the intercostal space occupied by B-lines in one ultrasound slice, scored as follows: 0 points – no B-lines present; 1 point – up to 10 % B-lines (isolated B-lines); 2 points – 10–50 % B-lines; 3 points – 50–90 % B-lines; 4 points – more than 90 % B-lines (“white lung”) (fig. 2). In cases where the proportion of B-lines fell on a borderline value (e.g., 50 %), the higher score was assigned (3 points).
The severity of the interstitial syndrome in each lung was calculated separately by summing the scores across all zones: mild interstitial syndrome (≤ 6 points per lung), moderate interstitial syndrome (7–12 points per lung), severe interstitial syndrome (13–18 points per lung), extremely severe interstitial syndrome (≥ 19 points per lung).
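The scoring rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' software: the function names are hypothetical, and applying the "higher score on a borderline" rule to the 10 % and 90 % thresholds (the text gives only the 50 % example) is an assumption.

```python
def b_line_score(percent: float) -> int:
    """Map the share of the intercostal space occupied by B-lines (0-100 %) to 0-4 points."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be within 0-100")
    if percent == 0:
        return 0          # no B-lines
    if percent < 10:
        return 1          # isolated B-lines
    if percent < 50:
        return 2          # 10-50 %; a borderline 10 % takes the higher score (assumption)
    if percent < 90:
        return 3          # 50-90 %; a borderline 50 % takes the higher score (from the text)
    return 4              # "white lung"; a borderline 90 % takes the higher score (assumption)


def lung_severity(zone_scores: list[int]) -> str:
    """Classify one lung from the sum of its six zone scores (0-24 points)."""
    if len(zone_scores) != 6:
        raise ValueError("each lung has six zones")
    total = sum(zone_scores)
    if total <= 6:
        return "mild"
    if total <= 12:
        return "moderate"
    if total <= 18:
        return "severe"
    return "extremely severe"
```

For example, a lung with zone scores of 4, 4, 4, 4, 3, and 0 sums to 19 points and falls into the extremely severe category.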
Results were entered into the database.
Fig. 2. Assessment of the severity of interstitial syndrome in points. The white arrow indicates the B-line — a vertical hyperechoic beam originating from the pleural line (blue arrow): A — 1 point (up to 10 % B-lines in the ultrasound slice — single B-lines); B — 2 points (from 10 % to 50 % B-lines in the ultrasound slice); C — 3 points (from 50 % to 90 % B-lines in the ultrasound slice); D — 4 points (more than 90 % B-lines in one ultrasound slice — “white lung”). (Credit — Pfeifer A.A.)
Alveolar Consolidation Syndrome.
The diagnosis of alveolar consolidation was performed according to established ultrasound criteria [9, 25]. The severity of alveolar consolidation was assessed based on its size in relation to the entire examined zone, scored as follows: 0 points – no alveolar consolidation; 1 point – small subpleural consolidation visible in one intercostal space; 2 points – alveolar consolidation visible in more than one intercostal space but occupying less than 50 % of the examined zone; 3 points – alveolar consolidation occupying more than 50 % of the examined zone; 4 points – alveolar consolidation occupying the entire scanning zone.
In cases where alveolar consolidation was scored as 4 points, counting of B-lines was not feasible. In such cases, the interstitial syndrome score was assigned 0 points. For borderline values of alveolar consolidation severity, the higher score was assigned.
In the presence of alveolar consolidation, air bronchograms were also evaluated [9]. Dynamic and static air bronchograms, as well as their absence, were assessed separately: 0 points – air bronchograms absent due to lack of alveolar consolidation; 1 point – dynamic air bronchogram; 2 points – static air bronchogram; 3 points – no air bronchograms within the area of alveolar consolidation.
Pleural Effusion Syndrome.
Upon detection of pleural effusion signs [9, 25], an approximate volume measurement was conducted. Patients were placed in a strictly horizontal position for 5 minutes to ensure uniform distribution of the effusion within the pleural cavity. Then the thickness of the effusion was measured three times along the posterior axillary line. The maximum thickness (in mm) was substituted into the modified Balik formula [26], adjusted for body weight. Scoring was as follows: 0 points — no pleural effusion, 1 point — minimal pleural effusion (< 5 mL/kg body weight), 2 points — moderate pleural effusion (5–10 mL/kg body weight), 3 points — massive pleural effusion (> 10 mL/kg body weight).
For borderline values of pleural effusion severity, the higher score was assigned.
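As a minimal sketch, the weight-adjusted thresholds can be written as a small Python function. The function name is hypothetical, and the volume is taken as an input because the exact form of the modified Balik formula is not given here.

```python
def pleural_effusion_score(volume_ml: float, weight_kg: float) -> int:
    """Score pleural effusion (0-3 points) from the estimated volume per kg of body weight."""
    if volume_ml < 0 or weight_kg <= 0:
        raise ValueError("volume must be non-negative and weight positive")
    if volume_ml == 0:
        return 0                      # no pleural effusion
    volume_per_kg = volume_ml / weight_kg
    if volume_per_kg < 5:
        return 1                      # minimal effusion
    if volume_per_kg < 10:
        return 2                      # moderate effusion
    return 3                          # massive effusion; a borderline 10 mL/kg takes the higher score
```

With this rule, an estimated 30 mL in a 6 kg infant (5 mL/kg) scores 2 points, and 60 mL in the same infant (10 mL/kg) scores 3 points.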
Pneumothorax.
Pneumothorax was assessed with the patient lying strictly horizontally (on the back). The “lung point” — the boundary of the pneumothorax — was determined [9]. Scoring was as follows: 0 points — no pneumothorax, 1 point — minor pneumothorax: lung point does not reach the anterior axillary line, 2 points — moderate pneumothorax: lung point located between the anterior and middle axillary lines, 3 points — massive pneumothorax: lung point located posterior to the middle axillary line or not identifiable.
For borderline values of pneumothorax severity, the higher score was assigned.
Diaphragmatic Motion Characteristics.
The motion of both diaphragmatic domes was assessed in "B" and "M" modes according to a previously described method [27]. In spontaneous breathing, diaphragmatic motion was classified as follows: 0 points — normal: synchronous movement of both diaphragmatic domes; caudal movement during inspiration; upward-directed wave in M-mode; 1 point — paresis (paralysis): no or minimal dome movement; isoline or low-amplitude upward-directed wave in M-mode; 2 points — paradoxical motion: cranial movement of the diaphragmatic dome; inverted wave in M-mode.
During mechanical ventilation in the appropriate assessment period, diaphragmatic motion was classified as “normal”.
Local B+ Syndrome.
Local B+ syndrome was documented separately for each lung zone when compact clusters of multiple confluent B-lines appeared, significantly exceeding the number of B-lines in other intercostal spaces within the same zone. Results were entered into the electronic database in a binary format (present/absent, 1/0).
Preoperative and Postoperative Assessment Periods
Intensive Care Groups
The authors are conducting a parallel study of the effectiveness of lung ultrasound-guided intensive care that includes all 60 patients. Because the intensive care strategy (with or without ultrasound guidance) may be clinically relevant to the predictive model, patients were randomized into two groups at the preoperative stage, upon admission to Cardiac Surgery Department No. 4:
- Group 1 — Ultrasound-Guided intensive care group: Postoperative intensive care was administered based on lung ultrasound data, with dynamic ultrasound monitoring performed. Lung ultrasound was conducted at least three times at the postoperative stage: within 2 hours after the patient's transfer from the operating room to the intensive care unit (ICU) of the Pediatric Department of Anesthesiology and Intensive care, 2–4 hours after the initiation of therapy, and following tracheal extubation. The frequency and number of lung ultrasounds in this group were not limited.
- Group 2 — Non-Ultrasound-Guided intensive care group: Intensive care was administered based on clinical examination, laboratory findings, instrumental data, the type of congenital heart defect (CHD), the surgery, and the experience of the attending intensivist. Results of the lung ultrasound were not disclosed to the attending intensivist, and no recommendations were provided. Intensive care was performed without the use of lung ultrasound, which was accounted for during database formation and predictive model development to enhance its accuracy, sensitivity, and specificity.
Demographic Data
The database included the following demographic parameters: sex, weight, height, and age.
Lung Ultrasound
Lung ultrasound was performed by the researcher according to the original protocol described above. The database included preoperative lung ultrasound results obtained 18–24 hours before surgery and postoperative results obtained within 2 hours after the patient's transfer from the operating room to the ICU of the Pediatric Department of Anesthesiology and Intensive care.
Intraoperative Data
This section documented the impact of CHD on pulmonary circulation (CHD with hypervolemia of the pulmonary circulation or CHD without hypervolemia of the pulmonary circulation), cardiopulmonary bypass (CPB) duration in minutes, and aortic occlusion duration in minutes.
The study included children with the following types of CHD:
- CHD with hypervolemia of the pulmonary circulation: septal defects (ventricular septal defect, atrial septal defect, or their combinations).
- CHD without hypervolemia of the pulmonary circulation: pulmonary valve stenosis, aortic valve stenosis, coarctation of the aorta, anomalous origin of the left coronary artery from the pulmonary artery, and mitral valve insufficiency.
Features of ultrasound-guided intensive care
Interstitial Syndrome
Therapy for interstitial syndrome was adjusted according to its severity on lung ultrasound and included diuretic therapy, reduction of fluid infusion, and maintenance of normal serum protein levels. For mild interstitial syndrome, no therapeutic adjustments were made; instead, dynamic fluid balance monitoring was performed with repeat lung ultrasound evaluations 2–4 hours after the initiation of therapy and following tracheal extubation.
It is well established that diuretic therapy reduces extravascular lung water, leading to decreased interstitial syndrome severity on lung ultrasound [28]. For moderate-to-severe interstitial syndrome, diuretic therapy was administered as fractional intravenous doses of furosemide (0.2–0.4 mg/kg), with a possible switch to continuous infusion up to a maximum dose of 1 mg/kg/h under close clinical monitoring of diuresis, fluid balance, and lung ultrasound findings.
According to published data, reduced albumin levels cause extravascular lung fluid accumulation due to decreased oncotic pressure, and maintaining normal albumin levels improves ventilation-perfusion relationships [29]. Therefore, in cases of severe or extremely severe interstitial syndrome identified on lung ultrasound, total protein levels were assessed, monitored, and maintained through intravenous albumin administration.
Alveolar Consolidation Syndrome
Intensive care for alveolar consolidation syndrome involved recruitment maneuvers, adjustment of positive end-expiratory pressure (PEEP), patient positioning, and postural drainage with airway clearance.
Recruitment maneuvers were performed under ultrasound guidance using a stepwise increase in PEEP levels according to a modified technique [30]. Unlike the original method, we did not reduce PEEP to zero but instead personalized it for each patient, as described below. These maneuvers were performed manually without reliance on automated ventilator modes. If the lung proved recruitable (reduction in consolidation size on ultrasound), the maneuver was repeated after every disconnection of the breathing circuit.
Given evidence that excessive PEEP may impair lymphatic drainage of lung parenchyma, potentially increasing extravascular lung water [31], PEEP levels were personalized based on the total score of alveolar consolidation syndrome: PEEP 5 in the absence of consolidations, PEEP 6 for scores of 1–8, and PEEP 7 for scores exceeding 8.
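The PEEP personalization rule can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and the value is returned in the same unspecified pressure units used in the text.

```python
def personalized_peep(consolidation_total: int) -> int:
    """Return the PEEP level for a given total alveolar consolidation score."""
    if consolidation_total < 0:
        raise ValueError("total score cannot be negative")
    if consolidation_total == 0:
        return 5      # no consolidations
    if consolidation_total <= 8:
        return 6      # total score of 1-8
    return 7          # total score above 8
```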
Patients were positioned and postural drainage was performed based on zones with the most severe alveolar consolidation syndrome, as these interventions have been shown to positively affect atelectatic lung tissue [32].
Pleural Effusion
Therapy was administered in accordance with the volume of pleural effusion. For effusions < 5 mL/kg body weight, conservative treatment with furosemide and reduced fluid infusion was implemented, followed by dynamic monitoring. For effusions between 5–10 mL/kg, therapy depended on the severity of respiratory failure symptoms, with decisions regarding pleural drainage or conservative management made in consultation with pediatric cardiac surgeons. For effusions > 10 mL/kg, pleural drainage was performed.
Pneumothorax
Small pneumothoraxes were monitored dynamically or drained after discussion with the head of the Pediatric Department of Anesthesiology and Intensive care and pediatric cardiac surgeons. Moderate and large pneumothoraxes were managed with pleural drainage.
Abnormal Diaphragmatic Motion
In cases of abnormal diaphragmatic motion, pediatric cardiac surgeons were consulted to discuss diaphragmatic plication.
Assessment period before transfer to the specialized department
During this assessment period, the presence or absence of acute respiratory failure (ARF) criteria was evaluated.
ARF Criteria
The evaluation of ARF in patients was conducted immediately prior to transfer from the Pediatric Department of Anesthesiology and Intensive care to Cardiac Surgery Department No. 4. The decision to transfer was made collaboratively, involving the head of the Pediatric Department of Anesthesiology and Intensive care and the head of Cardiac Surgery Department No. 4, taking into account the patient's clinical condition, laboratory findings, and instrumental data.
To assess ARF manifestations, all children were transitioned to spontaneous breathing with atmospheric oxygen for 10 minutes, followed by arterial blood sampling to measure PaO2, SaO2, and PaO2/FiO2. In this study, the criteria for ARF included: PaO2/FiO2 < 300 and/or SaO2 < 92% and/or an increase in respiratory rate > 30% compared to baseline values. All criteria were recorded in the database, and a conclusion was noted: “ARF present” or “ARF absent”.
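The composite ARF criterion can be sketched as follows. This is a hedged illustration, not the study software: the function and parameter names are hypothetical, PaO2 is assumed to be in mmHg, and FiO2 is assumed to be 0.21 for breathing atmospheric air.

```python
def has_arf(pao2_mmhg: float, sao2_percent: float,
            resp_rate: float, baseline_resp_rate: float,
            fio2: float = 0.21) -> bool:
    """Return True if at least one ARF criterion from the study is met."""
    if fio2 <= 0 or baseline_resp_rate <= 0:
        raise ValueError("FiO2 and baseline respiratory rate must be positive")
    pf_ratio = pao2_mmhg / fio2                                   # PaO2/FiO2
    rr_increase = (resp_rate - baseline_resp_rate) / baseline_resp_rate
    return pf_ratio < 300 or sao2_percent < 92 or rr_increase > 0.30
```

For instance, a PaO2 of 60 mmHg on room air gives a PaO2/FiO2 of roughly 286, which alone satisfies the criterion.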
Statistical Analysis
Data collection and preliminary analysis were performed using a custom program written in C# that facilitated interaction with an SQLite database and generated Excel files, which were used to develop predictive models. For normally distributed data, results were presented as mean ± standard deviation (M ± SD). Quantitative data not following a normal distribution were described as Me (Q1; Q3), where Me represents the median, Q1 is the first quartile (25th percentile), and Q3 is the third quartile (75th percentile). Frequency data were reported as N (%), where N is the absolute number of observations in the group, and % is the percentage of observations in the group. A p-value < 0.05 was considered statistically significant.
The normality of data distribution was assessed using the Shapiro-Wilk test. Since most parameters significantly deviated from normal distribution, comparisons of independent quantitative variables were performed using the nonparametric Mann-Whitney U-test. For parameters with normal distributions, Student's t-test was used for comparative analysis.
A total of 60 patients were included in the study and randomly divided into three datasets: training set — 42 patients (70 %), calibration set — 6 patients (10 %), and test set — 12 patients (20 %).
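A random 70/10/20 allocation of this kind could look like the sketch below. The function name and the fixed seed are assumptions made for reproducibility of the example; whether the original allocation was stratified by outcome is not stated, so plain shuffling is used here.

```python
import random


def split_patients(patient_ids, seed=0):
    """Shuffle patient IDs and split them 70 % / 10 % / 20 % into train, calibration, test."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # seeded shuffle so the split is repeatable
    n_train = round(len(ids) * 0.70)
    n_cal = round(len(ids) * 0.10)
    train = ids[:n_train]
    calibration = ids[n_train:n_train + n_cal]
    test = ids[n_train + n_cal:]
    return train, calibration, test


train, calibration, test = split_patients(range(1, 61), seed=42)
print(len(train), len(calibration), len(test))
```

With 60 patients this yields subsets of 42, 6, and 12 patients.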
The Random Forest method—an ensemble approach based on multiple decision trees—was employed to create the predictive model. Each tree was trained on a random subset of data and features, reducing the risk of overfitting compared to individual decision trees. Ensemble modeling improved prediction accuracy through averaging results across multiple models, effectively handling nonlinear relationships between features, small sample sizes, and heterogeneous data.
Random allocation of patients into training, calibration, and test sets ensured no statistically significant differences in baseline characteristics, including the primary class (patients with ARF). This approach maximized representativeness, minimized bias, reduced overfitting, and enhanced study reliability, especially given the limited dataset.
Model training and calibration were performed on separate datasets, followed by model freezing and preservation. The calibration set was used for configuring hyperparameters of the machine learning model (MLM), and calibration quality was assessed using the Brier Score. Subsequently, the model was tested on the test set without cross-validation, with ROC analysis performed to calculate the area under the curve (AUC). To further evaluate MLM performance, Monte Carlo cross-validation with 1,000 iterations was conducted. Final metrics were calculated as the average of all iterations. Commonly accepted performance metrics included accuracy, recall, specificity, precision, and F1-score, derived from the confusion matrix, which identified true positives, true negatives, false positives, and false negatives.
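For reference, the Brier Score is the mean squared difference between predicted probabilities and observed binary outcomes; lower values indicate better calibration. A minimal pure-Python sketch (the function name is hypothetical, and scikit-learn offers an equivalent in `sklearn.metrics.brier_score_loss`):

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```

A model that predicts 0.9 for a patient with ARF and 0.1 for a patient without ARF has a Brier Score of 0.01 on those two cases.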
A predictive model based on logistic regression (LRM) was also developed, with separate datasets used for training and testing. Univariate and multivariate regression analyses were performed, and a logistic regression equation was formulated. The model was tested on the test set with ROC analysis and AUC calculation. Monte Carlo cross-validation with 1,000 iterations was applied to further evaluate the performance of the logistic regression model. Performance metrics were calculated similarly using the confusion matrix.
All data processing, predictive model development (MLM and LRM), statistical analysis, evaluation of model effectiveness, and data visualization were implemented using algorithms written in Python (Python Software Foundation, Version 3.12.0, Wilmington, DE: Python Software Foundation). Key libraries used included numpy, pandas, sklearn, sqlite3, scipy, matplotlib, seaborn, docx, joblib, and statsmodels.
The choice of the Random Forest methodology with data allocation into three subsets (training, calibration, and testing), followed by model freezing, calibration, and Monte Carlo cross-validation, was driven by the following factors: the limited dataset (only 60 patients), the need for reliable and interpretable results, the importance of assessing and calibrating probability predictions, and the need to maintain stability and accuracy for medical applications. This methodology balances model quality against the constraints of a small dataset.
Using a frozen model (a model trained on the training dataset and fixed without further parameter tuning) for calibration and cross-validation offers several advantages that enhance prediction quality and result reliability:
1) Resistance to overfitting and stable predictions. Once the model is frozen after training, its parameters remain unchanged during calibration or testing. This helps avoid the risk of overfitting to new data, as calibration is performed solely to adjust probabilistic predictions rather than to alter the model's structure or parameters. Thus, the model retains its ability to generalize based on the data from the training set.
2) Reusability. A fixed model can be easily reused for other tasks or datasets since its parameters remain constant. This makes it more versatile and practical for real-world applications.
3) Objective assessment of generalization capability. Since the model is already trained and fixed, the results of cross-validation reflect its performance on unseen data rather than its ability to adapt to each data split. This minimizes the risk of optimistic bias in metrics that could arise if the model were retrained on each split.
4) Stable cross-validation results. Using a fixed model leads to more stable cross-validation outcomes because the model is not influenced by random factors associated with retraining. This allows for more reliable mean values and distributions of metrics.
5) Improved interpretability. Cross-validation results obtained using a frozen model are easier to interpret, as they demonstrate how the model performs on data it has not encountered during training. This provides a clear understanding of its real-world performance.
In addition, Monte Carlo cross-validation with 1,000 iterations ensures stability and reliability of the results, compensating for limitations in dataset size, and calibration improves the adequacy of probabilistic predictions.
This methodology ensures reliability, objectivity, and practical applicability of the predictive model in clinical settings.
Results
The baseline characteristics of the patients are presented in table 1. The baseline characteristics of patients in the training and calibration datasets are shown in table 2, while the baseline characteristics of patients in the test dataset are presented in table 3. No statistically significant differences were observed between the datasets.
The training and test datasets used for training and testing both models, the Random Forest Machine Learning Model (MLM) and the Logistic Regression Model (LRM), were identical in terms of size, included patients, and considered factors.
During MLM training, hyperparameter tuning was performed, followed by calibration on the calibration dataset. The following hyperparameters were adjusted: n_estimators (number of trees in the forest): increasing the number of trees can improve model quality but increases training time; max_depth (maximum depth of the tree): limits tree depth to prevent overfitting; min_samples_split (minimum number of samples required to split an internal node): controls the minimum number of samples required to split a node; min_samples_leaf (minimum number of samples required at each leaf node): ensures sufficient data in terminal nodes to avoid overfitting; max_features (maximum number of features considered for splitting a node): influences model randomness; bootstrap (whether bootstrap samples are used when building trees): enables sampling with replacement.
The following hyperparameter values were set: n_estimators = 100, max_depth = 30, min_samples_split = 5, min_samples_leaf = 2, bootstrap = True.
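A hedged sketch of how a Random Forest with these reported values could be configured, fitted, and frozen with scikit-learn and joblib is shown below. The synthetic data from `make_classification`, the `random_state` values, and the file name are assumptions for illustration only; `max_features` is left at the library default because its value is not reported.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 42-patient training set (not study data).
X, y = make_classification(n_samples=42, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,       # reported value
    max_depth=30,           # reported value
    min_samples_split=5,    # reported value
    min_samples_leaf=2,     # reported value
    bootstrap=True,         # reported value
    random_state=0,         # assumption: fixed seed for a repeatable example
)
model.fit(X, y)

# "Freeze" the fitted model by saving it, then reload it for prediction only.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_frozen.joblib")   # hypothetical file name
    joblib.dump(model, path)
    frozen = joblib.load(path)

arf_probability = frozen.predict_proba(X)[:, 1]         # predicted probability of class 1
```

The reloaded model is used only through `predict_proba`, so its parameters cannot change during later evaluation.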
The Brier Score was calculated as 0.095, and a calibration curve was constructed (fig. 3). The model was then frozen (saved). Subsequently, the model was tested on the test dataset, including ROC analysis and construction of the ROC curve (fig. 4). The AUC for the MLM was 0.929.
In the second step, Monte Carlo cross-validation with 1,000 iterations was conducted on the saved MLM to assess its generalization capability. During each iteration, the original 12 test cases were randomly divided into subsets: 9 cases were used for simulated training to verify model consistency with the training data, while the remaining 3 cases were used exclusively for testing. Importantly, the model was not retrained during this process; it was used solely for prediction. Performance metrics were calculated for each split, and the final evaluation was determined as the average across all iterations. The calculated metrics were as follows: accuracy = 0.922 (92.2%), recall (sensitivity) = 1.0 (100%), specificity = 0.867 (86.7%), precision = 0.840, F1-score = 0.913.
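The evaluation loop described above can be sketched in pure Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, predictions of the frozen model are computed once beforehand, and metrics that are undefined for a given 3-case subset (for example, recall when the subset contains no ARF cases) are skipped when averaging.

```python
import random
from statistics import mean


def split_metrics(y_true, y_pred):
    """Confusion-matrix metrics for one subset; None marks an undefined ratio."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def monte_carlo_eval(y_true, y_pred, n_iter=1000, holdout=3, seed=0):
    """Average metrics over random held-out subsets; the model is never retrained."""
    rng = random.Random(seed)
    collected = {}
    for _ in range(n_iter):
        chosen = rng.sample(range(len(y_true)), holdout)
        metrics = split_metrics([y_true[i] for i in chosen],
                                [y_pred[i] for i in chosen])
        for name, value in metrics.items():
            if value is not None:               # skip undefined metrics (assumption)
                collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}
```

Because only the selection of evaluated cases varies between iterations, the spread of the per-iteration metrics reflects sampling variability of the test cases rather than variability of the model.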
The results are summarized in table 4, and the confusion matrix is presented in figure 5.
Additionally, a second predictive model based on logistic regression (LRM) was created, including the same patients in the training dataset (42 patients) and test dataset (12 patients), accounting for the same factors as in the MLM. Patients from the calibration dataset were not included in the creation or testing of this model. Univariate (table 5) and multivariate (table 6) regression analyses were performed. In the univariate analysis, 26 statistically significant factors were identified. A logistic regression equation was formulated:
Logit(P) = -0.8556
+ 343.1066 * MainGroup
+ 6.2992 * LeftLungLateralUp_BLines_0
- 152.5780 * LeftLungLateralUp_LocalBSyndrome_1
- 20.8457 * LeftLungPosteriorUp_Consolidations_1
+ 49.0374 * LeftLungPosteriorUp_AirBronchograms_1
+ 4.2821 * RightLungAnteriorUp_BLines_0
+ 43.7875 * RightLungAnteriorUp_BLines_1
+ 0.9157 * RightLungLateralUp_BLines_0
+ 45.5332 * RightLungLateralUp_BLines_1
- 65.3959 * RightLungLateralUp_Consolidations_1
- 43.1994 * RightLungLateralUp_AirBronchograms_1
+ 140.8244 * RightLungPosteriorUp_BLines_1
+ 2.1094 * RightLungPosteriorUp_Consolidations_0
+ 19.6377 * RightLungPosteriorUp_Consolidations_1
+ 14.4657 * RightLungPosteriorUp_AirBronchograms_0
+ 5.8946 * RightLungPosteriorUp_AirBronchograms_1
+ 16.1164 * RightLungLateralDown_Consolidations_1
- 101.0784 * RightLungLateralDown_LocalBSyndrome_0
+ 77.5300 * RightLungLateralDown_AirBronchograms_1
+ 11.0685 * RightLungPosteriorDown_BLines_1
- 35.5075 * RightLungPosteriorDown_Consolidations_1
+ 37.3142 * RightLungPosteriorDown_AirBronchograms_1
+ 27.4746 * RightLungIntersticialSyndrome_0
- 31.8331 * RightLungIntersticialSyndrome_1
- 14.5012 * PatientHeight
+ 0.9210 * Age
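A logit of this form is converted to a predicted probability with the logistic function, P = 1 / (1 + exp(-Logit(P))). The sketch below shows the mechanics only: the helper names are hypothetical, and the coefficient dictionary is a deliberately truncated excerpt of the equation, so its output is not a clinically meaningful prediction.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def linear_predictor(intercept: float, coefficients: dict, features: dict) -> float:
    """Intercept plus the weighted sum of features; missing features count as 0."""
    return intercept + sum(weight * features.get(name, 0.0)
                           for name, weight in coefficients.items())


# Excerpt of the published coefficients (illustration only, not the full model).
INTERCEPT = -0.8556
COEFFICIENTS_EXCERPT = {
    "RightLungLateralUp_BLines_0": 0.9157,
    "RightLungPosteriorUp_Consolidations_0": 2.1094,
}

logit = linear_predictor(INTERCEPT, COEFFICIENTS_EXCERPT,
                         {"RightLungLateralUp_BLines_0": 1})
probability = logit_to_probability(logit)
```

The stable form of the logistic function avoids overflow when the linear predictor is large in magnitude, which is possible here given coefficients in the hundreds.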
The model was tested on the test dataset with ROC analysis and construction of the ROC curve (Fig. 4). The AUC for the LRM was 0.70.
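For readers less familiar with ROC analysis, the AUC can be read as the probability that a randomly chosen patient with ARF receives a higher predicted score than a randomly chosen patient without ARF. The sketch below computes it by direct pair counting; the scores are invented and `roc_auc` is a hypothetical helper, not the routine used in the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented scores for a small test set (1 = ARF, 0 = no ARF).
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = roc_auc(y_true, scores)  # 5 of 6 pairs are ordered correctly
```

With a small test set the number of positive-negative pairs is small, so the AUC can only take a limited set of values and a single reclassified patient shifts it noticeably.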
Subsequently, cross-validation was performed on the test dataset using the Monte Carlo method with 1,000 iterations, following principles identical to those used for the MLM. As with the MLM, the LRM was not retrained during these iterations. The following metrics were calculated: accuracy = 0.742 (74.2%), recall (sensitivity) = 0.387 (38.7%), specificity = 1.000 (100%), precision = 1.000, F1-score = 0.558 (table 7). The confusion matrix was constructed (fig. 5).
A comparison of the metrics of the two predictive models — MLM and LRM — was conducted (table 7).
| Parameter | Value |
|---|---|
| Age (days) | 114.0 (61.5; 184.8) |
| Patient weight (kg) | 5.8 (4.9; 6.8) |
| Patient height (cm) | 61.0 (56.8; 64.0) |
| Sex (male) | 29 (48.3%) |
| CHD with pulmonary hypervolemia | 42 (70.0%) |
| Respiratory failure | 35 (58.3%) |
| Duration of cardiopulmonary bypass (min) | 50.0 (40.0; 70.0) |
| Duration of aortic occlusion (min) | 28.5 (22.8; 42.5) |
| Parameter | Training sample (n = 42) | Test sample (n = 12) | p-value |
|---|---|---|---|
| Age (days) | 120.00 (60.50; 186.25) | 170.67 ± 104.68 | 0.382 |
| Weight (kg) | 5.97 ± 1.76 | 6.54 ± 2.37 | 0.377 |
| Height (cm) | 61.10 ± 8.02 | 64.17 ± 8.65 | 0.265 |
| Male sex, n (%) | 22 (52.4) | 6 (50.0) | 1.000 |
| CHD with pulmonary hypervolemia, n (%) | 33 (78.6) | 6 (50.0) | 0.113 |
| Respiratory failure, n (%) | 25 (59.5) | 7 (58.3) | 0.442 |
| Duration of cardiopulmonary bypass (min) | 50.00 (43.00; 75.75) | 45.00 (38.00; 75.00) | 0.755 |
| Duration of aortic occlusion (min) | 28.50 (23.25; 43.50) | 27.50 (23.50; 40.00) | 0.992 |
| Parameter | Training sample (n = 42) | Calibration sample (n = 6) | p-value |
|---|---|---|---|
| Age (days) | 120.00 (60.50; 186.25) | 76.67 ± 26.29 | 0.078 |
| Weight (kg) | 5.97 ± 1.76 | 5.69 (4.27; 5.70) | 0.249 |
| Height (cm) | 61.10 ± 8.02 | 55.67 ± 3.82 | 0.117 |
| Male sex, n (%) | 22 (52.4) | 3 (50.0) | 1.000 |
| CHD with pulmonary hypervolemia, n (%) | 33 (78.6) | 3 (50.0) | 0.313 |
| Respiratory failure, n (%) | 25 (59.5) | 5 (83.3) | 0.499 |
| Duration of cardiopulmonary bypass (min) | 50.00 (43.00; 75.75) | 58.00 ± 12.79 | 0.492 |
| Duration of aortic occlusion (min) | 28.50 (23.25; 43.50) | 31.33 ± 11.37 | 0.938 |
| Parameter | Group 1 (n = 6) | Group 2 (n = 6) | p-value |
|---|---|---|---|
| Age (days) | 191.67 ± 113.29 | 149.67 ± 90.55 | 0.532 |
| Weight (kg) | 7.12 ± 3.04 | 5.96 ± 1.14 | 0.439 |
| Height (cm) | 64.50 ± 11.59 | 63.83 ± 3.89 | 0.905 |
| Male sex, n (%) | 4 (66.7) | 4 (66.7) | 0.564 |
| CHD with pulmonary hypervolemia, n (%) | 4 (66.7) | 4 (66.7) | 0.564 |
| Respiratory failure, n (%) | 4 (66.7) | 5 (83.3) | 0.242 |
| Duration of cardiopulmonary bypass (min) | 57.50 (42.50; 66.50) | 39.00 (38.00; 84.25) | 0.630 |
| Duration of aortic occlusion (min) | 27.50 (21.75; 32.50) | 27.50 (24.50; 50.75) | 0.818 |
Fig. 3. Calibration curve of the MLM (machine learning algorithm). Brier Score = 0.095
| Factor | Coefficient | p-value | 95% CI (Lower) | 95% CI (Upper) |
|---|---|---|---|---|
| Group | 3.296 | < 0.001 | 1.860 | 4.732 |
| Patient Height | –0.078 | 0.031 | –0.150 | –0.007 |
| Patient Age | –0.008 | 0.017 | –0.014 | –0.001 |
| Lateral Upper Part of the Left Lung, Interstitial Syndrome (Preoperative) | 0.809 | 0.026 | 0.096 | 1.522 |
| Lateral Upper Part of the Left Lung, Local B+ Syndrome (Postoperative) | –1.266 | 0.029 | –2.399 | –0.132 |
| Posterior Upper Part of the Left Lung, Consolidations (Postoperative) | 0.616 | 0.021 | 0.091 | 1.141 |
| Posterior Upper Part of the Left Lung, Air Bronchograms (Postoperative) | 0.635 | 0.025 | 0.080 | 1.191 |
| Anterior Upper Part of the Right Lung, Interstitial Syndrome (Preoperative) | 0.719 | 0.039 | 0.035 | 1.403 |
| Lateral Upper Part of the Right Lung, Interstitial Syndrome (Preoperative) | 0.650 | 0.033 | 0.053 | 1.248 |
| Lateral Upper Part of the Right Lung, Interstitial Syndrome (Postoperative) | 1.072 | 0.008 | 0.274 | 1.870 |
| Lateral Upper Part of the Right Lung, Consolidations (Postoperative) | 0.859 | 0.048 | 0.009 | 1.710 |
| Lateral Upper Part of the Right Lung, Air Bronchograms (Postoperative) | 0.876 | 0.034 | 0.065 | 1.686 |
| Posterior Upper Part of the Right Lung, Interstitial Syndrome (Postoperative) | 1.295 | 0.003 | 0.428 | 2.162 |
| Posterior Upper Part of the Right Lung, Consolidations (Preoperative) | 0.632 | 0.015 | 0.122 | 1.142 |
| Posterior Upper Part of the Right Lung, Consolidations (Postoperative) | 0.965 | 0.002 | 0.357 | 1.574 |
| Posterior Upper Part of the Right Lung, Air Bronchograms (Preoperative) | 0.579 | 0.035 | 0.041 | 1.118 |
| Posterior Upper Part of the Right Lung, Air Bronchograms (Postoperative) | 0.824 | 0.005 | 0.244 | 1.403 |
| Lateral Lower Part of the Right Lung: Consolidations (Postoperative) | 0.694 | 0.016 | 0.132 | 1.257 |
| Lateral Lower Part of the Right Lung: Local B+ Syndrome (Preoperative) | –2.015 | 0.018 | –3.679 | –0.351 |
| Lateral Lower Part of the Right Lung: Air Bronchograms (Postoperative) | 0.796 | 0.012 | 0.175 | 1.416 |
| Posterior Lower Part of the Right Lung: Interstitial Syndrome (Postoperative) | 0.695 | 0.040 | 0.030 | 1.360 |
| Posterior Lower Part of the Right Lung: Consolidations (Postoperative) | 0.845 | 0.002 | 0.320 | 1.369 |
| Posterior Lower Part of the Right Lung: Air Bronchograms (Postoperative) | 1.099 | < 0.001 | 0.487 | 1.710 |
| Right Lung: Interstitial Syndrome (Preoperative) | 0.859 | 0.016 | 0.163 | 1.556 |
| Right Lung: Interstitial Syndrome (Postoperative) | 0.765 | 0.041 | 0.030 | 1.499 |
| Factor | Coefficient |
|---|---|
| Group | 343.107 |
| Lateral Upper Part of the Left Lung, Interstitial Syndrome (Preoperative) | 6.299 |
| Lateral Upper Part of the Left Lung, Local B+ Syndrome (Postoperative) | –152.578 |
| Posterior Upper Part of the Left Lung, Consolidations (Postoperative) | –20.846 |
| Posterior Upper Part of the Left Lung, Air Bronchograms (Postoperative) | 49.037 |
| Anterior Upper Part of the Right Lung, Interstitial Syndrome (Preoperative) | 4.282 |
| Anterior Upper Part of the Right Lung, Interstitial Syndrome (Postoperative) | 43.788 |
| Lateral Upper Part of the Right Lung, Interstitial Syndrome (Preoperative) | 0.916 |
| Lateral Upper Part of the Right Lung, Interstitial Syndrome (Postoperative) | 45.533 |
| Lateral Upper Part of the Right Lung, Consolidations (Postoperative) | –65.396 |
| Lateral Upper Part of the Right Lung, Air Bronchograms (Postoperative) | –43.199 |
| Posterior Upper Part of the Right Lung, Interstitial Syndrome (Postoperative) | 140.824 |
| Posterior Upper Part of the Right Lung, Consolidations (Preoperative) | 2.109 |
| Posterior Upper Part of the Right Lung, Consolidations (Postoperative) | 19.638 |
| Posterior Upper Part of the Right Lung, Air Bronchograms (Preoperative) | 14.466 |
| Posterior Upper Part of the Right Lung, Air Bronchograms (Postoperative) | 5.895 |
| Lateral Lower Part of the Right Lung: Consolidations (Postoperative) | 16.116 |
| Lateral Lower Part of the Right Lung: Local B+ Syndrome (Preoperative) | –101.078 |
| Lateral Lower Part of the Right Lung: Air Bronchograms (Postoperative) | 77.530 |
| Posterior Lower Part of the Right Lung: Interstitial Syndrome (Postoperative) | 11.068 |
| Posterior Lower Part of the Right Lung: Consolidations (Postoperative) | –35.508 |
| Posterior Lower Part of the Right Lung: Air Bronchograms (Postoperative) | 37.314 |
| Right Lung: Interstitial Syndrome (Preoperative) Sum of Scores | 27.475 |
| Right Lung: Interstitial Syndrome (Postoperative) Sum of Scores | –31.833 |
| Patient Height | –14.501 |
| Patient Age | 0.921 |
Fig. 4. Results of the ROC analysis of predictive models before performing cross-validation. The ROC curve of the predictive model based on logistic regression (upper graph) has an AUC of 0.70. The ROC curve of the predictive model based on machine learning (lower graph) demonstrates an AUC of 0.929
Fig. 5. Confusion Matrix after performing Monte Carlo cross-validation with 1000 iterations for the predictive model based on logistic regression (upper matrix) and the model based on machine learning (lower matrix)
| Predictive models | Accuracy | Recall | Specificity | Precision | F1-score | AUC |
|---|---|---|---|---|---|---|
| MLM | 0.922 | 1.000 | 0.867 | 0.840 | 0.913 | 0.929 |
| LRM | 0.742 | 0.387 | 1.000 | 1.000 | 0.558 | 0.700 |
Discussion
Random allocation of patients into training, calibration, and test datasets allowed the models to be trained on one part of the data and validated on another, unseen part, which helped limit overfitting. There were no statistically significant differences between the datasets, indicating that the models were trained and tested on data with similar characteristics. This suggests that the models may perform comparably on new data, as they were not fitted to the specific features of a single dataset. The calibration dataset was used to tune the hyperparameters of the machine learning model (MLM), allowing us to explore how changes in parameters affect model performance and to select the most appropriate settings, which in turn improved the accuracy of predictions. The following hyperparameter values were selected for the MLM based on the Random Forest algorithm:
- n_estimators: 100. The number of trees in the forest, which enhances model stability and accuracy.
- max_depth: 30. The maximum depth of each tree, which helps control overfitting by allowing trees to be deep enough to capture complex patterns but not so deep that they start fitting noise in the data.
- min_samples_split: 5. The minimum number of samples required to split a node. This value helps prevent the creation of nodes that are too specific to the training dataset.
- min_samples_leaf: 2. The minimum number of samples that must be present in a leaf node. This value reduces overfitting by ensuring that each leaf contains sufficient data.
- bootstrap: True. Bootstrapping is used to create subsamples of the data for training each tree. This improves the generalization ability of the model, as each tree is trained on a random subsample of the data.
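The configuration listed above can be written down as a short sketch. The parameter names coincide with those of scikit-learn's `RandomForestClassifier`, but the library actually used in the study is an assumption here, as are the helper name `build_random_forest` and the added `random_state`.

```python
# Hyperparameter values quoted in the text; the parameter names match scikit-learn's
# RandomForestClassifier, but the library actually used is an assumption.
RF_PARAMS = {
    "n_estimators": 100,
    "max_depth": 30,
    "min_samples_split": 5,
    "min_samples_leaf": 2,
    "bootstrap": True,
}

def build_random_forest(random_state=0):
    """Return an unfitted classifier; random_state is an added assumption for reproducibility."""
    # Imported lazily so the configuration can be inspected without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier
    return RandomForestClassifier(random_state=random_state, **RF_PARAMS)
```

A model built this way would then be fitted with `build_random_forest().fit(X_train, y_train)`, where `X_train` and `y_train` are placeholders for the training features and ARF labels.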
These hyperparameter values were chosen to optimize model performance and to balance accuracy against generalization. The resulting Brier score of 0.095 indicates good calibration of the predictive model.
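For reference, the Brier score is the mean squared difference between the predicted probability and the observed 0/1 outcome: 0 corresponds to perfect predictions, and 0.25 is what a constant forecast of 0.5 would score. A minimal sketch with invented outcomes and probabilities (not the study data):

```python
def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(y_true) != len(probabilities) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, probabilities)) / len(y_true)

# Invented outcomes (1 = ARF) and predicted probabilities for six patients.
y_true = [1, 1, 1, 1, 1, 0]
probabilities = [0.9, 0.8, 0.7, 0.9, 0.6, 0.2]
score = brier_score(y_true, probabilities)
```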
Comparison of the metrics of the two models shows substantial differences in their effectiveness. The MLM based on Random Forest achieved an accuracy of 92.2%, indicating that it correctly classified the majority of observations, whereas the logistic regression model (LRM) reached a markedly lower accuracy of 74.2%. The AUC of 0.929 for the MLM indicates excellent ability to distinguish between positive and negative classes, whereas the AUC of 0.70 for the LRM indicates only moderate discrimination. The recall (sensitivity) of the MLM was 100%, demonstrating its ability to detect positive cases. The LRM showed much lower sensitivity at 38.7%, indicating a high probability of missing positive cases of ARF. Regarding specificity, the MLM achieved 86.7%, while the LRM reached 100%, meaning that it correctly classified all negative cases. However, despite its high specificity, the LRM had a low F1-score of 0.558, reflecting the imbalance between its high precision and low recall (many false negatives), whereas the MLM's F1-score of 0.913 is very high.
A precision of 1.0 combined with a recall of 0.387 arises when a model makes highly accurate positive predictions but misses a large share of true positive cases. This happens when the model predicts the positive class only when it is highly confident, so that there are no false positives (FP = 0). This explains the high precision of the LRM: all positive predictions made by the model are correct. However, the model makes few positive predictions overall and therefore fails to identify many actual positive cases (it has many false negatives, FN), resulting in low recall. Here, a recall of 0.387 means that the model identified only 38.7% of all true positive cases. The confusion matrices allow a more detailed analysis of the quality of the models' predictions (Fig. 5).
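The relationship between these metrics can be checked with a short sketch. The confusion-matrix counts below are hypothetical and chosen only to show a classifier with no false positives and many false negatives; the last two lines recompute the F1-score from the precision and recall values reported above. The helper names are illustrative, and nonzero denominators are assumed.

```python
def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (positive class = ARF)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts: no false positives, many false negatives.
cautious = classification_metrics(tp=12, fp=0, tn=43, fn=19)

# F1 recomputed from the precision and recall values reported in the text.
f1_lrm = f1_from(1.000, 0.387)
f1_mlm = f1_from(0.840, 1.000)
```

Rounded to three decimals, the recomputed values agree with the reported F1-scores of 0.558 for the LRM and 0.913 for the MLM.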
The comparison shows that the MLM outperforms the LRM on AUC, accuracy, recall, and F1-score, while the LRM is superior only in specificity and precision. The machine learning model demonstrates high classification efficiency, whereas logistic regression, despite its high specificity and precision, exhibits insufficient recall and overall effectiveness. Thus, the MLM shows more balanced and higher performance than the LRM, making it preferable for classification tasks in this context.
We analyzed the possible reasons for the differences in the effectiveness of the two models we developed:
- Complexity differences: Random Forest is an ensemble method that uses multiple decision trees to improve predictions, allowing the model to better capture complex dependencies in the data, whereas logistic regression may be too simplistic for this task.
- Small sample size: the samples used for training (42 individuals) and testing (12 individuals) are relatively small. In such conditions, Random Forest may generalize better due to its structure, while logistic regression may suffer from overfitting or insufficient generalization ability.
- Non-parametric nature of Random Forest: this method does not assume linear relationships between variables, making it more flexible when working with various types of data.
Several studies have proposed predictive models for diagnosing postoperative pulmonary complications in pediatric cardiac surgery using lung ultrasound. In the study by Cantinotti M. et al., two predictive models were created to predict successful tracheal extubation and the duration of ICU stay [19]. The study included children of all age groups (from newborns to 18 years), with both radical and palliative correction of congenital heart defects (CHD), and lung ultrasound was performed exclusively in the postoperative period. Moreover, lung assessment was conducted using a 36-point scale that considered only the interstitial syndrome and alveolar consolidation syndrome with summation of scores; other ultrasound syndromes were not evaluated. Postoperative ARF was not the primary endpoint and was not determined. Additionally, the effectiveness of the predictive models developed by the authors was not assessed. Other predictive models have also been proposed for postoperative pulmonary complications, taking into account lung ultrasound at the preoperative and postoperative stages [17], and for tracheal extubation failure, based on measurement of the diaphragmatic thickening fraction [20]; however, their effectiveness was not evaluated on a test sample of patients, which raises doubts about the generalization ability of these models.
Several predictive models for forecasting pulmonary complications in pediatric cardiac surgery have been published that involve large numbers of patients, are based on machine learning algorithms, and were evaluated on a test sample [33, 34]. For example, one model used a machine learning algorithm based on demographic data, surgical characteristics, and intraoperative arterial pressure measurements in 1,964 patients and reached an AUC of 0.785 [33]; another predicted the development of pneumonia in pediatric cardiac surgery, included 23,000 patients, and reached an AUC of 0.929 [34]. However, lung ultrasound was not performed or considered in these studies.
There are data suggesting that inclusion of BNP [35] and cystatin C [36] levels in the predictive model could further increase its effectiveness, but this requires additional financial costs and access to specialized laboratories.
Thus, the scientific and practical novelty of our study lies in the creation of two unique predictive models for postoperative ARF in infants after radical cardiac surgeries based on lung ultrasound and clinical data, where all main ultrasound syndromes were taken into account, and the effectiveness metrics of the created predictive models — accuracy, recall (sensitivity), specificity, precision, F1-score, AUC — were studied on a test sample of patients.
The predictive model based on the machine learning algorithm (MLM) demonstrated higher effectiveness and was named “LUCH-D” — Lung Ultrasound in Congenital Heart — Disease.
This study has several limitations:
- The single-center character of the study, as well as the performance of lung ultrasound by one researcher at all stages.
- Assessment of ARF was conducted at the stage of transfer from the ICU to the specialized department, which could have been performed on different postoperative days for different patients, potentially negatively affecting model effectiveness.
- The small sample size, which could influence the effectiveness and generalization ability of the predictive models, as well as the representativeness of the results. Further studies with larger patient cohorts are necessary.
The results of our study demonstrate the high importance of incorporating lung ultrasound data into the predictive model for forecasting the risk of ARF in patients during the postoperative period. The predictive models were trained and tested on a relatively small number of patients; future studies could focus on improving the models and adapting them to larger samples, which would enhance prediction accuracy and reliability. Additionally, our results confirm the feasibility of applying machine learning methods in medical practice for predicting various outcomes.
Conclusion
This study produced the predictive model “LUCH-D” (“Lung Ultrasound in Congenital Heart – Disease”), built with a machine learning algorithm, which predicted the development of postoperative ARF in infants after radical cardiac surgery with high effectiveness on the test sample, as well as a predictive model based on logistic regression. The models were developed and tested on a small sample of patients, which could have influenced the results and requires further research. Accounting for lung ultrasound parameters may enhance the effectiveness of predictive models. Assessment of the severity of ultrasound signs according to the presented protocol may be useful for planning and conducting studies devoted to developing predictive models that forecast respiratory complications in the postoperative period. This is a pilot study, and its results require validation on larger and more diverse patient samples. Future studies should include patients from different medical centers and consider simplifying the proposed model and further improving its prediction accuracy.
Disclosure. The authors declare no competing interests.
Author contribution. Pfeifer A.A. — development of the article concept, justification of scientific and practical significance, collection and analysis of empirical data, statistical data processing, writing and editing the article text, data visualization, review and approval of the article text. Miller A.Yu. — analysis of empirical data, data visualization, review and approval of the article text, statistical data processing. Gurchenko S.A. — development of the article concept, analysis of empirical data, independent search for sources (full-text Russian and English), review and approval of the article text. Ilinykh K.A. — development of the article concept, review and approval of the article text. Sakovich V.A. — review and approval of the article text. Gritsan A.I. — development of the article concept, justification of scientific significance, writing and editing the article text, review and approval of the article text.
Ethics approval. The research protocols and draft informed consent forms were approved by the local ethics committee of the Voino-Yasenetsky Krasnoyarsk State Medical University (protocol No. 122/2023 dated November 29, 2023).
Funding source. This study was not supported by any external sources of funding.
Data Availability Statement. The data supporting the conclusions of this study can be obtained from the corresponding author upon reasonable request. The data are not publicly available as they contain information that may endanger the confidentiality of the research participants.

