Introduction
The global outbreak of SARS-CoV-2 in 2019 resulted in a worldwide pandemic [1, 2]. The increase in COVID-19 patients and the resulting strain on CT departments [3] highlighted the importance of bedside ultrasound [4, 5] as a critical diagnostic tool for clinicians [5, 6]. However, like other ultrasound applications, the analysis and interpretation of specific lung ultrasound (US) findings remains subjective and experience-dependent [7]. Despite this, several studies [8, 9, 10] have demonstrated a high degree of consistency in results among lung ultrasound specialists with different levels of expertise. For example, De Molo C. et al. [11] reported excellent agreement between experienced and novice operators, with an intraclass correlation coefficient (ICC) of 0.975 (0.962-0.983).
The level of knowledge and skill necessary for novice lung ultrasound operators to accurately acquire and interpret images has been a topic of debate. Lerchbaumer M.H. et al. [8] suggested that specialized training may not be essential for interpretation of results, although expert review is often recommended. Rouby J.J. et al. [12] found that a brief, easy-to-implement training program in bedside lung ultrasound led by an experienced physician provided an adequate learning curve for residents with limited experience in evaluating lung ultrasound (LUS) patterns in critically ill patients.
Interrater agreement in assessing the severity of infiltration syndrome has been an important issue [8, 13], and Russell F.M. et al. [10] investigated this criterion for training novice operators. The study found that 11 scans were required to achieve proficiency, with no significant difference between trainees with no previous ultrasound experience and those with over 25 previous patient scans (p = 0.64). In contrast, Baker K. et al. [14] suggested that a 4-hour introductory lung ultrasound course was sufficient for good interrater agreement, contradicting previous research. Millington S.J. et al. [9] also suggested that operators should complete at least 50 lung ultrasound examinations before being considered pre-trained. Consequently, published recommendations on the amount of prior experience required vary considerably.
There is a consensus among researchers that using a protocol with a large number of zones is optimal [4]. Studies by Mento F. et al. [15] and Rouby J.J. et al. [12] indicated that the ideal protocol should include at least 10 and preferably 12 zones. The protocol of Smargiassi A. [16] includes 16 zones. This approach was adopted by De Molo C. et al. [11] and Kumar A. et al. [17], while Millington S.J. et al. [9] employed an 8-zone protocol and Gullett J. et al. [18] used a 10-zone protocol. A comparative study of protocols with different numbers of zones by Tung-Chen Y. et al. [19] showed that the 12-zone protocol yielded the highest intraclass correlation coefficient.
Variations have been noted in the LUS grading scales used. Some investigators, including Lerchbaumer M.H. et al. [8], De Molo C. et al. [11], and Fatima N. et al. [13], used a semi-quantitative approach with scores ranging from 0 to 3 for each zone. Others, such as Baker K. et al. [14], Kumar A. et al. [17], and Millington S.J. et al. [9], opted for a qualitative descriptive scale assessing specific parameters [18]. Russell F.M. et al. [10] used a quantitative method by calculating the total number of B-lines.
Despite the diversity of approaches, all studies reported high interrater agreement across different lung ultrasound protocols. However, no consensus has been reached on the optimal lung ultrasound protocol and scale for both experienced and novice operators. To address the challenge of subjectively scoring abnormal lung ultrasound artifacts (B-lines), researchers developed the LUS NMHC (Russian abbreviation for National Medical Surgical Center) scale. This scale assesses the percentage of the visual field occupied by A-line artifacts indicative of aeration [20], aiming to provide a more objective evaluation. A full description of this protocol is available in the original publication [20].
Objectives
To evaluate the interrater agreement of a 16-zone protocol using the semiquantitative LUS scale and the original LUS NMHC in intensive care patients with coronavirus infection (COVID-19) during lung ultrasound monitoring by expert and novice operators. We hypothesize that a 16-zone protocol using the semiquantitative LUS and LUS NMHC scales will demonstrate high interrater agreement among anesthesiologists and intensivists with different levels of experience in lung ultrasound.
Material and methods
A two-center retrospective cohort study was conducted to evaluate the interrater agreement of lung ultrasound protocols in intensive care patients with COVID-19. The study was conducted from March 6, 2020 to December 22, 2021 at the Pirogov National Medical and Surgical Center and Moscow Multidisciplinary Clinical Center "Kommunarka".
Study participants were divided into two cohorts. The first cohort, consisting of 18 patients, was examined by two operators: a novice trained in lung ultrasound and an expert. They performed the examination together, discussing and reaching consensus in cases of disagreement. The novice, who had received training in lung ultrasound techniques, independently scanned 10 patients according to a protocol. Both examiners were board-certified in critical care medicine and anesthesiology. The expert with more than 5 years of experience in pulmonary ultrasound was also certified in ultrasonography.
For each patient, ultrasound data and cine-loop recordings of each lung zone were stored in the machine's memory. After 30 days, both the expert and the novice independently reviewed these recordings and completed 16-zone LUS protocols. This process was repeated after 6 months, with both examiners independently reviewing the cine-loop recordings and scoring them according to the 16-zone LUS NMHC protocol.
In the second patient cohort (143 patients), the examination was performed independently by expert and novice operators on the same patients according to a 16-zone protocol using the original LUS NMHC score. The results of the protocols were compared.
The study included patients with COVID-19 who were initially admitted to the intensive care unit (ICU) and underwent both chest CT and LUS within 24 hours of admission. The diagnosis of COVID-19 was based on either a positive PCR test for SARS-CoV-2 or clinical manifestations with radiological signs of infection even with a negative PCR test on admission [21].
Inclusion criteria were:
- Patients presenting to the emergency department with signs of respiratory failure and suspected coronavirus pneumonia, later confirmed by laboratory or radiological examination.
- Patients treated in the intensive care unit (ICU) with a diagnosis of "COVID-19, virus identified" (U07.1) or "suspected COVID-19, virus not identified" (U07.2).
Exclusion criteria were:
- Patients aged less than 18 years
- Patients admitted to the ICU with a diagnosis of clinical death
- Patients with CT evidence of bacterial pneumonia.
Lung ultrasound protocols and scales
The ultrasound devices used in the study were the Sonosite Edge II (Fujifilm Sonosite, USA) and the Logiq E (GE HealthCare, China), both equipped with a convex transducer. The basic mode was set to abdominal, with a scan depth of 11–13 cm.
To compare the differences between different types of lung ultrasound scales in terms of interrater agreement, a 16-zone [22, 23] protocol with two types of scores (LUS and original LUS NMHC) was tested:
- 16-zone LUS. This type of score is based on the scoring of the leading ultrasound sign (number of B-lines, thickened pleura or irregular pleural line, consolidations, etc.) and is well documented in the literature [24, 25, 26, 27].
- 16-zone LUS NMHC. Original protocol in which the score is based on the proportion of the total visual field occupied by artifacts from the aerated lung (A-lines) (fig. 1, table 1).
Fig. 1. Lung diagram for the 16-zone LUS and 16-zone LUS NMHC protocols. Note: MAL, middle axillary line; PAL, posterior axillary line.
| Points | LUS score | LUS NMHC score |
|---|---|---|
| 0 | Normal lung profile with no pleural abnormalities. Scattered (< 3) B-lines possible | The A-lines occupy 100 % of the examined area, and up to two B-lines per field of view are acceptable. B-lines cannot be coalescent or brightened, and A-lines should be clearly visible against their background. |
| 1 | Moderate interstitial syndrome, up to 5 B-lines in the field of view. Abnormal pleural line. | A-lines occupy > 50 % of the intercostal spaces in the field of view OR A-lines occupy 100 % of the scan with multiple B-lines that are clearly visible against the background of the A-lines |
| 2 | Significant interstitial syndrome, subpleural consolidations less than 10 mm | A-lines occupy < 50 % of the intercostal spaces in the field of view OR B-lines to A-lines ratio is 1 : 1 with the presence of subpleural consolidation less than 15 mm * |
| 3 | Large consolidation greater than 10 mm | Large consolidation greater than 15 mm with or without pleural effusion |
Statistical analysis
Data collection and primary analysis were performed using Microsoft Office Excel 2019 software. Descriptive statistics of the quantitative data are reported as Me (Q1; Q3), where Me is the median, Q1 is the first quartile (25th percentile), and Q3 is the third quartile (75th percentile). Frequency values are given as N (%), where N is the absolute number of cases in the group, and % is the percentage of cases in the group. The Shapiro-Wilk test was used to test the normality of data distribution. Due to the non-normal distribution of most evaluated parameters, group comparisons were performed using the non-parametric Mann–Whitney U test for unrelated quantitative variables and the Wilcoxon signed-rank test for related variables. When comparing three or more dependent samples with quantitative data, a two-factor Friedman rank analysis of variance for related samples was employed. For frequency variables, unrelated groups were compared using the chi-square test or Fisher's exact test (when the frequency of the outcome was less than 10 %). Spearman's rank correlation coefficient was used to determine the strength of the relationship between parameters. Steiger's Z-test was used to determine the presence of significant differences between the two correlations. The two-tailed significance level was set at p = 0.05.
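As an illustration of the rank correlation mentioned above, the following is a minimal sketch of Spearman's coefficient computed as the Pearson correlation of average ranks. It is not the SPSS procedure used in the study; the function names and the six pairs of scores are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical total scores for six patients (not study data).
expert_scores = [12, 5, 21, 8, 30, 17]
novice_scores = [13, 6, 19, 8, 28, 20]
rho = spearman_rho(expert_scores, novice_scores)
print(f"Spearman rho = {rho:.3f}")
```

Averaging the ranks of tied values matters here because total LUS scores are integers and ties between patients are common.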
To determine the prognostic quality of the quantitative predictors, ROC analysis was performed by plotting the specificity and sensitivity and calculating the area under the curve (AUC) and its 95% confidence interval (CI). The optimal cut-off point (Youden index), along with specificity (Sp) and sensitivity (Se) values, was calculated using MedCalc 20.305 software.
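The ROC quantities named above can be sketched as follows. This is an illustrative stdlib-only calculation, not the MedCalc implementation: the function names, the rule that a case is positive when its score is at or above the threshold, and the eight score/outcome pairs are all assumptions made for the example. The confidence interval for the AUC is omitted.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each candidate cut-off.

    A case is called positive when score >= threshold (an assumption here).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / pos, tn / neg))
    return points


def youden_cutoff(scores, labels):
    """Pick the threshold that maximizes Youden's J = Se + Sp - 1."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores and binary outcomes (not study data).
scores = [4, 7, 9, 12, 15, 18, 22, 27]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, se, sp = youden_cutoff(scores, labels)
area = auc(scores, labels)
print(f"cut-off = {cutoff}, Se = {se:.2f}, Sp = {sp:.2f}, AUC = {area:.3f}")
```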
Statistical analysis of the study data was performed using IBM SPSS Statistics for Windows, Version 27.0.1 (Armonk, NY: IBM Corp). The Microsoft Office Excel 2019 software platform was employed for visualization and tabular presentation of results.
Study results
A total of 161 hospitalized patients with COVID-19 who underwent both chest CT and lung ultrasound on admission were included in the study. Their demographic and clinical data are summarized in table 2. The mean age was 69.2 ± 14.6 years and 67 patients (41.6%) were male.
| Parameter | Total (n = 161) |
|---|---|
| Males, n (%) | 67 (41.6) |
| Age, years | 69.2 ± 14.6 |
| 16-zone LUS, n (%) | 18 (11.2) |
| 16-zone LUS NMHC, n (%) | 143 (88.8) |
To assess the consistency of ultrasound findings, the first cohort of 18 patients was analyzed. A two-factor Friedman rank analysis of variance and a related-samples ANOVA were used to evaluate potential differences in results between ultrasound operators using two protocols: 16-zone LUS and 16-zone LUS NMHC. The Shapiro-Wilk test confirmed that the variables were normally distributed. The analysis revealed no significant differences in ultrasound data among different operators. For the 16-zone LUS protocol, p-values were 0.982 (ANOVA) and 0.220 (Friedman test), while for the 16-zone LUS NMHC protocol, p-values were 0.988 (ANOVA) and 0.058 (Friedman test). These results, as shown in table 3, indicate the homogeneity of the ultrasound findings across operators.
| Parameter | M ± sd | Me (Q1; Q3) | Normality test (Shapiro–Wilk test) | ANOVA | Friedman test |
|---|---|---|---|---|---|
| Expert: revision of LUS | 16.7 ± 2.5 | 15 (8; 28) | 0.436 | 0.982 | 0.220 |
| Novice: revision of LUS | 17.1 ± 2.5 | 17 (7; 27) | 0.171 | | |
| Baseline: expert + novice LUS | 17.4 ± 2.5 | 17 (9; 29) | 0.443 | | |
| Expert: revision of LUS NMHC | 15.2 ± 2.4 | 12 (7; 26) | 0.199 | 0.988 | 0.058 |
| Novice: revision of LUS NMHC | 15.7 ± 2.4 | 13 (7; 24) | 0.329 | | |
| Baseline: expert + novice LUS NMHC | 15.4 ± 2.4 | 12 (7; 27) | 0.276 | | |
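For readers unfamiliar with the Friedman test used for the three related ratings (expert revision, novice revision, joint baseline), a minimal sketch of the statistic is given below. It is an illustration only and not the SPSS routine: it handles exactly three ratings per patient, applies no tie correction, uses the chi-square approximation (which is rough for small samples), and the six rows of scores are hypothetical.

```python
from math import exp


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_three_ratings(rows):
    """Friedman chi-square for k = 3 related ratings per patient.

    rows: one (rating_a, rating_b, rating_c) tuple per patient.
    Returns (chi2, p). With k = 3 the statistic has 2 degrees of freedom,
    for which the chi-square survival function is exactly exp(-chi2 / 2).
    """
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    for row in rows:
        # Ranks are assigned within each patient, across the three ratings.
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, exp(-chi2 / 2.0)


# Hypothetical totals: (expert revision, novice revision, baseline) per patient.
rows = [(15, 17, 16), (8, 7, 9), (28, 27, 29), (12, 13, 11), (20, 22, 21), (10, 9, 11)]
chi2, p = friedman_three_ratings(rows)
print(f"Friedman chi2 = {chi2:.3f}, p = {p:.3f}")
```

Because ranks are assigned within each patient, the test is insensitive to how severe a given patient's disease is and looks only at whether one rater tends to score systematically higher or lower than the others.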
Pearson correlation analysis was performed to determine the strength of the correlation between baseline and post-revision ultrasound results obtained by operators with different levels of experience. All correlations were found to be statistically significant with p < 0.001 and strong correlation coefficients R > 0.9 (table 4), regardless of the protocol used (16-zone LUS or LUS NMHC).
Pearson correlation analysis was also performed to determine the strength of the correlation between baseline ultrasound findings and revisions by operators with different levels of experience using the 16-zone LUS and LUS NMHC protocols, along with the percentage of CT lung involvement. All correlations were statistically significant with p < 0.001 (table 4) and strong correlation coefficients R > 0.9.
| Parameter | Pearson correlation Expert: Operator 1, p-value; R (95% CI) | Pearson correlation Expert: Baseline, p-value; R (95% CI) | Pearson correlation Operator 1: Baseline, p-value; R (95% CI) | Pearson correlation with CT scan, p-value; R (95% CI) |
|---|---|---|---|---|
| Expert: revision of LUS | p < 0.001; R = 0.975 (0.935; 0.991) | p < 0.001; R = 0.995 (0.986; 0.998) | p < 0.001; R = 0.983 (0.954; 0.993) | p < 0.001; R = 0.915 (0.783; 0.968) |
| Novice: revision of LUS | | | | p < 0.001; R = 0.935 (0.832; 0.976) |
| Baseline: expert + novice LUS | | | | p < 0.001; R = 0.915 (0.782; 0.968) |
| Expert: revision of LUS NMHC | p < 0.001; R = 0.995 (0.986; 0.998) | p < 0.001; R = 0.997 (0.992; 0.999) | p < 0.001; R = 0.992 (0.978; 0.997) | p < 0.001; R = 0.925 (0.807; 0.972) |
| Novice: revision of LUS NMHC | | | | p < 0.001; R = 0.938 (0.838; 0.977) |
| Baseline: expert + novice LUS NMHC | | | | p < 0.001; R = 0.928 (0.812; 0.973) |
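The 95% confidence intervals reported with each correlation coefficient can be approximated with the Fisher z-transformation, sketched below. Whether the statistical software used in the study applied exactly this method is an assumption; the function name is hypothetical, and the example uses the first coefficient in table 4 with the first-cohort sample size of 18.

```python
from math import atanh, sqrt, tanh


def pearson_ci95(r, n):
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation."""
    z = atanh(r)                         # transform r to an approximately normal scale
    half_width = 1.959964 / sqrt(n - 3)  # normal quantile times the standard error of z
    return tanh(z - half_width), tanh(z + half_width)


# R = 0.975 for the expert vs. operator 1 comparison, n = 18 patients.
low, high = pearson_ci95(0.975, 18)
print(f"95% CI: ({low:.3f}; {high:.3f})")
```

With only 18 patients the interval is noticeably asymmetric around R, which is why the lower bounds in table 4 lie further from the point estimates than the upper bounds.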
To compare the ultrasound results obtained with the 16-zone LUS and LUS NMHC protocols, the medians of the groups were compared using the Wilcoxon signed-rank test and the paired samples test (table 5). A significant difference between the LUS and LUS NMHC results was found for the baseline examination (expert along with novice) and for the expert revision. Only for the revision performed by the novice operator was there no significant difference between the medians of the 16-zone LUS and LUS NMHC protocols.
| Parameter | LUS, Me (Q1; Q3) | LUS NMHC, Me (Q1; Q3) | Wilcoxon signed rank test for related samples | Paired sample test | Pearson correlation (p-value; R correlation coefficient) LUS - LUS NMHC |
|---|---|---|---|---|---|
| Expert: revision | 15 (8; 28) | 12 (7; 26) | 0.021 | 0.016 | p < 0.001; R = 0.971 (0.925; 0.989) |
| Novice: revision | 17 (7; 27) | 13 (7; 24) | 0.189 | 0.133 | p < 0.001; R = 0.939 (0.845; 0.977) |
| Baseline: expert + novice | 17 (9; 29) | 12 (7; 27) | 0.008 | 0.007 | p < 0.001; R = 0.966 (0.911; 0.987) |
Steiger's Z-test was used to determine significant differences between the two correlations. Since the P-value for correlations with CT of the 16-zone LUS and LUS NMHC protocols was greater than 0.05 (table 6), it can be concluded that there is no significant difference in interrater agreement for these protocols.
| Parameter | Difference between LUS and LUS NMHC correlations with CT (p-value; z) |
|---|---|
| Expert: revision | 0.658; 0.442 |
| Novice: revision | 0.912; 0.111 |
| Baseline: expert + novice | 0.463; 0.734 |
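A minimal sketch of a test for two dependent correlations that share one variable (here, CT) is shown below. It uses the pooled-correlation form of Steiger's (1980) Z; that this is the exact variant computed in the study is an assumption, and the function name is hypothetical. The example takes the expert revision values reported in tables 4 and 5 (LUS vs. CT, LUS NMHC vs. CT, LUS vs. LUS NMHC) with n = 18.

```python
from math import atanh, erfc, sqrt


def steiger_z(r_xy, r_xz, r_yz, n):
    """Compare two dependent correlations r_xy and r_xz that share variable x.

    r_yz is the correlation between the two non-shared variables.
    Returns (z, two-sided p-value from the standard normal distribution).
    """
    r_bar = (r_xy + r_xz) / 2.0
    # Estimated covariance term between the two correlations, using the pooled r.
    psi = r_yz * (1 - 2 * r_bar**2) - 0.5 * r_bar**2 * (1 - 2 * r_bar**2 - r_yz**2)
    c = psi / (1 - r_bar**2) ** 2
    z = (atanh(r_xy) - atanh(r_xz)) * sqrt(n - 3) / sqrt(2 - 2 * c)
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Expert revision: R(LUS, CT) = 0.915, R(LUS NMHC, CT) = 0.925, R(LUS, LUS NMHC) = 0.971.
z, p = steiger_z(0.915, 0.925, 0.971, 18)
print(f"z = {z:.3f}, p = {p:.3f}")
```

The sign of z depends only on which correlation is entered first. With these rounded inputs the magnitude and p-value come out close to, but not identical with, the expert row of table 6, as would be expected if the published coefficients were rounded before reporting.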
For patients in the second cohort, Spearman rank correlation analysis was also performed between ultrasound scans (16-zone LUS protocol) obtained by different operators and the percentage of CT lung involvement. All correlations were statistically significant with p-values less than 0.001 (table 7) and correlation coefficients R > 0.9 (table 8), indicating a strong positive correlation.
| Parameter | Percentage of lung involvement (CT) | 16-zone LUS protocol, baseline | 16-zone LUS protocol, expert | 16-zone LUS protocol, novice |
|---|---|---|---|---|
| Percentage of lung involvement (CT) | NA | < 0.001* | < 0.001* | < 0.001* |
| 16-zone LUS protocol, baseline | < 0.001* | NA | < 0.001* | < 0.001* |
| 16-zone LUS protocol, expert | < 0.001* | < 0.001* | NA | < 0.001* |
| 16-zone LUS protocol, novice | < 0.001* | < 0.001* | < 0.001* | NA |
| Parameter | Percentage of lung involvement (CT) | 16-zone LUS protocol, baseline | 16-zone LUS protocol, expert | 16-zone LUS protocol, novice |
|---|---|---|---|---|
| Percentage of lung involvement (CT) | NA | 0.901 | 0.903 | 0.932 |
| 16-zone LUS protocol, baseline | 0.901 | NA | 0.995 | 0.977 |
| 16-zone LUS protocol, expert | 0.903 | 0.995 | NA | 0.972 |
| 16-zone LUS protocol, novice | 0.932 | 0.977 | 0.972 | NA |
To determine the agreement of ultrasound results between operators in the second cohort (16-zone LUS NMHC protocol), a two-factor Friedman rank analysis of variance was performed on paired samples. No significant difference (p = 0.181) was detected between the ultrasound results of the three operators using the 16-zone protocol (table 9).
| LUS operator | Me (Q1; Q3) | Number of cases | p-value, Friedman test |
|---|---|---|---|
| 16-zone LUS, expert | 12 (5; 21) | 142 | 0.181 |
| 16-zone LUS, operator 2 | 8.5 (6; 16) | 26 | |
| 16-zone LUS, operator 3 | 9.5 (4; 17) | 6 | |
To determine the strength of correlation between ultrasound results obtained by different operators, correlation analysis was performed using the Spearman test. All correlations were statistically significant (p < 0.001) and strong (R > 0.9).
Discussion
Although most authors emphasize the good interrater agreement in lung ultrasound between operators with different experience [8, 11, 17], there is still no consensus regarding the optimal protocol and LUS score.
All investigators agree that a protocol with a sufficient number of zones, most likely at least 12, should be used [11, 17, 19]. We used a 16-zone protocol with two scales: LUS and LUS NMHC.
In the first cohort of patients, we performed a concordance analysis with cine-loop recording for each lung zone. The first study was performed jointly by an expert and a novice operator, and the score was assigned by consensus. In this case, to overcome the problem of recording low-quality cine-loops, the technical aspect of lung ultrasound was performed by the expert [8, 9]. Subsequently, the cine-loops were independently reviewed by the expert and the novice operator and scored according to the LUS scale.
Another examined score using the 16-zone protocol was LUS NMHC. The same cine-loops were reviewed by an expert and a novice operator and scored six months later using the LUS NMHC system.
Correlation analysis showed that the baseline and revision results of the expert and novice operators had a significant strong correlation (p < 0.001, R > 0.9), regardless of the score (LUS or LUS NMHC). The same results were obtained when determining the correlation of expert and novice operator data with the percentage of lung involvement on CT.
When the medians of the LUS and LUS NMHC scores were compared using the Wilcoxon test and the paired sample test, a significant difference between the two scores was found both for the examination performed jointly by the expert and the novice operator and for the expert revision. Based on these data, we believe that these two scores are not interchangeable, although there are general similarities in the number of items and scoring criteria. No statistically significant difference between the scores was observed for the novice operator revision. It is likely that less operator experience has an impact on the results. Steiger's Z-test was used to compare correlations between LUS and LUS NMHC scores and CT results for the baseline study and the expert and novice operator revisions. The p-value was greater than 0.05, indicating that there is no statistically significant difference in interrater agreement for these scores. Both 16-zone protocols, LUS and the original LUS NMHC, have high interrater agreement.
To validate the efficacy of the 16-zone LUS NMHC protocol in the second cohort of patients, we performed correlation and variance analyses to compare ultrasound findings from different operators with CT-determined percentages of lung tissue involvement. The results showed statistically significant correlations with P values less than 0.001 and correlation coefficients (R) greater than 0.9. In addition, analysis of variance demonstrated no significant differences between the results of different operators (p = 0.181).
The development of the LUS NMHC scale faced a known challenge [8, 13, 14] in distinguishing between scores of 1 and 2, which indicate the severity of the infiltration syndrome. We propose that quantifying the percentage of A-lines in the visual field will enable more accurate grading by operators with varying experience levels, rather than relying solely on the number of abnormalities. Additionally, quantifying A-lines may help differentiate the severity of the infiltration syndrome, particularly in cases where B-lines may or may not be visible alongside A-lines. The existing literature has not addressed how to assign scores of 1 or 2 in such situations.
It is still uncertain what level of knowledge and skill is necessary for a novice to perform lung ultrasound effectively. In our study, the novice operator received 1.5 hours of pre-training and then performed 10 pulmonary ultrasound protocols under expert supervision. This training was sufficient to achieve results indicative of interrater agreement, a conclusion supported by other investigators [8, 10, 12].
The results of our study supported the hypothesis that a 16-zone protocol using semiquantitative LUS and LUS NMHC scales would demonstrate high interrater agreement among anesthesiologists and intensivists with varying levels of experience in lung ultrasound. We found that regardless of the scoring method used, the correlation between operator scores exceeded 0.9 (p < 0.001), further confirming our hypothesis of high interrater agreement.
Conclusion
The results of this study confirm the high interrater agreement of both the semi-quantitative LUS scale described in the literature and the original LUS NMHC scale. This suggests that semi-quantitative lung ultrasound scales, as opposed to quantitative scales that rely on direct calculation of artifacts per field of view, can effectively support continuous patient monitoring by intensivists with varying levels of lung ultrasound experience. In addition, our results suggest that a short training period can enable novice lung ultrasound operators to achieve consistent patient monitoring results across specialists.
Disclosure. The authors declare no competing interests.
Authors' contributions. All authors, according to ICMJE criteria, were involved in the conception of the article, acquisition and analysis of data, drafting and editing of the article, and review and approval of the article.
Ethics approval. This study was approved by the local ethics committee of the Pirogov National Medical and Surgical Center, Moscow, Russia (reference number: 11-26.10.2021).
Funding source. This study was not supported by any external funding sources.
Data availability statement. Data supporting the findings of this study are available from the corresponding author upon reasonable request. Data are not publicly available due to privacy concerns and the potential risk of compromising participant confidentiality.

