Introduction

Respiratory muscle strength and endurance are important predictors of lung function, and respiratory muscle weakness may contribute to the development of chronic respiratory insufficiency.1,2 This muscle weakness can also cause dyspnea and exertion intolerance.3

Several methods for assessing respiratory muscle performance (strength and endurance) have been developed, including voluntary (volitional) and involuntary tests.4 According to the American Thoracic Society and the European Respiratory Society’s (ATS/ERS) statement on Respiratory Muscle Testing, the principal advantage of volitional tests is that they give an estimate of inspiratory or expiratory muscle strength, are simple to perform, and are well tolerated by patients.1 However, it can be difficult to ensure that the subject is making a truly maximal effort.1

The measurement of maximal inspiratory pressure (MIP) is a classic volitional test widely used in clinical practice to evaluate inspiratory muscle strength.1,4,5 Despite wide clinical use, the maneuver is not intuitive, and a low value might not mean weakness but rather a lack of compliance.5,6

The search for a method of measuring inspiratory muscle strength that would overcome the limitations of MIP resulted in the proposal to measure nasal inspiratory pressure during sniffing.7,8 Despite the advantage of being available in most health centers, this method may underestimate muscle strength in patients with upper airway dysfunction and should be used with caution in those with nasal obstruction.5,6 Another volitional test, the Test of Incremental Respiratory Endurance (TIRE), provides a comprehensive assessment of inspiratory muscle performance by measuring MIP over time. Integrating MIP over the inspiratory duration yields the sustained maximal inspiratory pressure (SMIP).8 The SMIP has therefore been described as single-breath inspiratory work capacity and represents single-breath work/endurance.9
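As an illustration of what integrating pressure over inspiratory duration means, the following minimal sketch approximates the area under a sampled pressure-time curve with the trapezoidal rule. The function name, the sample values, and the use of pressure magnitudes in cmH2O with time in seconds are assumptions made for this example; commercial TIRE devices may sample, scale, and report the result differently.

```python
def pressure_time_integral(times_s, pressures_cmh2o):
    """Approximate the area under a pressure-time curve (trapezoidal rule).

    times_s: sample times in seconds, strictly increasing (assumed).
    pressures_cmh2o: inspiratory pressure magnitudes in cmH2O (assumed
        positive; measured inspiratory pressures are often negative).
    Returns the area in cmH2O*s.
    """
    if len(times_s) != len(pressures_cmh2o):
        raise ValueError("times and pressures must have the same length")
    if len(times_s) < 2:
        raise ValueError("at least two samples are required")

    area = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt <= 0:
            raise ValueError("sample times must be strictly increasing")
        # Trapezoid between two consecutive samples.
        area += 0.5 * (pressures_cmh2o[i] + pressures_cmh2o[i - 1]) * dt
    return area


# Hypothetical single sustained inspiration sampled once per second.
example_times = [0.0, 1.0, 2.0, 3.0, 4.0]
example_pressures = [0.0, 80.0, 60.0, 40.0, 0.0]
example_area = pressure_time_integral(example_times, example_pressures)
```

In this hypothetical trace the peak sample (80 cmH2O) corresponds to the single-point MIP, whereas the integral also reflects how long the pressure was sustained.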

It is essential that diagnostic tests for evaluating respiratory muscle strength and endurance have proven reliability, feasibility, and validity if they are to be used in practice with confidence.10 This review aims to synthesize studies that evaluated the psychometric properties of volitional tests used to measure respiratory muscle strength and endurance. To our knowledge, no systematic review has synthesized studies evaluating these psychometric properties, and this study will close this knowledge gap.

Methods

This systematic review was completed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.11

Data sources and searches

MEDLINE/PubMed, LILACS, Cochrane Central Register of Controlled Trials, Scopus and SciELO were searched for relevant studies from inception through June 2024 with no language restrictions. A standard protocol was developed for searches, and whenever possible, controlled vocabulary was used (e.g., MeSH terms for MEDLINE). Keywords included: Reproducibility of results, Maximal respiratory pressures, Respiratory muscles, Muscle Strength. Synonyms were used to increase the sensitivity of the search. The sensitivity-optimized search strategy developed by Higgins and Green was used to identify studies in MEDLINE/PubMed.12 The search strategy for MEDLINE/PubMed is shown in Figure 1. For ongoing studies or when confirmation of any data or additional information was needed, the authors were contacted by email.

Figure 1. Flow diagram showing baseline screening and study selection and exclusion criteria.

Study selection

This systematic review included cross-sectional and analytical studies that evaluated the psychometric properties of volitional respiratory muscle strength and endurance tests. Studies were included regardless of their publication status or language. Studies were eligible for inclusion if they met the following criteria: a) included healthy or patient populations (respiratory, cardiac, neurological, musculoskeletal or systemic conditions); b) investigated the psychometric properties of volitional tests that measure respiratory muscle strength and endurance. Review articles and studies that assessed non-volitional tests were excluded.

Two authors independently evaluated the titles and abstracts of identified studies. If at least one author considered a study as potentially relevant, the full text was evaluated by two additional authors for eligibility. In the event of a disagreement, the authors discussed the reasons for their decisions, and a final decision was made by consensus. Additional studies were sought by examining the included studies’ reference lists.

For this review, reliability was defined as a measure of the stability or consistency of a measure. We consider reliability as the degree to which the scores of people who have not changed are the same on repeated measures in various situations, including repetition on different occasions (test-retest reliability and intra-rater reliability), by different persons (inter-rater reliability), or in the form of different replicates (items) on a multi-item instrument (internal consistency).13 Thus, reliability is the ability to reproduce a consistent result in time and space or from different observers.13,14 Validity was considered as a property that defines the extent to which an instrument measures the construct it intends to measure (truthfulness).13,14 This property measures how well a new instrument compares to a well-established gold standard.15 Studies that presented different definitions or applicability of the concepts explored were excluded.

Data extraction

Two reviewers independently extracted data from included studies using a data extraction form.12 Characteristics including subjects’ age and gender, type of volitional test used, type of device, whether warm-up was performed, total test time, number of series, repetitions, stopping criterion, instruction and demonstration, screen incentive, psychometric properties (reliability and validity) and instrument calibration were extracted where available. Not all studies evaluated all psychometric properties.

In the case of reliability, a benchmark for intraclass correlation coefficients (ICCs) was used to interpret the reported values: ICC < 0.40 indicating poor, ICC 0.40–0.75 indicating fair to good, and ICC > 0.75 indicating excellent reliability.16 Regarding validity, Pearson correlations were categorized as high if > 0.70, moderate between 0.50 and 0.70 and low if < 0.50.16
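The interpretation thresholds above can be expressed as a short sketch. The function names are hypothetical, and classifying correlations by their absolute value (so that negative correlations are graded by magnitude) is an assumption of this example rather than something stated in the review.

```python
def classify_icc(icc: float) -> str:
    """Label an ICC with the benchmark used in this review."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


def classify_pearson(r: float) -> str:
    """Label a Pearson correlation as high, moderate, or low.

    Assumption: the magnitude |r| is classified, so r = -0.60 is
    treated the same as r = 0.60.
    """
    magnitude = abs(r)
    if magnitude > 0.70:
        return "high"
    if magnitude >= 0.50:
        return "moderate"
    return "low"
```

For instance, an ICC of 0.96 would be labeled "excellent", and a correlation of r = 0.399 would be labeled "low" under these cut-offs.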

Methodological quality of studies

The methodological quality of the included studies was assessed by two authors using the Critical Appraisal Tool (CAT) scale.17 The CAT scale is a specific tool to evaluate studies that tested psychometric properties and contains 13 evaluation items. Five of the items pertain to both validity and reliability issues, four to validity issues alone, and four to reliability issues only. Each study was rated “yes” when the information was described in sufficient detail or “no” when there was insufficient information to clarify. A final percentage rating column (%) was added based on the items each study achieved (% = (“yes” items x 100)/number of items scored). Studies were considered of high quality if they scored equal to or above 70%.17
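The percentage rating described above can be sketched as follows. The function names and the representation of item ratings as booleans are assumptions for illustration; the example assumes that all 13 CAT items were scored for the study.

```python
def cat_score(item_ratings: list[bool]) -> float:
    """Return the CAT percentage: ("yes" items x 100) / items scored.

    item_ratings holds one boolean per scored item (True = "yes").
    The result is rounded to one decimal place for display.
    """
    if not item_ratings:
        raise ValueError("at least one item must be scored")
    yes_count = sum(item_ratings)
    return round(yes_count * 100 / len(item_ratings), 1)


def is_high_quality(score: float, threshold: float = 70.0) -> bool:
    """Apply the review's cut-off: high quality if score >= 70%."""
    return score >= threshold


# Hypothetical study rated "yes" on 8 of 13 scored items.
example_score = cat_score([True] * 8 + [False] * 5)
```

Under these assumptions, 8 of 13 "yes" ratings give 61.5%, which falls below the 70% threshold, while 10 of 13 give 76.9%, which exceeds it.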

Results

The systematic literature search identified 1,078 unique studies. Thirty-five studies were considered potentially relevant and were retrieved for full-text review. Twenty-eight studies18–44 met the eligibility criteria. Figure 1 describes the results of the systematic literature search. The results of the critical assessment of the quality of the studies included are presented in Table 1. No study achieved the maximum score. Three studies scored above 70%, and four studies scored below 50%. The sample sizes of the included studies ranged from 10 to 544 individuals. Collectively, study subjects included 1,887 healthy participants (19 studies), 181 individuals with chronic obstructive pulmonary disease (COPD),5 46 with asthma,2 20 with non-cystic fibrosis bronchiectasis,1 20 with cystic fibrosis,31 72 with multiple sclerosis1 and 14 individuals with chronic ventilatory failure.1 Some studies evaluated more than one population. The characteristics of the included studies are provided in Table 2 (see Supplementary Material).

Table 1. Evaluation of the methodological quality of the studies with the CAT.
Study 1 2 3 4 5 6 7 8 9 10 11 12 13 %
1 Areias et al, 2020 x x x x x x x x 61.5
2 Goulart et al, 2020 x x x x x x x x x 69.2
3 Larribaut et al, 2020 x x x x 30.7
4 Basso-Vanelli et al, 2018 x x x x x x x 53.8
5 Formiga et al, 2018 x x x x x x x x 61.5
6 Silva et al, 2018
7 Silva et al, 2017 x x x x x x x x 61.5
8 Woszezenki et al, 2017 x x x x x x x x x x 76.9
9 Grams et al, 2015 x x x x x x x x 61.5
10 Jalan et al, 2015 x x x x x x x x x x 76.9
11 Langer et al, 2013 x x x x x x x x x 69.2
12 Moran et al, 2010 x x x x x x x x x x 76.9
13 Terzi et al, 2010 x x x x x x x x 61.5
14 Kamide et al, 2009 x x x x x x 46.1
15 Enright et al, 2006 x x x x x x x x 61.5
16 Romer et al, 2004 x x x x x x x x x 69.2
17 Domenech-Clar et al, 2003 x x x x x x x 53.8
18 Volianitis et al, 2001 x x x x x x x x 61.5
19 McConnell et al, 1999 x x x 23.1
20 Smeltzer et al, 1999 x x x x x x 46.1
21 Maillard et al, 1998 x x x x x x x x 61.5
22 Sette et al, 1997 x x x x x x x x 61.5
23 Larson et al, 1993 x x x x x x x x x 69.2
24 Carroll et al, 1992 x x x x x x x 53.8
25 Multz et al, 1990 x x x x x x x x 61.5
26 McElvaney et al, 1989 x x x x x x x x 61.5
27 Larson et al, 1987 x x x x x x x x 61.5
28 Curtis et al, 2024 x x x x x x x x 61.5

%: (Items “yes” x 100)/number of items scored; 1. If human subjects were used, did the authors give a detailed description of the sample of subjects used to perform the test? 2. Did the authors clarify the qualification, or competence of the rater(s) who performed the test? 3. Was the reference standard explained? 4. If interrater reliability was tested, were raters blinded to the findings of other raters? 5. If intrarater reliability was tested, were raters blinded to their own prior findings of the test under evaluation? 6. Was the order of examination varied? 7. If human subjects were used, was the time period between the reference standard and the index test short enough to be reasonably sure that the target condition did not change between the two tests? 8. Was the stability of the variable being measured taken into account when determining the suitability of the time interval between repeated measures? 9. Was the reference standard independent of the index test? 10. Was the execution of the test described in sufficient detail to permit replication of the test? 11. Was the execution of the reference standard described in sufficient detail to permit its replication? 12. Were withdrawals from the study explained? 13. Were the statistical methods appropriate for the purpose of the study? %: final percentage of validity or reliability.

Outcomes

The strength and endurance of the respiratory muscles were measured using several approaches: static maximum inspiratory pressure (MIP), static maximum expiratory pressure (MEP), dynamic maximum inspiratory pressure (S-index), sustained maximum inspiratory pressure (SMIP), nasal inspiratory pressure measured through two different types of sensors (SNIP and SNIFF), manual measurements of respiratory muscles (MMRM), inspiratory muscle endurance (IME) and maximal incremental inspiratory muscle performance (MIMP). Less frequently examined tests included the TIRE, which applied maneuvers to sustain maximal inspiration using an oxygen sensor; the PBU (pressure biofeedback unit), a device for reading biofeedback from respiratory muscles; and the inspiratory work capacity (IWC), which was described as a method of evaluating the maximum working peak of traditional MIP.

Reliability

Twenty-five studies examined the reliability of volitional tests used to measure respiratory muscle strength and endurance (see Table 3 in Supplementary Material). The reliability of MIP and MEP was analyzed in healthy participants in 12 studies,20,23–26,30–33,37,43,44 COPD patients in 2 studies,9,21 and asthmatics in one study.19 Some studies simultaneously evaluated MIP and MEP. The reliability of the S-index (one study), SNIP (three studies) and MIMP (one study) was analyzed only in healthy participants. The reliability of SMIP was analyzed only in patients with COPD (two studies). The reliability of the MMRM was analyzed only in asthmatic patients (one study). Some studies carried out very specific analyses. For example, the study by Grams et al. compared the validity of MIP values evaluated by the standard method versus the unidirectional expiratory valve method and determined that the mean MIP values measured by the unidirectional expiratory valve method were 14.3% higher (-117.3 ± 24.8 cmH2O) than the mean values of the standard MIP method (-102.5 ± 23.9 cmH2O).26

Studies that evaluated the test-retest reliability of volitional respiratory muscle strength and endurance tests reported outcomes that ranged from good to excellent, with the exception of some specific analyses. In general, MIP and MEP showed greater reliability across studies. Of the 12 studies that demonstrated good reliability, 9 evaluated MIP alone, 2 MEP alone and 1 study evaluated only the measurement of expiratory force. The maximum ICC value reported for MIP was 0.979 (CI 0.947–0.991), while the highest reported MEP ICC was 0.989 (CI 0.022–0.001).9,32 Manual evaluation of respiratory muscle strength, a less common method of assessment, showed its highest reliability for the anterior diaphragm, ICC 0.79 (CI 0.60–0.89), rectus abdominis, ICC 1.0 (CI 0.90–1.0), and lower intercostal muscles, ICC 0.81 (CI 0.50–0.87).19 The SMIP was evaluated in two studies. Basso-Vanelli et al., using the Power Breathe device, reported high reliability, with an ICC of 0.96 (CI 0.92–0.99).21 Formiga et al. reported an ICC of 0.994 (CI 0.986–0.998).9 The innovative biofeedback test (PBU) showed a good ICC of 0.89 (CI 0.66–0.95).23 The SNIP, evaluated in two studies, achieved its best result with an ICC of 0.92 (CI 0.91–0.94),30 while the capsule sensor manometer (CSPG-V) showed excellent results, with ICCs between 0.92 and 0.96.26

Validity

Nine studies examined the validity of volitional tests to measure respiratory muscle strength and endurance (see Table 4 in Supplementary Material). The validity of MIP was analyzed in healthy participants (two studies), COPD (two studies) and non-CF bronchiectasis subjects (one study). Formiga et al. evaluated the construct validity of the Test of Incremental Respiratory Endurance measures (MIP and SMIP) of inspiratory muscle performance in people with COPD.9 Convergent validity examined the degree to which MIP and SMIP were associated with specific COPD-related outcomes such as dyspnea and functional exercise capacity as measured by the 6-minute walk test (6MWT).9 The values presented were r = 0.399 (p = 0.006) and ID r = 0.413 (p = 0.004) for the correlation between 6MWT and SMIP, in addition to r = −0.322 (p = 0.019) and ID r = −0.320 (p = 0.019) for the correlation between the modified Medical Research Council (mMRC) dyspnea scale and SMIP. No significant correlation was found between vital capacity, distance covered in the 6MWT and measured MIP/MEP. Furthermore, no significant association was found between MIP and dyspnea; therefore, patients with more frequent signs of dyspnea do not necessarily have reduced inspiratory muscle strength.

The validity of the MEP was analyzed in healthy participants (one study) and patients with non-CF bronchiectasis (one study), both of which showed good statistical correlation in the presented data.23,28 The validity of the S-index (one study) and the validity of the SMIP (one study) were analyzed only in patients with COPD, showing moderate and weak correlations, respectively.9,18 Silva et al. evaluated the convergent validity of the two most traditional measures of respiratory muscle strength, MIP and MEP, with PBU, finding good validity values (MIP-PBU: r2 = 0.72; MEP-PBU: r2 = 0.75). In general, the MIP presented excellent validity, with robust data in most studies and r2 values between 0.74 and 0.84.18,38

Discussion

To our knowledge, this is the first systematic review on the reliability and validity of measuring respiratory muscle strength and endurance using volitional tests. Respiratory muscle strength and endurance are complementary facets of respiratory performance, each playing a crucial role in different contexts. Respiratory muscle strength refers to the ability of muscles, such as the diaphragm and intercostal muscles, to generate sufficient pressure to move large volumes of air rapidly and is essential in situations that require maximal respiratory effort, such as coughing or intense physical activity.18 In contrast, respiratory muscle endurance refers to the ability of these muscles to sustain respiratory activity over time without becoming fatigued and is vital for maintaining effective ventilation during prolonged exercise or everyday tasks that require constant breathing.21,34 Therefore, while respiratory muscle strength is crucial for rapid and intense responses, respiratory muscle endurance is indispensable for maintaining stable and efficient respiratory function in the long term.21 Some activities demand greater power and explosiveness from the respiratory muscles, yet these muscles must also sustain a constant basal level of activity to support vital functions. This requires fatigue-resistant muscle fibres capable of maintaining the minimum necessary ventilatory volumes and capacities.21

This review demonstrates that volitional tests used to assess respiratory muscle strength and endurance vary in their psychometric properties. Most of the included studies evaluated the psychometric properties of traditional tests such as MIP and MEP. Newer approaches to measuring muscle strength and endurance have been examined in fewer studies. The validity and reliability of traditional tests (MIP and MEP) appear to be more consistent in both healthy individuals and individuals with lung diseases when compared to newer tests, although there are still a limited number of studies in specific populations. Some tests less used in clinical practice, such as SMIP, still require further studies to assess their psychometric characteristics.9,18,23 Information on concurrent validity was restricted by the low number of studies.

The reliability of the tests used to measure respiratory muscle strength and endurance in healthy participants, patients with COPD, and patients with non-CF bronchiectasis was adequate, with moderate to high correlations with the measures presented in the studies. Regarding validity, volitional tests that simultaneously measure respiratory muscle strength and endurance showed weak to moderate correlations, whereas tests that evaluated strength in isolation showed moderate to high validity. This may be because strength measurement methods, such as maximal inspiratory and expiratory pressures, are more common and more widely replicated in clinical practice and field studies than endurance assessment methods.28

Several studies have attempted to measure the endurance of the respiratory muscles. Larribaut et al., in a healthy population, used an innovative approach that combined a resisted inspiration test with an isocapnic hyperpnea test. Both tests showed excellent inter- and intra-examiner reliability. Basso-Vanelli et al. used the Power Breathe K5 electronic device to perform a test of maximal inspiration sustained for a set period of time and found excellent reliability in patients with COPD, with ICCs between 0.92 and 0.99 (p < 0.001).20,21

Goulart et al. also used an unusual tool for testing respiratory function, applying manual measurement of muscle strength with feedback in asthmatic patients; however, inter-examiner agreement was uncertain for the intercostal muscles and for the posterior region of the diaphragm.19

Several studies evaluated maximal inspiratory muscle strength through incremental tests. Formiga et al. performed a comparison with the traditional method of measuring maximum inspiratory pressure, finding moderate to low convergent validity in both tests.9 Silva et al., in addition to assessing the inspiratory muscles, also measured maximum expiratory pressure and added a biofeedback measure (PBU) for comparison with traditional methods, finding acceptable values of convergent validity. Two other studies used a software-guided digital manometer to perform the tests, another three used a digital load-imposing device for incremental testing, and one study used an analog load-imposing device.21,27 Most studies clearly described the environment and isolation during the tests, a fundamental point for the quality of the assessments. Adverse events were not reported during maneuvers, and it is unclear whether they occurred.21,27

Moran et al. evaluated the validity of the traditional methods (MIP and MEP) of assessing respiratory muscle strength in patients with bronchiectasis and reported good correlation values between the measurements, as did McConnell et al., who found a correlation coefficient of 0.97 (p < 0.0001) for the measure of MIP.28 Sette et al., on the other hand, evaluated MIP in relation to time to exercise limitation (TLIM).38

In reviewing the included studies, the degree of heterogeneity across practices is evident. Variations in the types of protocols used, even when employing the same devices, make the selection of the most reliable and valid approach difficult. Such challenges may explain the tendency to use traditional tests to measure maximum respiratory pressure rather than incremental tests, which are more routinely used in specialized centers. Furthermore, the traditional MIP and MEP presented robust data regarding validity and reliability, which supports their applicability. Another factor that may justify this choice is that incremental tests take much longer to perform and demand greater understanding on the part of the patient, which makes them unfeasible in certain cases and leads the examiner to opt for the traditional MIP maneuver.

Among the studies included in this review, the time interval between reliability test measurements was extremely heterogeneous, ranging from 30 minutes to one week between assessments. The study population in which the tests were administered was also varied, including individuals with COPD, cystic fibrosis, and healthy individuals. In general, the authors agree that incremental and inspiratory muscle endurance tests are more appropriate for evaluating inspiratory muscle performance or for load titration during IMT, being reliable and reproducible, even given the limitations exposed. However, in terms of reliability and validity, traditional MIP and MEP measurements are still considered the gold standard for evaluating respiratory muscle strength and endurance. This review is limited by the great heterogeneity between studies. Most studies did not clearly describe the execution time, the interval between sets, and the number of repetitions. In addition, the present work did not evaluate the responsiveness of the tests due to the lack of data availability in most studies.

Conclusion

This review demonstrates that volitional tests vary in reliability and validity for measures of respiratory muscle strength and endurance. The more traditional tests, such as the MIP and MEP, presented greater validity and reliability values compared to the other tests. Other volitional tests lack sufficient studies evaluating their psychometric properties, especially in clinical populations.


Contributors

All authors contributed to the conception or design of the work, the acquisition, analysis, or interpretation of the data. All authors were involved in drafting and commenting on the paper and have approved the final version.

Funding

This study did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interests

All authors have completed the ICMJE uniform disclosure form and declare no conflict of interest.

Ethical approval

Not required for this article type. The data used to support this study can be found in the postgraduate repository of that university.

AI Statement

The authors confirm no generative AI or AI-assisted technology was used to generate content.