Abstract
Purpose
The integration of artificial intelligence (AI) in radiology has revolutionized diagnostics, optimizing precision and decision-making. Specifically in musculoskeletal imaging, AI tools can improve accuracy for upper extremity pathologies. This study aimed to assess the diagnostic performance of AI models in detecting musculoskeletal pathologies of the upper extremity using different imaging modalities.
Methods
A meta-analysis was conducted, involving searches on MEDLINE/PubMed, SCOPUS, Cochrane Library, Lilacs, and SciELO. The quality of the studies was assessed using the QUADAS-2 tool. Diagnostic accuracy measures including sensitivity, specificity, diagnostic odds ratio (DOR), positive and negative likelihood ratios (PLR, NLR), area under the curve (AUC), and summary receiver operating characteristic were pooled using a random-effects model. Heterogeneity and subgroup analyses were also included. All statistical analyses and plots were performed using the R software package.
Results
Thirteen models from ten articles were analyzed. The pooled sensitivity and specificity of the AI models for detecting musculoskeletal conditions in the upper extremity were 0.926 (95% CI: 0.900; 0.945) and 0.908 (95% CI: 0.810; 0.958), respectively. The PLR, NLR, lnDOR, and AUC estimates were 19.18 (95% CI: 8.90; 29.34), 0.11 (95% CI: 0.18; 0.46), 4.62 (95% CI: 4.02; 5.22; P < 0.001), and 95%, respectively.
Conclusion
The AI models exhibited strong univariate and bivariate performance in detecting both positive and negative cases within the analyzed dataset of musculoskeletal pathologies in the upper extremity.
Introduction
The technology behind artificial intelligence (AI) is not recent; however, its applications have spread rapidly across multiple areas of medicine in recent years (1). Radiology has been one of the disciplines that has benefited most, owing to the capacity of AI to analyze very large datasets simultaneously (2). In practical terms, AI has transformed the identification of injury patterns, improved the precision of diagnoses and procedures, shortened specialists’ image interpretation times, eased the strain on overburdened healthcare systems and, most importantly, supported professionals in their decision-making processes (3). Numerous medical specialties leverage AI-supported imaging. Examples include endocrinology for categorizing thyroid nodules, gynecology through digital mammography, and neurology for aiding in the management of patients with acute stroke (4).
Recently, AI techniques have been increasingly integrated into musculoskeletal radiology as diagnostic support for injuries in traumatology and orthopedics. Their implementation has been reported to enhance diagnostic accuracy, which in turn influences patients’ quality of life (5). Among the algorithms most widely used to enhance image quality by segmenting structures and reconstructing elements, those derived from machine learning stand out. Specifically, convolutional neural networks (CNNs) and other deep learning-based techniques have proven to be highly relevant (6). These techniques not only expedite the process but also enhance accuracy and enable automated real-time detection of lesions, helping to identify diagnostically important elements that might be overlooked by the human eye (7).
Injuries affecting the upper limb are among the most prevalent musculoskeletal injuries, not only because of their high frequency but also because of the significant disability they can cause. They impair quality of life as well as social, work, and sports activities, affecting various aspects of the well-being of those who suffer from them. Furthermore, they can follow a very prolonged course: some conditions, such as epicondylitis of the elbow, have been recorded as lasting over 24 months (8), and adhesive capsulitis of the shoulder can last more than 36 months, leading to various degrees of restricted movement, pain, and functional as well as emotional incapacity (9). Therefore, reducing patient recovery times efficiently demands the full integration of the entire medical team and the implementation of highly specific therapeutic approaches. A key strategy is to enhance diagnostic capability, as greater precision allows the available therapeutic alternatives to be selected individually for each patient (10). In this regard, various imaging tools support specialists in diagnosing musculoskeletal conditions. Among them, MRI stands out as a diagnostic reference, despite not being the gold standard, because of the detail with which it captures the affected structures. However, it is not always possible to opt for this examination because of its high cost and limited availability. As alternatives, other imaging resources such as computed tomography (CT) and soft tissue ultrasound are also available and are requested as complementary examinations according to the patient’s clinical picture.
Despite the musculoskeletal radiology field having developed a solid foundation for diagnosing various conditions over the years through the incorporation of these exams, a high level of uncertainty still prevails among professionals when it comes to making a diagnosis. Among the most studied reasons are the varying levels of specialist training, years of professional practice, image quality, the brands of medical equipment, the performance of certain stress tests on tissues to achieve better findings, and the potential simultaneous presence of more than one pathology (11).
For this reason, the development and implementation of artificial intelligence in the diagnostic field has gained significant momentum. This technology not only improves diagnostic accuracy but also enriches the decision-making process for the medical team (12). Recently, various studies have been conducted to enhance diagnostic capabilities in the field of musculoskeletal radiology (13). However, to the best of the authors’ knowledge, no article has yet integrated all pathologies affecting the upper extremity across all existing diagnostic imaging methods supported by these advanced technologies.
The objective of this meta-analysis was to assess the diagnostic performance of AI models in detecting musculoskeletal pathologies of the upper extremity using different imaging modalities.
Methods
Reporting
The present systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.
Research question
The research question aimed to evaluate the diagnostic performance of various AI models used to detect musculoskeletal pathologies in the upper limb across different imaging modalities. These models have been used as complementary elements in medical practice for several years, but they have recently gained significant traction as a means of enhancing diagnostic accuracy and supporting decision-making by healthcare professionals. The PICOT criteria (Population, Intervention, Comparison, Outcome, and Type of study) are detailed in Table 1.
Table 1. PICOT strategy for this study.
Acronym | Component | Explanation |
---|---|---|
(P) | Population | Patients diagnosed with upper limb pathology who have any type of diagnostic imaging |
(I) | Intervention | Any type of artificial intelligence model used for diagnostic purposes |
(C) | Comparison | Conventional diagnostic method |
(O) | Outcome | Evaluate the diagnostic performance of various AI models |
(T) | Type of study | Diagnostic study |
Search strategy and data sources
Two authors (GD and CR) conducted a systematic literature search in the following databases: MEDLINE/PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), SCOPUS (https://www.scopus.com/home.uri), Cochrane Library (https://www.cochranelibrary.com/), Lilacs (https://lilacs.bvsalud.org/en/), and SciELO (https://scielo.conicyt.cl/). Discrepancies were resolved by a third senior evaluator (FF). A time frame of 10 years up until August 2023 was considered. The keywords corresponding to the orthopedics field were validated by one of the co-authors, who is a specialist in upper extremity musculoskeletal conditions. The diagnostic area keywords were validated by one collaborator, who is a specialist radiologist. Finally, the terms related to artificial intelligence were validated by the senior author, who holds a PhD in engineering and specializes in AI. The selected search terms were ‘ligament’, ‘muscle’, ‘tendon’, ‘bone’, ‘convolutional neural network’, ‘deep learning’, ‘machine learning’, ‘artificial intelligence’, ‘accuracy’, ‘specificity’, ‘ROC’, ‘sensitivity’, ‘wrist’, ‘hand’, ‘shoulder’, and ‘elbow’. All combinations of these terms were used; free-text terms were not. Access was obtained to all of the articles found.
Selection criteria
The inclusion criteria were: (1) articles presenting musculoskeletal pathologies of the upper limb; (2) articles using biomedical image analysis; (3) articles including analysis techniques derived from AI; (4) articles reporting true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values from which sensitivity and specificity can be calculated, with each model considered in the analysis when an article presented more than one AI model with these metrics; (5) articles written in English, Spanish, and/or Portuguese; and (6) articles published within the 10 years up to August 2023. The exclusion criteria were: (1) review articles, letters, congress reports, papers, cadaveric articles, and technique descriptions; and (2) medical or technological devices, sensors, virtual reality, or any type of tangible (hardware) or intangible (software) objects that do not use AI algorithms.
Data extraction
Two independent researchers reviewed the titles, abstracts, and duplicate texts. Finally, the articles that met the inclusion and exclusion criteria were included. Discrepancies between the evaluators were resolved by a third reviewer. The following data were extracted: authors and year of publication, country of origin, musculoskeletal condition, number of patients, number of images, diagnostic algorithm, TP, FP, FN, and TN.
Ethical approval
All the studies adhered to the Helsinki Declaration’s principles and were approved by an ethical committee board. Prior informed consent was secured from all individual participants who were part of each study that was included.
Risk of bias (quality) assessment
To assess the quality of the studies and the presence of risk of bias in the selected studies, the instrument known as the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) guidelines (14) was used. This assigns three categories (low, unclear, or high) for each item. The following factors were assessed: (A) Risk of bias: (i) Patient selection; (ii) Index test; (iii) Reference standard; (iv) Flow and timing. (B) Applicability concerns: (i) Patient selection; (ii) Index test; (iii) Reference standard.
The patient selection domain considered how patients were chosen for the study. The index test domain describes the index test, including how it was administered and interpreted. The reference standard domain describes the reference standard (gold standard) and how it was carried out and interpreted. Finally, the flow and timing domain covers whether any patients did not undergo the index test or the reference standard, as well as the interval and any intervention between the index test and the reference standard.
Statistical analysis
Summary statistics were generated for the diagnostic accuracy of the tests based on the metrics of TPs, FNs, FPs, and TNs. Univariate and bivariate analyses were conducted for each of the AI models, following the guidelines recommended by Shim for meta-analysis development (15, 16).
Univariate analysis
The overall effect size was calculated from the number of events and the sample size, as the data are proportions. Sensitivity and specificity were calculated individually for each model and globally. Positive and negative likelihood ratios were also calculated. Proportions were combined on the logit scale and then back-transformed, and the Clopper-Pearson method was used for confidence intervals, because proportion-type data produce more stable results under this transformation. The pooled effect was estimated by calculating the diagnostic odds ratio (DOR) and its natural logarithm (lnDOR) so that the different results could be compared and combined. Forest plots were used to represent the results graphically.
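The per-model measures described above can be sketched in a few lines. The authors report using R; the following is only an illustrative Python sketch, and the function names (`diagnostic_metrics`, `logit`, `inv_logit`) and the 0.5 continuity correction for tables containing a zero cell are assumptions made for this example, not details taken from the study. The counts are those listed for Cohen et al. (17) in Table 2.

```python
import math


def diagnostic_metrics(tp, fp, fn, tn, correction=0.5):
    """Accuracy measures for a single 2x2 table (one AI model)."""
    # Assumption: when any cell is zero, add a continuity correction to every
    # cell so that the ratios below stay defined (e.g. a model with FN = 0).
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    plr = sensitivity / (1 - specificity)   # positive likelihood ratio
    nlr = (1 - sensitivity) / specificity   # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)             # equals plr / nlr
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "plr": plr,
        "nlr": nlr,
        "dor": dor,
        "ln_dor": math.log(dor),
    }


def logit(p):
    """Map a proportion in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Back-transform a log-odds value to a proportion."""
    return 1 / (1 + math.exp(-x))


# Counts for Cohen et al. (17), wrist fracture on radiographs (Table 2).
cohen = diagnostic_metrics(tp=206, fp=41, fn=17, tn=373)
for name, value in cohen.items():
    print(f"{name}: {value:.3f}")
```

Working on the logit scale keeps pooled proportions and their interval limits inside (0, 1) after back-transformation, which is the stability argument made in the paragraph above.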
Bivariate analysis
Subgroup analyses were conducted according to imaging type, coded by a grouping variable ‘g’. Studies using magnetic resonance imaging were assigned g = 0 and served as the reference group, while studies using other modalities, such as X-rays and ultrasound, were assigned g = 1. DORs were calculated for each subgroup, and the results were represented graphically using a forest plot. Summary measures of diagnostic accuracy were calculated using the AUC, and a summary receiver operating characteristic (SROC) curve was constructed.
Heterogeneity analysis
A random-effects model was applied because of the heterogeneity among the studies. Individual study weights were calculated using the inverse variance method, and τ², which represents the between-study variance, was estimated using the DerSimonian–Laird estimator. The proportion of variability attributable to heterogeneity rather than chance, both in the overall dataset and in subgroups, was assessed using Higgins’ I² statistic. Values between 0% and 40% suggest that heterogeneity might not be relevant, values between 30% and 60% indicate moderate heterogeneity, values between 50% and 90% signal substantial heterogeneity, and values between 75% and 100% denote considerable heterogeneity. Whether the observed variability exceeded that expected from sampling variability alone was tested using Cochran’s Q test.
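As a rough illustration of how these heterogeneity quantities relate to one another, the sketch below implements inverse-variance pooling with the DerSimonian–Laird estimator in plain Python. It is not the authors’ R code: the function names are hypothetical, and the choice of logit-transformed sensitivities with variance 1/TP + 1/FN as the effect sizes is an assumption made for the example. The three input rows use the TP and FN counts listed in Table 2 for Cohen et al. (17), Üreten et al. (25), and Tecle et al. (24).

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    k = len(effects)
    if k < 2:
        raise ValueError("at least two studies are required")
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    w_re = [1.0 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # Higgins' I^2 in percent
    h = max(1.0, math.sqrt(q / df))                       # H statistic
    return {"pooled": pooled, "se": math.sqrt(1.0 / sum(w_re)),
            "q": q, "tau2": tau2, "i2": i2, "h": h}


def i2_from_h(h):
    """I^2 (percent) implied by an H value, using I^2 = (H^2 - 1) / H^2."""
    return (h ** 2 - 1) / h ** 2 * 100


# (TP, FN) pairs taken from Table 2; the effect size is the logit of sensitivity.
rows = [(206, 17), (61, 7), (28, 6)]
effects = [math.log(tp / fn) for tp, fn in rows]
variances = [1 / tp + 1 / fn for tp, fn in rows]
result = dersimonian_laird(effects, variances)
sens_pooled = 1 / (1 + math.exp(-result["pooled"]))      # back-transform
print(f"pooled sensitivity: {sens_pooled:.3f}, I2: {result['i2']:.1f}%")
```

Because I² and H are both functions of Q and its degrees of freedom, a reported H value can be cross-checked against the reported I² with `i2_from_h`.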
Packages and reports
To conduct the analyses, the following packages were utilized: ‘mada’, ‘mvtnorm’, ‘ellipse’, ‘mvmeta’, ‘meta’, ‘metafor’, and ‘rmeta’ for the meta-analysis of diagnostic accuracy within the R statistical program. A significance level of <0.05 and 95% CIs were considered. Three decimal places were considered for reporting the results. All statistical analyses and plots were carried out using the R (v 4.1.3) software package.
Results
Search results
The studies included in this meta-analysis are presented in Fig. 1. Following the search strategy employed and access to various databases, a total of 3535 scientific articles were identified. After excluding those that did not meet the selection criteria, 60 references were carefully evaluated. However, only 42 articles were deemed eligible. Finally, the analysis could be conducted with just ten articles that provided all the necessary information (Fig. 1).
Studies' features
A total of 4341 patients contributed 31 880 images, including X-rays (six studies), ultrasound (one study), and magnetic resonance imaging (three studies). The included studies originated from different countries: the USA and Turkey contributed two studies each, and Chile, France, Korea, Romania, Switzerland, and Taiwan contributed one each. All selected articles presented various medical diagnoses exclusively affecting the upper limb. Participants of both genders and all ages were included. Different AI models were employed to assess diagnostic accuracy, and only the models that reported all metrics were considered. These include BoneView, CNN-1, CNN-2, DenseNet-161, EfficientNetB3, FSN-8-CNN, Inception-V3, nnU-Net, ResNet-50, ResNet-152, VGG-16, and VGG-19 (Table 2; 17, 18, 19, 20, 21, 22, 23, 24, 25, 26).
Table 2. Characteristics of the included studies.
References | Country | Imaging | Condition | N1 | N2 | TP | FP | FN | TN | Algorithm |
---|---|---|---|---|---|---|---|---|---|---|
Cohen et al. (17) | France | Rx | Wrist fracture | 637 | 1917 | 206 | 41 | 17 | 373 | BoneView (Gleamer) |
Georgeanu et al. (18) | Romania | MRI | Malignant bone tumors | 23 | 2106 | 80 | 20 | 4 | 63 | ResNet-50 |
Hess et al. (19) | Switzerland | MRI | Rotator cuff tear | 76 | 171 | 2 | 2 | 0 | 59 | nnU-Net |
Kim et al. (20) | Korea | Rx | Distal radio ulnar fracture | 2609 | 9984 | 271 | 66 | 29 | 624 | DenseNet-161 |
| | | | | | | 269 | 72 | 31 | 618 | ResNet-152 |
Lin et al. (21) | Taiwan | US | Bicipital peritendinous | – | 3801 | 72 | 7 | 3 | 18 | CNN-2 |
Ozkaya et al. (22) | Turkey | Rx | Scaphoid fracture | 390 | 390 | 28 | 46 | 4 | 12 | CNN-1 |
Saavedra et al. (23) | Chile | MRI | Supraspinatus fatty infiltration | – | 606 | 108 | 39 | 6 | 1512 | VGG-19 |
| | | | | | | 105 | 31 | 9 | 1520 | ResNet-50 |
| | | | | | | 99 | 29 | 15 | 1522 | Inception-V3 |
Tecle et al. (24) | USA | Rx | II metacarpal osteoporosis | – | 265 | 28 | 10 | 6 | 221 | FSN-8-CNN |
Üreten et al. (25) | Turkey | Rx | Wrist fracture | – | 275 | 61 | 2 | 7 | 65 | VGG-16 |
Yoon et al. (26) | USA | Rx | Scaphoid fracture | – | 11838 | 900 | 469 | 26 | 910 | EfficientNetB3 |
N1, number of patients; N2, number of images; Rx, radiography (X-ray); US, ultrasound; MRI, magnetic resonance imaging; TP, true positive; FP, false positive; FN, false negative; TN, true negative; –, not reported.
Risk of bias
The ten articles were assessed using the QUADAS-2 tool (Fig. 2). Only two studies showed a significant risk of bias in the ‘Flow and timing’ domain due to insufficiently detailed patient selection methods. Additionally, one study exhibited a high risk of bias in both the ‘Index test’ and ‘Reference standard’ domains, as it remained unclear whether the test results were interpreted without knowledge of the reference standard results. All studies adhered to rigorous reference standards, warranting a classification of low risk of bias in the remaining assessed domains.
Univariate analysis
The individual analyses of the 13 models reported in the 10 articles were considered. The pooled sensitivity was 0.926 (95% CI: 0.900; 0.945) in the random-effects model, indicating that the analyzed models correctly identified approximately 93% of positive cases in the evaluated dataset. In other words, when a case is truly positive, the models identify it about 93 times out of 100. The τ² of 0.2034 indicates variability in sensitivity among the included studies that cannot simply be attributed to chance. The I² value was 74% (95% CI: 55; 85), meaning that 74% of the variability in sensitivity is attributable to heterogeneity rather than chance. The H value for sensitivity was 1.96 (95% CI: 1.49; 2.58), also suggesting significant heterogeneity among the studies, as shown in Fig. 3.
The pooled specificity was 0.908 (95% CI: 0.810; 0.958) in the random-effects model, indicating that the models correctly identified approximately 91% of negative cases in the analyzed dataset. That is, when a case is truly negative, the models classify it correctly about 91 times out of 100. The τ² of 2.239 suggests that the results of individual studies differ beyond what can be attributed to chance. The I² value was 98.7% (95% CI: 98.4; 99), indicating that practically all the variability in specificity among the studies is attributable to heterogeneity rather than chance. The H value for specificity was 8.81 (95% CI: 7.88; 9.85), also suggesting significant heterogeneity among the studies, as shown in Fig. 4.
The PLR was 19.18 (95% CI: 8.90; 29.34). This means that, on average, the probability of obtaining a positive result in individuals with the condition of interest is approximately 19 times higher than the probability of obtaining a positive result in individuals without the condition. Therefore, the tests performed are quite effective in identifying individuals who have the condition under evaluation.
Furthermore, the obtained NLR was 0.11 (95% CI: 0.18; 0.46). This means that the probability of obtaining a negative result in individuals with the condition of interest is approximately 0.11 times the probability of obtaining a negative result in individuals without the condition. In other words, a negative result is far more likely in individuals who truly do not have the condition, so a low NLR indicates that the test is effective in ruling out the condition.
The pooled effect estimate provided by the DOR was 4.68 (95% CI: 3.91; 5.45) with P < 0.001. The DOR expresses how many times higher the odds of a positive result are in individuals with the condition than in individuals without it, so the estimate indicates that the diagnostic tests have moderately high discriminative power. The lnDOR yielded a slightly lower value of 4.62 (95% CI: 4.02; 5.22) with P < 0.001, as shown in Fig. 5.
Bivariate analysis
The estimated global odds ratio in the random-effects model was 108.211 (95% CI: 50.124; 233.611) with P < 0.0001. This result suggests a strong association between the compared categories, with high statistical significance. The τ² was 1.689 (95% CI: 0.715; 5.089), indicating substantial variability among the estimated effects of the different studies. The I² value was 87.5%, meaning that there is high heterogeneity among the studies included in the meta-analysis: the results of individual studies differ considerably from each other, and this cannot simply be attributed to chance. The heterogeneity statistic H was 2.83 (95% CI: 2.26; 3.55), also suggesting a considerable amount of variability between the studies. The individual values that allow the differences between the groups to be analyzed are presented in Fig. 6, which reports the OR, τ², I², and the Q statistic.
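An odds ratio and its natural logarithm carry the same information on different scales, which is worth keeping in mind when comparing the global odds ratio here with the DOR and lnDOR values given in the univariate analysis. The short Python sketch below simply takes the natural logarithm of the odds ratio and confidence limits quoted above; the variable names are illustrative only.

```python
import math

# Global odds ratio and 95% CI limits as quoted in the bivariate analysis.
or_estimate, or_lower, or_upper = 108.211, 50.124, 233.611

# Natural-log transformation, rounded to two decimals.
ln_values = [round(math.log(x), 2) for x in (or_estimate, or_lower, or_upper)]
print(ln_values)  # [4.68, 3.91, 5.45]
```

On the log scale these values coincide with the figures 4.68 (3.91; 5.45) reported for the DOR in the univariate analysis, which suggests that those figures may be expressed on the natural-log scale rather than as raw odds ratios.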
The estimated AUC was 95%, indicating a high probability that the models classify cases correctly and thus a high discriminative power. Finally, Fig. 7 displays the performance of the different AI models across various clinical contexts, providing a visual representation of the overall performance of the models under consideration.
Discussion
The earliest records of image processing using programming techniques date back to the 1960s (27, 28). However, diagnostic elements using machine learning, specifically supervised analysis where users define parameters and features based on expert criteria, began to be more frequently employed 20 years later. Among the most utilized algorithms were principal component analysis, support vector machines, and the first convolutional neural networks (29).
This meta-analysis demonstrates that various machine learning models exhibit high diagnostic accuracy for detecting musculoskeletal pathologies in the upper extremity across different imaging modalities. The pooled sensitivity from all articles included in this review was 93%, and the pooled specificity was 91%, indicating their ability to correctly identify positive cases and exclude negative ones. Subgroup analysis yielded high odds ratios, indicating that these strategies show high diagnostic accuracy for musculoskeletal pathologies across various medical imaging scenarios. However, it is noteworthy that within group 0, a single article contributes three different diagnostic models, which warrants further attention and calls for more research in this area. In the last 5 years, the number of scientific publications using machine learning tools has increased, both for research development and for the application of models that identify diagnostic elements in biomedical images (30).
Nevertheless, machine learning has not yet become widespread in the musculoskeletal field, which lags far behind disciplines such as oncology, pulmonology, and cardiovascular medicine (31). On the other hand, more than 3500 published articles were related to the area of interest of this meta-analysis. However, the lack of standardized elements in reporting the results of studies that use machine learning tools for image diagnostics hinders this type of analysis. Therefore, guidelines are needed for the methodological structure of future publications that use machine learning for diagnostic support in biomedical images. We emphasize that a guiding framework has recently been published for reporting statistics in biomedical studies on machine learning and AI, which will eventually contribute to improving the standardization of these reports in the musculoskeletal field (32). In this context, it is particularly important for clinicians to adopt standards for quality assessment when applying AI in diagnostic studies. A recent report highlighted an incomplete adoption of tools in diagnostic accuracy studies based on algorithm use, pointing out inconsistencies in all domains of quality assessment, which become barriers to clinical implementation (33). At the same time, global initiatives have been proposed for the development of standardized reports for the use of AI in healthcare, demonstrating the need to advance in these multidisciplinary issues (34). Given the considerable diversity in existing dataset reports, it is imperative to establish standards for their presentation in the healthcare domain, particularly in applications that rely on AI (35).
This meta-analysis's significant strength lies in demonstrating a high level of accuracy in detecting and classifying musculoskeletal conditions from various sources of images. Given the abundance of publications found, it is suggested that this man–machine partnership to enhance diagnostic precision in imaging will continue to evolve over time. Furthermore, the high diagnostic metrics detected in this meta-analysis strongly suggest that machine learning is an effective tool for identifying complex patterns of injuries that may sometimes be overlooked by expert eyes. Therefore, this will translate into a positive impact on public health, as it will provide more elements for diagnosis, ultimately improving decision-making, especially in scenarios with high diagnostic uncertainty, such as musculoskeletal pathologies.
There are some limitations to this study. First, we highlight that there was high heterogeneity among the evaluated studies. This could be attributed not only to the fact that the selected articles did not follow a uniform methodological structure for publication but also to the diversity in the origin of the publications. Each region may have specific strategies in the development of their scientific advances in this field. For this reason, to minimize this high variability, we opted to conduct the analyses using a random-effects model.
Secondly, the analyzed AI algorithms were diverse, which could potentially lead to greater variations in the results. Additionally, various pathologies were included, each with markedly different diagnostic criteria, which could potentially introduce errors in the datasets. Furthermore, the reference methods were not always the ideal gold standard, which is typically surgical intervention. MRI was considered an important validation method for the diagnoses.
For this reason, analyzing multiple AI models in musculoskeletal pathologies has several implications. For instance, achieving higher diagnostic accuracy is possible by exclusively utilizing artificial neural networks rather than traditional logistic regression models (36). The revolution in musculoskeletal diagnosis, facilitated by implementing complex algorithms, will set the stage for a gradual transition toward precision medicine. This transformative approach will convert diverse clinical manifestations into personalized diagnoses (37). Furthermore, it will be essential to possess representative datasets encompassing all potential musculoskeletal conditions under study to ensure accurate analyses and mitigate diagnostic biases. As such, it is crucial to remain mindful of the stages of data collection, preprocessing, model development, validation, and comprehensive implementation (38). Clinical professionals specializing in musculoskeletal diagnosis must undergo training to comprehend how models make decisions. This understanding is crucial for their effective collaboration within interdisciplinary teams (39). Therefore, to promote ongoing research effectively, a comprehensive plan should emphasize the development of specific models to address new pathologies or enhance existing ones. This approach will drive sustained innovation at the intersection of AI and the musculoskeletal field (40).
Thirdly and finally, the search strategy and selection criteria were quite stringent to enable systematic analysis, which excluded further articles on musculoskeletal injuries that did not meet our search parameters. For instance, only articles reporting diagnostic metrics such as TP, FP, FN, and TN were considered, significantly reducing the number of final articles for analysis. Additionally, considering the reported AI models or algorithms as units of analysis rather than the articles themselves could potentially introduce a selection bias. Therefore, although the results of this meta-analysis are encouraging and support the value of AI-derived tools for diagnostic support in biomedical imaging, they should be interpreted with caution by experts in the field.
Conclusion
The evaluated AI models demonstrated high univariate and bivariate accuracy in detecting positive and negative cases of musculoskeletal pathologies of the upper extremity in the analyzed dataset. Future research should consider evaluating the same type of model and conducting studies that use AI tools in real time.
Multiple challenges arise in the implementation of AI models for diagnosing musculoskeletal pathologies. These challenges include the complexity of clinical data associated with these conditions, inter- and intra-operator variability in interpreting the same image result, effective integration of models into clinical practice, and the need to externally validate model results. Nevertheless, some strategic solutions are proposed, such as the advanced integration of data processing and recognition techniques with a greater quantity and quality of information, providing more accessible presentation tools generated by the models, facilitating interoperability with electronic medical record systems, and ultimately conducting a greater number of studies in diverse musculoskeletal clinical environments to enhance result generalization.
ICMJE Conflict of Interest Statement
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the study reported.
Funding Statement
The study did not receive any specific grant from any funding agency in the public, commercial, or not-for-profit sector. This article was funded for publication by the Universidad Mayor.
Author contribution statement
Conceptualization: GD; software and statistical analysis: GD; data curation: GD and CR; writing-original draft preparation: GD; writing-review: CJ and FF; supervision and editing: FF. All authors have read and agreed to the final version of the manuscript.
Acknowledgement
The authors would like to extend special thanks to the Sport Medicine Data Science Center MEDS-PUCV.
References
- 1↑
Bohr A, & Memarzadeh K. The rise of artificial intelligence in healthcare applications. Artificial Intelligence in Healthcare 2020 25–60. (https://doi.org/10.1016/B978-0-12-818438-7.00002-2)
- 2↑
Jian J, Xia W, Zhang R, Zhao X, Zhang J, Wu X, Li Y, Qiang J, & Gao X. Multiple instance convolutional neural network with modality-based attention and contextual multi-instance learning pooling layer for effective differentiation between borderline and malignant epithelial ovarian tumors. Artificial Intelligence in Medicine 2021 121 102194. (https://doi.org/10.1016/j.artmed.2021.102194)
- 3↑
Van Leeuwen KG, De Rooij M, Schalekamp S, Van Ginneken B, & Rutten MJCM. How does artificial intelligence in radiology improve efficiency and health outcomes? Pediatric Radiology 2022 52 2087–2093. (https://doi.org/10.1007/s00247-021-05114-8)
- 4↑
Krishnan G, Singh S, Pathania M, Gosavi S, Abhishek S, Parchani A, & Dhar M. Artificial intelligence in clinical medicine: catalyzing a sustainable global healthcare paradigm. Frontiers in Artificial Intelligence 2023 6 1227091. (https://doi.org/10.3389/frai.2023.1227091)
- 5↑
Lisacek-Kiosoglous AB, Powling AS, Fontalis A, Gabr A, Mazomenos E, & Haddad FS. Artificial intelligence in orthopaedic surgery: exploring its applications, limitations, and future direction. Bone and Joint Research 2023 12 447–454. (https://doi.org/10.1302/2046-3758.127.BJR-2023-0111.R1)
- 6↑
Sarker IH. Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science 2021 2 420. (https://doi.org/10.1007/s42979-021-00815-1)
- 7↑
Pelaccia T, & Forestier G. Deconstructing the diagnostic reasoning of human versus artificial intelligence. CMAJ 2019 2 1332–1337. (https://doi.org/10.1503/cmaj.190506)
- 8↑
Droppelmann G, Feijoo F, Greene C, Tello M, Rosales J, Yáñez R, Jorquera C, & Prieto D. Ultrasound findings in lateral elbow tendinopathy: a retrospective analysis of radiological tendon features [Version 1; peer review: 2 approved with reservations]. F1000Research 2022 11 1–15. (https://doi.org/10.12688/f1000research.73441.1)
- 9↑
Masuko K, Saraiva F, Navarro-Ledesma S, De La Serna D, Alayón F, López E, & Pruimboom L. A comprehensive view of frozen shoulder: a mystery syndrome. Frontiers in Medicine 2021 8 663703. (https://doi.org/10.3389/fmed.2021.663703)
- 10↑
Vetter TR, Schober P, & Mascha EJ. Diagnostic testing and decision-making: beauty is not just in the eye of the beholder. Anesthesia and Analgesia 2018 127 1085–1091. (https://doi.org/10.1213/ANE.0000000000003698)
- 11↑
Quinn L, Tryposkiadis K, Deeks J, De Vet HCW, Mallett S, Mokkink LB, Takwoingi Y, Taylor-Phillips S, & Sitch A. Interobserver variability studies in diagnostic imaging: a methodological systematic review. British Journal of Radiology 2023 96 20220972. (https://doi.org/10.1259/bjr.20220972)
- 12↑
Kumar Y, Koul A, Singla R, & Ruchi MF. Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda. Journal of Ambient Intelligence and Humanized Computing 2023 14 8459–8486. (https://doi.org/10.1007/s12652-021-03612-z)
- 13↑
Gyftopoulos S, Lin D, Knoll F, Doshi AM, Rodrigues TC, & Recht MP. Artificial intelligence in musculoskeletal imaging: current status and future directions. American Journal of Roentgenology 2019 213 506–513. (https://doi.org/10.2214/AJR.19.21117)
- 14↑
Whiting P. QUADAS-2 | Bristol Medical School: population health sciences | University of Bristol [Internet]. 2011. Available at: https://www.bristol.ac.uk/population-health-sciences/projects/quadas/quadas-2/
- 15↑
Shim SR, Kim SJ, & Lee J. Diagnostic test accuracy: application and practice using R software. Epidemiology and Health 2019 41 1–8. (https://doi.org/10.4178/epih.e2019007)
- 16↑
Shim SR. Meta-analysis of diagnostic test accuracy studies with multiple thresholds for data integration. Epidemiology and Health 2022 44 e2022083. (https://doi.org/10.4178/epih.e2022083)
- 17↑
Cohen M, Puntonet J, Sanchez J, Kierszbaum E, Crema M, Soyer P, & Dion E. Artificial intelligence vs. radiologist: accuracy of wrist fracture detection on radiographs. European Radiology 2023 33 3974–3983. (https://doi.org/10.1007/s00330-022-09349-3)
- 18↑
Georgeanu VA, Mămuleanu M, Ghiea S, & Selișteanu D. Malignant bone tumors diagnosis using magnetic resonance imaging based on deep learning algorithms. Medicina 2022 58 636. (https://doi.org/10.3390/medicina58050636)
- 19↑
Hess H, Ruckli AC, Bürki F, Gerber N, Menzemer J, Burger J, Schär M, Zumstein MA, & Gerber K. Deep-learning-based segmentation of the shoulder from MRI with inference accuracy prediction. Diagnostics 2023 13 1–13. (https://doi.org/10.3390/diagnostics13101668)
- 20↑
Kim MW, Jung J, Park SJ, Park YS, Yi JH, Yang WS, Kim JH, Cho BJ, & Ha SO. Application of convolutional neural networks for distal radio-ulnar fracture detection on plain radiographs in the emergency room. Clinical and Experimental Emergency Medicine 2021 8 120–127. (https://doi.org/10.15441/ceem.20.091)
- 21↑
Lin BS, Chen JL, Tu YH, Shih YX, Lin YC, Chi WL, & Wu Y. Using deep learning in ultrasound imaging of bicipital peritendinous effusion to grade inflammation severity. IEEE Journal of Biomedical and Health Informatics 2020 24 1037–1045. (https://doi.org/10.1109/JBHI.2020.2968815)
- 22↑
Ozkaya E, Topal FE, Bulut T, Gursoy M, Ozuysal M, & Karakaya Z. Evaluation of an artificial intelligence system for diagnosing scaphoid fracture on direct radiography. European Journal of Trauma and Emergency Surgery 2022 48 585–592. (https://doi.org/10.1007/s00068-020-01468-0)
- 23↑
Saavedra JP, Droppelmann G, García N, Jorquera C, & Feijoo F. High-accuracy detection of supraspinatus fatty infiltration in shoulder MRI using convolutional neural network algorithms. Frontiers in Medicine 2023 10 1–12. (https://doi.org/10.3389/fmed.2023.1070499)
- 24↑
Tecle N, Teitel J, Morris MR, Sani N, Mitten D, & Hammert WC. Convolutional neural network for second metacarpal radiographic osteoporosis screening. Journal of Hand Surgery 2020 45 175–181. (https://doi.org/10.1016/j.jhsa.2019.11.019)
- 25↑
Üreten K, Sevinç HF, İğdeli U, Onay A, & Maraş Y. Use of deep learning methods for hand fracture detection from plain hand radiographs. Ulusal Travma ve Acil Cerrahi Dergisi 2022 28 196–201. (https://doi.org/10.14744/tjtes.2020.06944)
- 26↑
Yoon AP, Lee YL, Kane RL, Kuo CF, Lin C, & Chung KC. Development and validation of a deep learning model using convolutional neural networks to identify scaphoid fractures in radiographs. JAMA Network Open 2021 4 e216096. (https://doi.org/10.1001/jamanetworkopen.2021.6096)
- 27↑
Malik P, Malik P, Pathania M, & Rathaur V. Overview of artificial intelligence in medicine. Journal of Family Medicine and Primary Care 2019 8 2328–2331. (https://doi.org/10.4103/jfmpc.jfmpc_440_19)
- 28↑
Lodwick GS, Keats TE, & Dorst JP. The coding of roentgen images for computer analysis as applied to lung cancer. Radiology 1963 81 185–200. (https://doi.org/10.1148/81.2.185)
- 29↑
Hosny A, Parmar C, Quackenbush J, Schwartz LH, & Aerts HJWL. Artificial intelligence in radiology. Nature Reviews Cancer 2018 18 500–510. (https://doi.org/10.1038/s41568-018-0016-5)
- 30↑
Miller LE, Bhattacharyya D, Miller VM, & Bhattacharyya M. Recent trend in artificial intelligence-assisted biomedical publishing: a quantitative bibliometric analysis. Cureus 2023 15 e39224. (https://doi.org/10.7759/cureus.39224)
- 31↑
Bitkina OV, Park J, & Kim HK. Application of artificial intelligence in medical technologies: a systematic review of main trends. Digital Health 2023 9 20552076231189331. (https://doi.org/10.1177/20552076231189331)
- 32↑
Polce EM, & Kunze KN. A guide for the application of statistics in biomedical studies concerning machine learning and artificial intelligence. Arthroscopy 2023 39 151–158. (https://doi.org/10.1016/j.arthro.2022.04.016)
- 33↑
Jayakumar S, Sounderajah V, Normahani P, Harling L, Markar SR, Ashrafian H, & Darzi A. Quality assessment standards in artificial intelligence diagnostic accuracy systematic reviews: a meta-research study. NPJ Digital Medicine 2022 5 1–13. (https://doi.org/10.1038/s41746-021-00544-y)
- 34↑
Hernandez-Boussard T, Bozkurt S, Ioannidis JPA, & Shah NH. MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care. Journal of the American Medical Informatics Association 2020 27 1–5. Available at: https://academic.oup.com/jamia/article/27/12/2011/5864179
- 35↑
Arora A, Alderman JE, Palmer J, Ganapathi S, Laws E, McCradden MD, Oakden-Rayner L, Pfohl SR, Ghassemi M, McKay F, et al. The value of standards for health datasets in artificial intelligence-based applications. Nature Medicine 2023 29 2929–2938. (https://doi.org/10.1038/s41591-023-02608-w)
- 36↑
Qiu F, Li J, Zhang R, & Legerlotz K. Use of artificial neural networks in the prognosis of musculoskeletal diseases-a scoping review. BMC Musculoskeletal Disorders 2023 24 86. (https://doi.org/10.1186/s12891-023-06195-2)
- 37↑
Johnson KB, Wei WQ, Weeraratne D, Frisse ME, Misulis K, Rhee K, Zhao J, & Snowdon JL. Precision medicine, AI, and the future of personalized health care. Clinical and Translational Science 2021 14 86–93. (https://doi.org/10.1111/cts.12884)
- 38↑
Nazer LH, Zatarah R, Waldrip S, Ke JXC, Moukheiber M, Khanna AK, Hicklen RS, Moukheiber L, Moukheiber D, Ma H, et al. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digital Health 2023 2 e0000278. (https://doi.org/10.1371/journal.pdig.0000278)
- 39↑
Grote T, & Berens P. On the ethics of algorithmic decision-making in healthcare. Journal of Medical Ethics 2020 46 205–211. (https://doi.org/10.1136/medethics-2019-105586)
- 40↑
Debs P, & Fayad LM. The promise and limitations of artificial intelligence in musculoskeletal imaging. Frontiers in Radiology 2023 3 1242902. (https://doi.org/10.3389/fradi.2023.1242902)