-
Understanding the Rare Inflammatory Disease Using Large Language Models and Social Media Data
Authors:
Nan Miles Xi,
Hong-Long Ji,
Lin Wang
Abstract:
Sarcoidosis is a rare inflammatory disease characterized by the formation of granulomas in various organs. The disease presents diagnostic and treatment challenges due to its diverse manifestations and unpredictable nature. In this study, we employed a Large Language Model (LLM) to analyze sarcoidosis-related discussions on the social media platform Reddit. Our findings underscore the efficacy of LLMs in accurately identifying sarcoidosis-related content. We discovered a wide array of symptoms reported by patients, with fatigue, swollen lymph nodes, and shortness of breath as the most prevalent. Prednisone was the most prescribed medication, while infliximab showed the highest effectiveness in improving prognoses. Notably, our analysis revealed disparities in prognosis based on age and gender, with women and younger patients experiencing good and polarized outcomes, respectively. Furthermore, unsupervised clustering identified three distinct patient subgroups (phenotypes) with unique symptom profiles, prognostic outcomes, and demographic distributions. Finally, sentiment analysis revealed a moderate negative impact on patients' mental health post-diagnosis, particularly among women and younger individuals. Our study represents the first application of LLMs to understand sarcoidosis through social media data. It contributes to understanding the disease by providing data-driven insights into its manifestations, treatments, prognoses, and impact on patients' lives. Our findings have direct implications for improving personalized treatment strategies and enhancing the quality of care for individuals living with sarcoidosis.
Submitted 12 May, 2024;
originally announced May 2024.
-
Predicting Survival of Tongue Cancer Patients by Machine Learning Models
Authors:
Angelos Vasilopoulos,
Nan Miles Xi
Abstract:
Tongue cancer is a common oral cavity malignancy that originates in the mouth and throat. Much effort has been invested in improving its diagnosis, treatment, and management. Surgical removal, chemotherapy, and radiation therapy remain the major treatments for tongue cancer. The survival of patients determines the treatment effect. Previous studies have identified certain survival and risk factors based on descriptive statistics, ignoring the complex, nonlinear relationship among clinical and demographic variables. In this study, we utilize five cutting-edge machine learning models and clinical data to predict the survival of tongue cancer patients after treatment. Five-fold cross-validation, bootstrap analysis, and permutation feature importance are applied to estimate and interpret model performance. The prognostic factors identified by our method are consistent with previous clinical studies. Our method is accurate, interpretable, and thus useable as additional evidence in tongue cancer treatment and management.
Submitted 22 December, 2022;
originally announced December 2022.
-
Tuning hyperparameters of doublet-detection methods for single-cell RNA sequencing data
Authors:
Nan Miles Xi,
Angelos Vasilopoulos
Abstract:
The existence of doublets in single-cell RNA sequencing (scRNA-seq) data poses a great challenge in downstream data analysis. Computational doublet-detection methods have been developed to remove doublets from scRNA-seq data. Yet, the default hyperparameter settings of those methods may not provide optimal performance. Here, we propose a strategy to tune hyperparameters for a cutting-edge doublet-detection method. We utilize a full factorial design to explore the relationship between hyperparameters and detection accuracy on 16 real scRNA-seq datasets. The optimal hyperparameters are obtained by a response surface model and convex optimization. We show that the optimal hyperparameters provide top performance across scRNA-seq datasets under various biological conditions. Our tuning strategy can be applied to other computational doublet-detection methods. It also offers insights into hyperparameter tuning for broader computational methods in scRNA-seq data analysis.
Submitted 5 February, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Prediction of Drug-Induced TdP Risks Using Machine Learning and Rabbit Ventricular Wedge Assay
Authors:
Jaela Foster-Burns,
Nan Miles Xi
Abstract:
Torsades de pointes (TdP) is an irregular heart rhythm that can arise as a side effect of drugs and may cause sudden cardiac death. A machine learning model that can accurately identify drug TdP risk is necessary. This study uses multinomial logistic regression models to predict three-class drug TdP risks based on datasets generated from rabbit ventricular wedge assay experiments. The training-test split and five-fold cross-validation provide unbiased measurements for prediction accuracy. We utilize bootstrap to construct a 95% confidence interval for prediction accuracy. The model interpretation is further demonstrated by permutation predictor importance. Our study offers an interpretable modeling method suitable for drug TdP risk prediction. Our method can be easily generalized to broader applications of drug side effect assessment.
Submitted 8 October, 2022;
originally announced October 2022.
-
Improving The Diagnosis of Thyroid Cancer by Machine Learning and Clinical Data
Authors:
Nan Miles Xi,
Lin Wang,
Chuanjia Yang
Abstract:
Thyroid cancer is a common endocrine carcinoma that occurs in the thyroid gland. Much effort has been invested in improving its diagnosis, and thyroidectomy remains the primary treatment method. A successful operation without unnecessary side injuries relies on an accurate preoperative diagnosis. Current human assessment of thyroid nodule malignancy is prone to errors and may not guarantee an accurate preoperative diagnosis. This study proposed a machine learning framework to predict thyroid nodule malignancy based on a novel clinical dataset we collected. The 10-fold cross-validation, bootstrap analysis, and permutation predictor importance were applied to estimate and interpret the model performance under uncertainty. The comparison between model prediction and expert assessment shows the advantage of our framework over human judgment in predicting thyroid nodule malignancy. Our method is accurate, interpretable, and thus useable as additional evidence in the preoperative diagnosis for thyroid cancer.
Submitted 27 March, 2022;
originally announced March 2022.
-
Prediction of Drug-Induced TdP Risks Using Machine Learning and Rabbit Ventricular Wedge Assay
Authors:
Nan Miles Xi,
Dalong Patrick Huang
Abstract:
The evaluation of drug-induced Torsades de pointes (TdP) risks is crucial in drug safety assessment. In this study, we discuss machine learning approaches in the prediction of drug-induced TdP risks using preclinical data. Specifically, the random forest model was trained on the dataset generated by the rabbit ventricular wedge assay. The model prediction performance was measured on 28 drugs from the Comprehensive In Vitro Proarrhythmia Assay initiative. Leave-one-drug-out cross-validation provided an unbiased estimation of model performance. Stratified bootstrap revealed the uncertainty in the asymptotic model prediction. Our study validated the utility of machine learning approaches in predicting drug-induced TdP risks from preclinical data. Our methods can be extended to other preclinical protocols and serve as a supplementary evaluation in drug safety assessment.
Submitted 14 January, 2022;
originally announced January 2022.
-
Statistical Learning in Preclinical Drug Proarrhythmic Assessment
Authors:
Nan Miles Xi,
Yu-Yi Hsu,
Qianyu Dang,
Dalong Patrick Huang
Abstract:
Torsades de pointes (TdP) is an irregular heart rhythm characterized by faster beat rates that could potentially lead to sudden cardiac death. Much effort has been invested in understanding drug-induced TdP in preclinical studies. However, a comprehensive statistical learning framework that can accurately predict the drug-induced TdP risk from preclinical data is still lacking. We proposed ordinal logistic regression and ordinal random forest models to predict low-, intermediate-, and high-risk drugs based on datasets generated from two experimental protocols. Leave-one-drug-out cross-validation, stratified bootstrap, and permutation predictor importance were applied to estimate and interpret the model performance under uncertainty. The potential outlier drugs identified by our models are consistent with their descriptions in the literature. Our method is accurate, interpretable, and thus useable as supplemental evidence in the drug safety assessment.
Submitted 7 January, 2022; v1 submitted 1 August, 2021;
originally announced August 2021.
-
Protocol for Executing and Benchmarking Eight Computational Doublet-Detection Methods in Single-Cell RNA Sequencing Data Analysis
Authors:
Nan Miles Xi,
Jingyi Jessica Li
Abstract:
The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed for detecting doublets from scRNA-seq data. We developed an R package, DoubletCollection, to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol for using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.
Submitted 25 June, 2021; v1 submitted 21 January, 2021;
originally announced January 2021.
-
The Duopoly Analysis of Graphics Card Market
Authors:
Nan Miles Xi
Abstract:
By analyzing the duopoly market of computer graphics cards, we categorized the effects of an enterprise's technological progress into two types, namely, cost reduction and product diversification. Our model proved that technological progress is the most effective means for enterprises in this industry to increase profits. Due to the technology-intensive nature of this industry, monopolistic enterprises face more intense competition compared with traditional manufacturing. Therefore, they have more motivation for technological innovation. Enterprises aiming at maximizing profits have incentives to reduce costs and achieve a higher degree of product differentiation through technological innovation.
Submitted 18 October, 2020;
originally announced October 2020.