Cancer Health Disparities Drivers with BERTopic Modelling and PyCaret Evaluation

Abstract

The complex interplay of social, behavioural, lifestyle, environmental, health system, and natural health variables contributes to disparities in cancer treatment across racial and ethnic groups. Consequently, it is necessary to identify the variables contributing to cancer health inequalities and to develop strategies to achieve health equality. PubMed abstracts on cancer health disparities were scraped with the Bio.Entrez Python package. The data were preprocessed with regular expressions and the Natural Language Toolkit (NLTK), and topics were modelled with BERTopic, which uses embeddings and c-TF-IDF to construct dense clusters, in order to analyse the top topics linked with cancer health disparities. The model was evaluated with the PyCaret coherence score, and a web app was deployed with Streamlit. The results showed that Topic 32, with the terms obese, female, male, school, survey, student, post, and discrepancy, had the best coherence score of 0.3687. In contrast, Topic 8, with the terms prevalence, adult, income, high, usage, diabetes, education, elderly, change, and low, received the lowest coherence score of 0.3255. The model ranks the words in each topic by their scores, surfaces granular topic concerns and trends related to cancer health disparities, and investigates the connections between drivers of cancer health disparities; the topics are evaluated by their coherence score values.
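
The preprocessing and c-TF-IDF steps mentioned above can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the stop-word list and toy documents are hypothetical stand-ins (the study uses NLTK and PubMed abstracts), and the topic assignments are given by hand rather than produced by BERTopic clustering. The weighting follows the c-TF-IDF form documented for BERTopic, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the normalised frequency of term t in topic c, A is the average number of words per topic, and f(t) is the frequency of t across all topics.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; the study uses NLTK.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "with", "among"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: weight}} with W = tf * log(1 + A / f)."""
    # Treat all documents of a topic as one concatenated "class document".
    counts = {
        topic: Counter(tok for doc in docs for tok in preprocess(doc))
        for topic, docs in docs_by_topic.items()
    }
    # f(t): frequency of each term across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    weights = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values()) or 1  # guard against empty topics
        weights[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in topic_counts.items()
        }
    return weights


def top_terms(weights, topic, k=3):
    """Top-k terms of a topic, ties broken alphabetically."""
    ranked = sorted(weights[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy documents, already grouped into two hand-assigned topics.
toy = {
    0: ["Obesity prevalence among female students",
        "Survey of obesity in school students"],
    1: ["Income and education among elderly adults",
        "Diabetes prevalence in low income adults"],
}

weights = c_tf_idf(toy)
for topic in toy:
    print(topic, top_terms(weights, topic))
```

In this sketch a term that appears in both topics (such as "prevalence") is down-weighted relative to a term that is specific to one topic, which is the property that makes c-TF-IDF useful for labelling clusters with their most distinctive words.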
