11 research outputs found

    ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments

    Background: The existence of toxic conversations in open-source platforms can degrade relationships among software developers and may negatively impact software product quality. To help mitigate this, some initial work has been done to detect toxic comments in the Software Engineering (SE) domain. Aims: Since automatically classifying an entire text as toxic or non-toxic does not help human moderators understand the specific reason(s) for toxicity, we worked to develop an explainable toxicity detector for the SE domain. Method: Our explainable toxicity detector can detect specific spans of toxic content in SE texts, which can help human moderators by automatically highlighting those spans. This toxic span detection model, ToxiSpanSE, is trained on 19,651 code review (CR) comments with labeled toxic spans. Our annotators labeled the toxic spans within 3,757 toxic CR samples. We explored several types of models, including one lexicon-based approach and five different transformer-based encoders. Results: After an extensive evaluation of all models, we found that our fine-tuned RoBERTa model achieved the best score with 0.88 F1, 0.87 precision, and 0.93 recall for toxic class tokens, providing an explainable toxicity classifier for the SE domain. Conclusion: Since ToxiSpanSE is the first tool to detect toxic spans in the SE domain, this tool will pave a path to combat toxicity in the SE community.

    Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments

    In an industry dominated by straight men, many developers representing other gender identities and sexual orientations often encounter hateful or discriminatory messages. Such communications pose barriers to participation for women and LGBTQ+ persons. Due to sheer volume, manual inspection of all communications for discriminatory content is infeasible for a large-scale Free/Libre Open-Source Software (FLOSS) community. To address this challenge, this study aims to develop an automated mechanism to identify Sexual orientation and Gender identity Discriminatory (SGID) texts from software developers' communications. Toward this goal, we trained and evaluated SGID4SE (Sexual orientation and Gender Identity Discriminatory text identification for (4) Software Engineering texts) as a supervised learning-based SGID detection tool. SGID4SE incorporates six preprocessing steps and ten state-of-the-art algorithms. SGID4SE implements six different strategies to improve the performance of the minority class. We empirically evaluated each strategy and identified an optimum configuration for each algorithm. In our ten-fold cross-validation-based evaluations, a BERT-based model achieves the best performance with 85.9% precision, 80.0% recall, and 82.9% F1-Score for the SGID class. This model achieves 95.7% accuracy and 80.4% Matthews Correlation Coefficient. Our dataset and tool establish a foundation for further research in this direction.

    BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset

    While strides have been made in deep learning based Bengali Optical Character Recognition (OCR) in the past decade, the absence of large Document Layout Analysis (DLA) datasets has hindered the application of OCR in document transcription, e.g., transcribing historical documents and newspapers. Moreover, rule-based DLA systems that are currently employed in practice are not robust to domain variations and out-of-distribution layouts. To this end, we present the first large multi-domain Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human-annotated document samples from six domains: (i) books and magazines, (ii) public domain govt. documents, (iii) liberation war documents, (iv) newspapers, (v) historical newspapers, and (vi) property deeds, with 710K polygon annotations for four unit types: text-box, paragraph, image, and table. Through preliminary experiments benchmarking the performance of existing state-of-the-art deep learning architectures for English DLA, we demonstrate the efficacy of our dataset in training deep learning based Bengali document digitization models.

    Exposure-Based Screening for Nipah Virus Encephalitis, Bangladesh

    We measured the performance of exposure screening questions to identify Nipah virus encephalitis in hospitalized encephalitis patients during the 2012–13 Nipah virus season in Bangladesh. The sensitivity (93%), specificity (82%), positive predictive value (37%), and negative predictive value (99%) results suggested that screening questions could more quickly identify persons with Nipah virus encephalitis.

    OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking

    We present OOD-Speech, the first out-of-distribution (OOD) benchmarking dataset for Bengali automatic speech recognition (ASR). Being one of the most spoken languages globally, Bengali portrays large diversity in dialects and prosodic features, which demands ASR frameworks to be robust towards distribution shifts. For example, Islamic religious sermons in Bengali are delivered with a tonality that is significantly different from regular speech. Our training dataset is collected via massively online crowdsourcing campaigns which resulted in 1177.94 hours collected and curated from 22,645 native Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of speech collected and manually annotated from 17 different sources, e.g., Bengali TV drama, Audiobook, Talk show, Online class, and Islamic sermons to name a few. OOD-Speech is jointly the largest publicly available speech dataset, as well as the first out-of-distribution ASR benchmarking dataset for Bengali.

    Transmission of Nipah Virus - 14 Years of Investigations in Bangladesh

    Background: Nipah virus is a highly virulent zoonotic pathogen that can be transmitted between humans. Understanding the dynamics of person-to-person transmission is key to designing effective interventions. Methods: We used data from all Nipah virus cases identified during outbreak investigations in Bangladesh from April 2001 through April 2014 to investigate case-patient characteristics associated with onward transmission and factors associated with the risk of infection among patient contacts. Results: Of 248 Nipah virus cases identified, 82 were caused by person-to-person transmission, corresponding to a reproduction number (i.e., the average number of secondary cases per case patient) of 0.33 (95% confidence interval [CI], 0.19 to 0.59). The predicted reproduction number increased with the case patient’s age and was highest among patients 45 years of age or older who had difficulty breathing (1.1; 95% CI, 0.4 to 3.2). Case patients who did not have difficulty breathing infected 0.05 times as many contacts (95% CI, 0.01 to 0.3) as other case patients did. Serologic testing of 1863 asymptomatic contacts revealed no infections. Spouses of case patients were more often infected (8 of 56 [14%]) than other close family members (7 of 547 [1.3%]) or other contacts (18 of 1996 [0.9%]). The risk of infection increased with increased duration of exposure of the contacts (adjusted odds ratio for exposure of >48 hours vs. ≤1 hour, 13; 95% CI, 2.6 to 62) and with exposure to body fluids (adjusted odds ratio, 4.3; 95% CI, 1.6 to 11). Conclusions: Increasing age and respiratory symptoms were indicators of infectivity of Nipah virus. Interventions to control person-to-person transmission should aim to reduce exposure to body fluids. (Funded by the National Institutes of Health and others.)
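
    The point estimates quoted in this abstract can be checked from the reported counts. The following is a minimal Python sketch, assuming the reproduction number is the crude ratio of person-to-person cases to all cases; the helper name `proportion` is illustrative and not from the paper, and the confidence intervals and adjusted odds ratios are not reproduced because they need the underlying data.

```python
def proportion(events: int, total: int) -> float:
    """Return events / total, rejecting a non-positive denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return events / total


# Counts as reported in the abstract.
secondary_cases = 82   # cases caused by person-to-person transmission
all_cases = 248        # all Nipah virus cases identified

# Crude reproduction number: average secondary cases per case patient.
reproduction_number = proportion(secondary_cases, all_cases)

# Proportion of contacts infected, by relationship to the case patient.
attack_proportions = {
    "spouses": proportion(8, 56),
    "other close family": proportion(7, 547),
    "other contacts": proportion(18, 1996),
}

print(f"reproduction number: {reproduction_number:.2f}")
for group, value in attack_proportions.items():
    print(f"{group}: {value:.1%}")
```

    The ratios round to the figures given in the abstract (0.33, 14%, 1.3%, and 0.9%).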

    Changing Contact Patterns Over Disease Progression: Nipah Virus as a Case Study

    Contact patterns play a key role in disease transmission, and variation in contacts during the course of illness can influence transmission, particularly when accompanied by changes in host infectiousness. We used surveys among 1642 contacts of 94 Nipah virus case patients in Bangladesh to determine how contact patterns (physical and with bodily fluids) changed as disease progressed in severity. The number of contacts increased with severity and, for case patients who died, peaked on the day of death. Given transmission has only been observed among fatal cases of Nipah virus infection, our findings suggest that changes in contact patterns during illness contribute to risk of infection.

    Transmission of Nipah virus — 14 years of investigations in Bangladesh

    CITATION: Nikolay, B. et al. 2019. Transmission of Nipah Virus — 14 Years of Investigations in Bangladesh. New England Journal of Medicine, 380(19):1804-1814. doi:10.1056/NEJMoa1805376. The original publication is available at https://www.nejm.org/
    BACKGROUND: Nipah virus is a highly virulent zoonotic pathogen that can be transmitted between humans. Understanding the dynamics of person-to-person transmission is key to designing effective interventions. METHODS: We used data from all Nipah virus cases identified during outbreak investigations in Bangladesh from April 2001 through April 2014 to investigate case-patient characteristics associated with onward transmission and factors associated with the risk of infection among patient contacts. RESULTS: Of 248 Nipah virus cases identified, 82 were caused by person-to-person transmission, corresponding to a reproduction number (i.e., the average number of secondary cases per case patient) of 0.33 (95% confidence interval [CI], 0.19 to 0.59). The predicted reproduction number increased with the case patient’s age and was highest among patients 45 years of age or older who had difficulty breathing (1.1; 95% CI, 0.4 to 3.2). Case patients who did not have difficulty breathing infected 0.05 times as many contacts (95% CI, 0.01 to 0.3) as other case patients did. Serologic testing of 1863 asymptomatic contacts revealed no infections. Spouses of case patients were more often infected (8 of 56 [14%]) than other close family members (7 of 547 [1.3%]) or other contacts (18 of 1996 [0.9%]). The risk of infection increased with increased duration of exposure of the contacts (adjusted odds ratio for exposure of >48 hours vs. ≤1 hour, 13; 95% CI, 2.6 to 62) and with exposure to body fluids (adjusted odds ratio, 4.3; 95% CI, 1.6 to 11). CONCLUSIONS: Increasing age and respiratory symptoms were indicators of infectivity of Nipah virus. Interventions to control person-to-person transmission should aim to reduce exposure to body fluids. (Funded by the National Institutes of Health and others.)
    Publisher's version: https://www.nejm.org/doi/full/10.1056/NEJMoa1805376