511 research outputs found

    Challenges in discriminating profanity from hate speech

    In this study, we approach the problem of distinguishing general profanity from hate speech in social media, something which has not been widely considered. Using a new dataset annotated specifically for this task, we employ supervised classification along with a set of features that includes n-grams, skip-grams and clustering-based word representations. We apply approaches based on single classifiers as well as more advanced ensemble classifiers and stacked generalisation, achieving the best result of accuracy for this 3-class classification task. Analysis of the results reveals that discriminating hate speech and profanity is not a simple task, which may require features that capture a deeper understanding of the text not always possible with surface n-grams. The variability of gold labels in the annotated data, due to differences in the subjective adjudications of the annotators, is also an issue. Other directions for future work are discussed.
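
The surface features named in the abstract above (n-grams and skip-grams) can be illustrated with a minimal sketch. The function names `word_ngrams` and `skip_bigrams`, and the `max_skip` parameter, are hypothetical and chosen for illustration; the study's actual feature extraction is not described in this listing and may differ.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Return ordered token pairs separated by at most `max_skip` tokens.

    With max_skip=0 this reduces to ordinary bigrams (assumed definition).
    """
    pairs = []
    for i in range(len(tokens)):
        # j ranges over positions up to max_skip + 1 steps to the right of i
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d"]
bigrams = word_ngrams(tokens, 2)
one_skip = skip_bigrams(tokens, 1)
```

Skip-grams add pairs such as `("a", "c")` that plain bigrams miss, which lets a classifier see word co-occurrences that are interrupted by an intervening token.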

    Probing Local Atomic Environments to Model RNA Energetics and Structure

    Ribonucleic acids (RNA) are critical components of living systems. Understanding RNA structure and its interaction with other molecules is an essential step in understanding RNA-driven processes within the cell. Experimental techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and chemical probing methods have provided insights into RNA structures on the atomic scale. To effectively exploit experimental data and characterize features of an RNA structure, quantitative descriptors of local atomic environments are required. Here, I investigated different ways to describe RNA local atomic environments. First, I investigated the solvent-accessible surface area (SASA) as a probe of the RNA local atomic environment. SASA contains information on the level of exposure of an RNA atom to solvents and, in some cases, is highly correlated with reactivity profiles derived from chemical probing experiments. Using Bayesian/maximum entropy (BME), I was able to reweight RNA structure models based on the agreement between SASA and chemical reactivities. Next, I developed a numerical descriptor (the atomic fingerprint) that is capable of discriminating different atomic environments. Using atomic fingerprints as features enables the prediction of RNA structure and structure-related properties. Two detailed examples are discussed. Firstly, a classification model was developed to predict Mg²⁺ ion binding sites. Results indicate that the model could predict Mg²⁺ binding sites with reasonable accuracy, and it appears to outperform existing methods. Secondly, a set of models was developed to identify cavities in RNA that are likely to accommodate small-molecule ligands. The models were also used to identify bound-like conformations from an ensemble of RNA structures. The frameworks presented here provide paths to connect the local atomic environment to RNA structure, and I envision they will provide opportunities to develop novel RNA modeling tools.
    PhD, Physics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163135/1/jingrux_1.pd

    Learning Sentence-internal Temporal Relations

    In this paper we propose a data-intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesize temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like "after", which overtly signal a temporal relation. We first show that models trained on main and subordinate clauses connected with a temporal marker achieve good performance on a pseudo-disambiguation task simulating temporal inference (during testing the temporal marker is treated as unseen and the models must select the right marker from a set of possible candidates). Secondly, we assess whether the proposed approach holds promise for the semi-automatic creation of temporal annotations. Specifically, we use a model trained on noisy and approximate data (i.e., main and subordinate clauses) to predict intra-sentential relations present in TimeBank, a corpus annotated with rich temporal information. Our experiments compare and contrast several probabilistic models differing in their feature space, linguistic assumptions and data requirements. We evaluate performance against gold standard corpora and also against human subjects.

    Secure Automatic Speaker Verification Systems

    A growing number of voice-enabled devices and applications treat automatic speaker verification (ASV) as a fundamental component. However, maximum outreach for ASV in critical domains, e.g., financial services and health care, is not possible unless we overcome security breaches caused by voice cloning and replayed audio, collectively known as spoofing attacks. On the one hand, audio spoofing attacks on ASV systems strictly limit the usability of voice-enabled applications; on the other hand, the counterfeiter remains untraceable. Therefore, to overcome these vulnerabilities, a secure ASV (SASV) system is presented in this dissertation. The proposed SASV system is based on novel sign modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric bagging-based classifier ensemble. The proposed audio representation approach clusters the high- and low-frequency components in audio frames by normally distributing frequency components against a convex function. Then, neighborhood statistics are applied to capture the user-specific vocal tract information. This information is then utilized by the classifier ensemble, which is based on a weighted normalized voting rule, to detect various spoofing attacks. In contrast to existing ASV systems, the proposed SASV system detects not only the conventional spoofing attacks (i.e., voice cloning and replays) but also new attacks that are still unexplored by the research community and will need to be addressed in the future. In this regard, the concept of cloned replays is presented in this dissertation, where replayed audio contains the microphone characteristics as well as the voice cloning artifacts. This depicts the scenario in which voice cloning is applied in real time. The voice cloning artifacts suppress the microphone characteristics and thus defeat replay detection modules; similarly, the added microphone characteristics deceive voice cloning detection. Furthermore, the proposed scheme can be utilized to obtain a possible clue against the counterfeiter through a voice cloning algorithm detection module, which is also a novel concept proposed in this dissertation. The voice cloning algorithm detection module determines the voice cloning algorithm used to generate the fake audio. Overall, the proposed SASV system simultaneously verifies the bona fide speakers and detects the voice cloning attack, the cloning algorithm used to synthesize cloned audio (in the defined settings), and voice-replay attacks over the ASVspoof 2019 dataset. In addition, the proposed method detects the voice replay and cloned voice replay attacks over the VSDC dataset. Rigorous experimentation against state-of-the-art approaches also confirms the robustness of the proposed research.

    Experiments in Language Variety Geolocation and Dialect Identification

    Peer reviewed

    Exploring Optimal Voting in Native Language Identification.

    We describe the submissions entered by the National Research Council Canada in the NLI-2017 evaluation. We mainly explored the use of voting, and various ways to optimize the choice and number of voting systems. We also explored the use of features that rely on no linguistic preprocessing. Long character n-grams obtained from raw text turned out to yield the best performance on all textual input (written essays and speech transcripts). Voting ensembles turned out to produce small performance gains, with little difference between the various optimization strategies we tried. Our top systems achieved accuracies of 87% on the essay track, 84% on the speech track, and close to 92% by combining essays, speech and i-vectors in the fusion track.
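
The two ingredients mentioned in the abstract above, character n-grams from raw text and voting across systems, can be sketched as follows. This is an illustrative sketch only: `char_ngrams` and `plurality_vote` are hypothetical names, the tie-breaking rule is an assumption, and the submitted systems' actual voting and optimization strategies are not specified in this listing.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of raw text (no preprocessing)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def plurality_vote(predictions):
    """Combine per-system label lists into one label per item.

    `predictions` is a list of equal-length label lists, one per system.
    Ties are broken in favour of the earliest system in the list
    (an assumed rule, chosen only to make the sketch deterministic).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # First label, in system order, that reaches the top count
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


system_outputs = [
    ["FR", "DE", "ES"],  # system 1
    ["FR", "ES", "ES"],  # system 2
    ["IT", "DE", "ZH"],  # system 3
]
combined = plurality_vote(system_outputs)
```

Ordering the systems by held-out accuracy before voting would make the tie-break favour the strongest system, which is one simple way to choose among voters.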

    Portuguese Native Language Identification

    This study presents the first Native Language Identification (NLI) study for L2 Portuguese. We used a subset of the NLI-PT dataset, containing texts written by speakers of five different native languages: Chinese, English, German, Italian, and Spanish. We explore the linguistic annotations available in NLI-PT to extract a range of (morpho-)syntactic features and apply NLI classification methods to predict the native language of the authors. The best results were obtained using an ensemble combination of the features, achieving 54.1% accuracy.