
    Network analysis of named entity co-occurrences in written texts

    The use of methods borrowed from statistics and physics to analyze written texts has allowed the discovery of unprecedented patterns of human behavior and cognition by establishing links between model features and language structure. While current models have been useful to unveil patterns via analysis of syntactic and semantic networks, only a few works have probed the relevance of investigating the structure arising from the relationships between relevant entities such as characters, locations and organizations. In this study, we represent entities appearing in the same context as a co-occurrence network, where links are established according to a null model based on random, shuffled texts. Computational simulations performed on novels revealed that the proposed model displays interesting topological features, such as the small-world property, characterized by high values of the clustering coefficient. The effectiveness of our model was verified in a practical pattern-recognition task on real networks. When compared with traditional word-adjacency networks, our model displayed optimized results in identifying unknown references in texts. Because the proposed representation plays a complementary role in characterizing unstructured documents via topological analysis of named entities, we believe that it could be useful to improve the characterization of written texts (and related systems), especially if combined with traditional approaches based on statistical and deeper paradigms.
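    The co-occurrence construction described in this abstract can be sketched in a few lines. This is a minimal illustration only: the entity lists and names below are made up, entity extraction (NER) is assumed to have already happened, and the paper's shuffled-text null model for validating links is omitted in favor of raw co-occurrence counts.

```python
from itertools import combinations
from collections import defaultdict

# Toy contexts: named entities appearing together in the same window
# (in the paper these come from NER over novels; these names are invented).
contexts = [
    ["Alice", "Bob", "London"],
    ["Alice", "London"],
    ["Bob", "Carol", "London"],
    ["Alice", "Dave"],
]

# Build an undirected co-occurrence graph as an adjacency map.
adj = defaultdict(set)
for entities in contexts:
    for u, v in combinations(sorted(set(entities)), 2):
        adj[u].add(v)
        adj[v].add(u)

def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# The small-world indicator mentioned in the abstract.
avg_clustering = sum(clustering(n) for n in adj) / len(adj)
```

    A real pipeline would compare each edge's count against its frequency in shuffled versions of the text and keep only edges that exceed the null expectation.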

    Identifying Authorship from Linguistic Text Patterns

    Research that deals with linguistic text patterns is challenging because of the unstructured nature of text. This research presents a methodology to compare texts in order to identify whether two texts were written by the same author or by different authors. The methodology includes an algorithm to analyze the proximity of texts, which is based upon Zipf’s Law [47][48]. The results have implications for text mining, with applications to areas such as forensics, natural language processing, and information retrieval.
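    One plausible rank-based proximity measure in the spirit of this abstract can be sketched as follows. This is not the paper's actual algorithm (which is not specified here); it simply compares the Zipfian frequency ranks of the shared vocabulary of two texts, a hypothetical stand-in for illustration.

```python
from collections import Counter

def freq_ranks(text):
    """Map each word to its frequency rank (1 = most frequent);
    ties are broken alphabetically to keep results deterministic."""
    counts = Counter(text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}

def proximity(a, b):
    """Mean absolute rank difference over the shared vocabulary
    (lower = closer). A toy Zipf-style comparison, not the paper's."""
    ra, rb = freq_ranks(a), freq_ranks(b)
    shared = set(ra) & set(rb)
    if not shared:
        return float("inf")
    return sum(abs(ra[w] - rb[w]) for w in shared) / len(shared)
```

    Two texts by the same author would be expected to rank their common words similarly, yielding a small proximity score.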

    Outlier detection for multivariate categorical data

    This is an Accepted Manuscript of an article published by Wiley in “Quality and Reliability Engineering International” on 6 June 2018, available online at https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.2339. The detection of outlying rows in a contingency table is tackled from a Bayesian perspective, by adapting the framework adopted by Box and Tiao for normal models to multinomial models with random effects. The solution assumes a two-component mixture model of multinomial continuous mixtures, one component for the nonoutlier rows and the other for the outlier rows. The method starts by estimating the distributional characteristics of the nonoutlier rows, and then performs cluster analysis to identify which rows belong to the outlier group and which do not. The method applies to any type of contingency table and, in particular, could be used in the analysis of multivariate categorical control charts. Here, the use of the method is illustrated through a simulated example and by applying it to help identify heterogeneities of style among the acts in the plays of the First Folio edition of Shakespeare's drama.
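    The idea of flagging a row whose category profile departs from the rest of the table can be illustrated with a much simpler frequentist stand-in for the paper's Bayesian mixture: score each row by its Pearson chi-square contribution under the independence model. The table below is fabricated for illustration.

```python
# Made-up contingency table: rows are units (e.g. acts of a play),
# columns are category counts. Row 3 has a deliberately anomalous profile.
table = [
    [20, 30, 50],
    [22, 28, 50],
    [21, 31, 48],
    [60,  5, 35],
]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
grand = sum(row_tot)

def row_chi2(i):
    """Chi-square contribution of row i against expected counts
    under independence: E[i][j] = row_tot[i] * col_tot[j] / grand."""
    total = 0.0
    for j, ct in enumerate(col_tot):
        expected = row_tot[i] * ct / grand
        total += (table[i][j] - expected) ** 2 / expected
    return total

scores = [row_chi2(i) for i in range(len(table))]
outlier = max(range(len(table)), key=lambda i: scores[i])
```

    The paper's method instead estimates the nonoutlier distribution first and clusters rows into the two mixture components, which handles random row effects that a plain chi-square score ignores.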

    The anatomy of a collaborative writing tool for public participation in democracy

    Two approaches to online collaborative writing for the formulation of norms (laws, bills) are discussed: a Wikipedia-like approach and a structured approach.

    Text based classification of companies in CrunchBase

    This paper introduces two fuzzy fingerprint-based text classification techniques that were successfully applied to automatically label companies from CrunchBase, based purely on their unstructured textual descriptions. This is a real and very challenging problem due to the large set of possible labels (more than 40) and to the fact that the textual descriptions do not have to abide by any criteria and are, therefore, extremely heterogeneous. Fuzzy fingerprints are a recently introduced technique that can be used to perform fast classification. They perform well in the presence of unbalanced datasets and can cope with a very large number of classes. In the paper, a comparison is performed against some of the best text classification techniques commonly used to address similar problems. When applied to the CrunchBase dataset, the fuzzy fingerprint-based approach outperformed the other techniques.
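    The core fuzzy fingerprint idea can be sketched as follows: each class is summarized by its top-k terms, each term carries a fuzzy weight that decays with rank, and a new text is assigned to the class whose fingerprint it overlaps most. The training snippets, labels, and the linear weight function below are illustrative assumptions, not the paper's exact configuration (real systems use far larger fingerprints and corpora).

```python
from collections import Counter

K = 5  # fingerprint size; illustrative only, real systems use much larger K

def fingerprint(texts, k=K):
    """Top-k terms for a class, weighted by rank: weight = 1 - rank/k."""
    counts = Counter(w for t in texts for w in t.lower().split())
    top = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    return {w: 1 - i / k for i, w in enumerate(top)}

def score(text, fp):
    """Fuzzy similarity: sum of fingerprint weights of terms in the text."""
    words = set(text.lower().split())
    return sum(w for term, w in fp.items() if term in words)

# Tiny made-up training set with two of the many CrunchBase-style labels.
train = {
    "fintech": ["payments bank mobile payments app",
                "bank loans credit payments"],
    "biotech": ["drug clinical trials genome",
                "genome sequencing drug discovery"],
}
fps = {label: fingerprint(texts) for label, texts in train.items()}

def classify(description):
    """Assign the label whose fingerprint best matches the description."""
    return max(fps, key=lambda label: score(description, fps[label]))
```

    Because a fingerprint is just a small top-k table per class, scoring is fast and degrades gracefully when classes are unbalanced, which matches the properties claimed in the abstract.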