Network analysis of named entity co-occurrences in written texts
The use of methods borrowed from statistics and physics to analyze written
texts has allowed the discovery of unprecedented patterns of human behavior and
cognition by establishing links between model features and language structure.
While current models have been useful to unveil patterns via analysis of
syntactic and semantic networks, only a few works have probed the relevance of
investigating the structure arising from the relationships between relevant
entities such as characters, locations, and organizations. In this study, we
represent entities appearing in the same context as a co-occurrence network,
where links are established according to a null model based on random, shuffled
texts. Computational simulations performed on novels revealed that the proposed
model displays interesting topological features, such as the small-world
property, characterized by high values of the clustering coefficient. The
effectiveness of our model was verified in a practical pattern recognition task
on real networks. When compared with traditional word adjacency networks, our
model displayed better results in identifying unknown references in texts.
Because the proposed representation plays a complementary role in
characterizing unstructured documents via topological analysis of named
entities, we believe that it could be useful to improve the characterization of
written texts (and related systems), especially if combined with traditional
approaches based on statistical and deep learning paradigms.
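The core construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: all names, the window size, and the acceptance rule (keep a link only when its observed count exceeds its mean count over shuffled copies of the text) are assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def cooccurrence_counts(entities, window=5):
    """Count how often each pair of entities falls in the same sliding window."""
    counts = Counter()
    for i in range(len(entities) - window + 1):
        for pair in combinations(sorted(set(entities[i:i + window])), 2):
            counts[pair] += 1
    return counts

def cooccurrence_network(entities, window=5, shuffles=200, seed=0):
    """Null model via shuffled texts: keep a link only if its observed count
    exceeds its average count over randomly shuffled entity sequences."""
    rng = random.Random(seed)
    observed = cooccurrence_counts(entities, window)
    null = Counter()
    for _ in range(shuffles):
        shuffled = entities[:]
        rng.shuffle(shuffled)
        null.update(cooccurrence_counts(shuffled, window))
    return {pair: n for pair, n in observed.items()
            if n > null[pair] / shuffles}
```

Topological measures such as the clustering coefficient can then be computed on the surviving links.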
Identifying Authorship from Linguistic Text Patterns
Research that deals with linguistic text patterns is challenging because of the unstructured nature of text. This research presents a methodology for comparing texts to identify whether two texts were written by the same author or by different authors. The methodology includes an algorithm to analyze the proximity of texts, based upon Zipf’s Law [47][48]. The results have implications for text mining, with applications to areas such as forensics, natural language processing, and information retrieval.
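One simple way to realize a Zipf's-law-based proximity measure is to compare the rank-frequency curves of two texts. This is a hypothetical sketch, not the paper's algorithm: the function names and the mean-absolute-difference distance are assumptions.

```python
from collections import Counter

def rank_frequency(text):
    """Relative word frequencies in descending rank order (Zipf curve)."""
    words = text.lower().split()
    return [n / len(words) for _, n in Counter(words).most_common()]

def zipf_distance(text_a, text_b):
    """Mean absolute difference between the two rank-frequency curves,
    padding the shorter curve with zeros."""
    fa, fb = rank_frequency(text_a), rank_frequency(text_b)
    length = max(len(fa), len(fb))
    fa += [0.0] * (length - len(fa))
    fb += [0.0] * (length - len(fb))
    return sum(abs(x - y) for x, y in zip(fa, fb)) / length
```

Texts by the same author would be expected to yield smaller distances than texts by different authors.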
Outlier detection for multivariate categorical data
This is an Accepted Manuscript of an article published in “Quality and Reliability Engineering International” on 6 June 2018, available online: https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.2339. The detection of outlying rows in a contingency table is tackled from a Bayesian perspective, by adapting the framework adopted by Box and Tiao for normal models to multinomial models with random effects. The solution assumes a 2-component mixture of two multinomial continuous mixtures, one for the nonoutlier rows and one for the outlier rows. The method starts by estimating the distributional characteristics of the nonoutlier rows, and then performs a cluster analysis to identify which rows belong to the outlier group and which do not. The method applies to any type of contingency table and, in particular, could be used in the analysis of multivariate categorical control charts. Its use is illustrated through a simulated example and by applying it to help identify heterogeneities of style among the acts of the plays in the First Folio edition of Shakespeare's drama.
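The flavor of row-outlier detection in a contingency table can be conveyed with a much simpler frequentist surrogate: score each row by its likelihood-ratio (G) statistic against the pooled column proportions, so rows that deviate most from the common multinomial stand out. This sketch is an illustration only, not the paper's Bayesian mixture method; the function name and the G-statistic choice are assumptions.

```python
from math import log

def row_outlier_scores(table):
    """G statistic of each row against pooled column proportions.
    Large scores suggest outlying rows of the contingency table."""
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(col_totals)
    pooled = [t / grand for t in col_totals]
    scores = []
    for row in table:
        n = sum(row)
        # 2 * sum(observed * log(observed / expected)), skipping zero cells
        g = 2 * sum(x * log(x / (n * p)) for x, p in zip(row, pooled) if x > 0)
        scores.append(g)
    return scores
```

In the Bayesian formulation described above, this hard scoring is replaced by mixture-model posterior probabilities of membership in the outlier component.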
The anatomy of a collaborative writing tool for public participation in democracy
Two approaches to online collaborative writing for the formulation of norms (laws, bills) are discussed: a Wikipedia-like approach and a structured approach.
Text based classification of companies in CrunchBase
This paper introduces two fuzzy fingerprint-based text classification techniques that were successfully applied to automatically label companies from CrunchBase, based purely on their unstructured textual descriptions. This is a real and very challenging problem due to the large set of possible labels (more than 40) and to the fact that the textual descriptions do not have to abide by any criteria and are, therefore, extremely heterogeneous. Fuzzy fingerprints are a recently introduced technique that can be used to perform fast classification; they perform well in the presence of unbalanced datasets and can cope with a very large number of classes. In the paper, a comparison is performed against some of the best text classification techniques commonly used to address similar problems. When applied to the CrunchBase dataset, the fuzzy fingerprint-based approach outperformed the other techniques.
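The basic fuzzy-fingerprint idea can be sketched as follows: each class is summarized by its top-k words with fuzzy membership values that decay with rank, and a text is assigned to the class whose fingerprint it shares the most fuzzy mass with. This is a simplified sketch under assumed choices (linear membership decay, overlap scoring, the names `fingerprint` and `classify`), not the paper's exact formulation.

```python
from collections import Counter

def fingerprint(texts, k=20):
    """Class fingerprint: its top-k words mapped to fuzzy membership
    values that decay linearly with rank (1.0 at rank 0 down to 1/k)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: 1.0 - r / k for r, (w, _) in enumerate(counts.most_common(k))}

def classify(text, fingerprints):
    """Assign the label whose fingerprint overlaps the text the most."""
    words = set(text.lower().split())
    def score(fp):
        return sum(v for w, v in fp.items() if w in words)
    return max(fingerprints, key=lambda label: score(fingerprints[label]))
```

Because a fingerprint is just a small top-k dictionary per class, scoring is fast and scales naturally to many classes, which matches the properties claimed above.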