
    Network analysis of named entity co-occurrences in written texts

    The use of methods borrowed from statistics and physics to analyze written texts has allowed the discovery of unprecedented patterns of human behavior and cognition by establishing links between model features and language structure. While current models have been useful to unveil patterns via analysis of syntactic and semantic networks, only a few works have probed the relevance of investigating the structure arising from the relationships between relevant entities such as characters, locations and organizations. In this study, we represent entities appearing in the same context as a co-occurrence network, where links are established according to a null model based on random, shuffled texts. Computational simulations performed on novels revealed that the proposed model displays interesting topological features, such as the small-world property, characterized by high values of the clustering coefficient. The effectiveness of our model was verified in a practical pattern-recognition task on real networks. When compared with traditional word-adjacency networks, our model displayed optimized results in identifying unknown references in texts. Because the proposed representation plays a complementary role in characterizing unstructured documents via topological analysis of named entities, we believe that it could be useful to improve the characterization of written texts (and related systems), especially if combined with traditional approaches based on statistical and deeper paradigms.
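    The co-occurrence construction described in this abstract can be sketched in a few lines. This is a minimal illustration only: the entity lists and names below are made up, entity extraction (NER) is assumed to have already happened, and the paper's shuffled-text null model for validating links is omitted in favor of raw co-occurrence counts.

```python
from itertools import combinations
from collections import defaultdict

# Toy contexts: named entities appearing together in the same window
# (in the paper these come from NER over novels; these names are invented).
contexts = [
    ["Alice", "Bob", "London"],
    ["Alice", "London"],
    ["Bob", "Carol", "London"],
    ["Alice", "Dave"],
]

# Build an undirected co-occurrence graph as an adjacency map.
adj = defaultdict(set)
for entities in contexts:
    for u, v in combinations(sorted(set(entities)), 2):
        adj[u].add(v)
        adj[v].add(u)

def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# The small-world indicator mentioned in the abstract.
avg_clustering = sum(clustering(n) for n in adj) / len(adj)
```

    A real pipeline would compare each edge's count against its frequency in shuffled versions of the text and keep only edges that exceed the null expectation.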

    Identifying Authorship from Linguistic Text Patterns

    Research that deals with linguistic text patterns is challenging because of the unstructured nature of text. This research presents a methodology to compare texts in order to identify whether two texts were written by the same author or by different authors. The methodology includes an algorithm to analyze the proximity of texts, which is based upon Zipf’s Law [47][48]. The results have implications for text mining, with applications to areas such as forensics, natural language processing, and information retrieval.
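    One plausible rank-based proximity measure in the spirit of this abstract can be sketched as follows. This is not the paper's actual algorithm (which is not specified here); it simply compares the Zipfian frequency ranks of the shared vocabulary of two texts, a hypothetical stand-in for illustration.

```python
from collections import Counter

def freq_ranks(text):
    """Map each word to its frequency rank (1 = most frequent);
    ties are broken alphabetically to keep results deterministic."""
    counts = Counter(text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}

def proximity(a, b):
    """Mean absolute rank difference over the shared vocabulary
    (lower = closer). A toy Zipf-style comparison, not the paper's."""
    ra, rb = freq_ranks(a), freq_ranks(b)
    shared = set(ra) & set(rb)
    if not shared:
        return float("inf")
    return sum(abs(ra[w] - rb[w]) for w in shared) / len(shared)
```

    Two texts by the same author would be expected to rank their common words similarly, yielding a small proximity score.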

    Outlier detection for multivariate categorical data

    This is an Accepted Manuscript of an article published by Wiley in “Quality and Reliability Engineering International” on 6 June 2018, available online at https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.2339. The detection of outlying rows in a contingency table is tackled from a Bayesian perspective, by adapting the framework adopted by Box and Tiao for normal models to multinomial models with random effects. The solution assumes a two-component mixture model of multinomial continuous mixtures, one component for the nonoutlier rows and the other for the outlier rows. The method starts by estimating the distributional characteristics of the nonoutlier rows, and then performs cluster analysis to identify which rows belong to the outlier group and which do not. The method applies to any type of contingency table and, in particular, could be used in the analysis of multivariate categorical control charts. Here, the use of the method is illustrated through a simulated example and by applying it to help identify heterogeneities of style among the acts in the plays of the First Folio edition of Shakespeare's drama.
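    The idea of flagging a row whose category profile departs from the rest of the table can be illustrated with a much simpler frequentist stand-in for the paper's Bayesian mixture: score each row by its Pearson chi-square contribution under the independence model. The table below is fabricated for illustration.

```python
# Made-up contingency table: rows are units (e.g. acts of a play),
# columns are category counts. Row 3 has a deliberately anomalous profile.
table = [
    [20, 30, 50],
    [22, 28, 50],
    [21, 31, 48],
    [60,  5, 35],
]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
grand = sum(row_tot)

def row_chi2(i):
    """Chi-square contribution of row i against expected counts
    under independence: E[i][j] = row_tot[i] * col_tot[j] / grand."""
    total = 0.0
    for j, ct in enumerate(col_tot):
        expected = row_tot[i] * ct / grand
        total += (table[i][j] - expected) ** 2 / expected
    return total

scores = [row_chi2(i) for i in range(len(table))]
outlier = max(range(len(table)), key=lambda i: scores[i])
```

    The paper's method instead estimates the nonoutlier distribution first and clusters rows into the two mixture components, which handles random row effects that a plain chi-square score ignores.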

    The anatomy of a collaborative writing tool for public participation in democracy

    Two approaches to online collaborative writing for the formulation of norms (laws, bills) are discussed: a Wikipedia-like approach and a structured approach.

    Text based classification of companies in CrunchBase

    This paper introduces two fuzzy fingerprint-based text classification techniques that were successfully applied to automatically label companies from CrunchBase, based purely on their unstructured textual descriptions. This is a real and very challenging problem due to the large set of possible labels (more than 40) and to the fact that the textual descriptions do not have to abide by any criteria and are, therefore, extremely heterogeneous. Fuzzy fingerprints are a recently introduced technique that can be used to perform fast classification. They perform well in the presence of unbalanced datasets and can cope with a very large number of classes. In the paper, a comparison is performed against some of the best text classification techniques commonly used to address similar problems. When applied to the CrunchBase dataset, the fuzzy fingerprint-based approach outperformed the other techniques.
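    The core fuzzy fingerprint idea can be sketched as follows: each class is summarized by its top-k terms, each term carries a fuzzy weight that decays with rank, and a new text is assigned to the class whose fingerprint it overlaps most. The training snippets, labels, and the linear weight function below are illustrative assumptions, not the paper's exact configuration (real systems use far larger fingerprints and corpora).

```python
from collections import Counter

K = 5  # fingerprint size; illustrative only, real systems use much larger K

def fingerprint(texts, k=K):
    """Top-k terms for a class, weighted by rank: weight = 1 - rank/k."""
    counts = Counter(w for t in texts for w in t.lower().split())
    top = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    return {w: 1 - i / k for i, w in enumerate(top)}

def score(text, fp):
    """Fuzzy similarity: sum of fingerprint weights of terms in the text."""
    words = set(text.lower().split())
    return sum(w for term, w in fp.items() if term in words)

# Tiny made-up training set with two of the many CrunchBase-style labels.
train = {
    "fintech": ["payments bank mobile payments app",
                "bank loans credit payments"],
    "biotech": ["drug clinical trials genome",
                "genome sequencing drug discovery"],
}
fps = {label: fingerprint(texts) for label, texts in train.items()}

def classify(description):
    """Assign the label whose fingerprint best matches the description."""
    return max(fps, key=lambda label: score(description, fps[label]))
```

    Because a fingerprint is just a small top-k table per class, scoring is fast and degrades gracefully when classes are unbalanced, which matches the properties claimed in the abstract.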