    Adapting Cross-Genre Author Profiling to Language and Corpus Notebook for PAN at CLEF 2016

    This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims to identify an author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage reduces non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration at each stage differed from language to language.
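
As a rough illustration of the feature representations listed in the abstract above, the sketch below computes binary, raw-frequency, normalized-frequency, and tf-idf weights over plain character n-grams. It is not the authors' implementation: the function names are hypothetical, the n-grams are untyped, SOA is omitted, and the tf-idf variant (raw count times log(N/df), no smoothing) is an assumption.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def representations(docs, n=3):
    """Build four weightings of character n-gram counts for each document.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: the number of documents each n-gram occurs in.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    out = []
    for c in counts:
        total = sum(c.values())
        out.append({
            "binary": {g: 1 for g in c},
            "raw": dict(c),
            "normalized": {g: f / total for g, f in c.items()},
            # Assumed variant: raw count times log(N / df), no smoothing.
            "tfidf": {g: f * math.log(n_docs / df[g]) for g, f in c.items()},
        })
    return out


# Two toy documents, represented with character bigrams.
reps = representations(["abab", "abcd"], n=2)
print(reps[0]["raw"])
print(reps[0]["tfidf"])
```

With this unsmoothed variant, an n-gram that occurs in every document gets a tf-idf weight of zero, which is one reason smoothed idf formulas are often preferred in practice.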

    Prototype/topic based Clustering Method for Weblogs

    In the last 10 years, the information generated on weblog sites has increased exponentially, creating a clear need for intelligent approaches to analyse and organise this massive amount of information. In this work, we present a methodology for clustering weblog posts according to the topics discussed in them, which we derive by text analysis. We call the methodology Prototype/Topic Based Clustering; it combines a generative probabilistic model with a Self-Term Expansion methodology. Self-Term Expansion improves the representation of the data, and the generative probabilistic model identifies the relevant topics discussed in the weblogs. We modified the generative probabilistic model to exploit predefined initialisations and performed our experiments on narrow and wide domain subsets. Our approach considerably improved on the predefined baseline and alternative state-of-the-art approaches, by up to 20% in many cases. The improvement was larger on the wide domain datasets, but in both cases our results outperformed the baseline and state-of-the-art algorithms.
    The work of the third author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie, the DIANA APPLICATIONS Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Perez-Tellez, F.; Cardiff, J.; Rosso, P.; Pinto Avendaño, DE. (2016). Prototype/topic based Clustering Method for Weblogs. Intelligent Data Analysis. 20(1):47-65. https://doi.org/10.3233/IDA-150793

    DOCUMENT REPRESENTATION FOR CLUSTERING OF SCIENTIFIC ABSTRACTS

    This paper addresses the clustering of narrow-domain short texts, such as scientific abstracts. The work builds on observations made while improving a key phrase extraction algorithm: an extended stop-words list, built automatically for key phrase extraction, considerably improved the quality of the phrases extracted from scientific publications. The procedure for creating the stop-words list is described. The main objective is to investigate whether this stop-words list, together with information about the parts of speech of lexemes, can increase the performance and/or speed of clustering. In the latter case, the vocabulary used for document representation contains not all the words that occur in the collection, but only the nouns and adjectives, or their sequences, found in the documents. Two base clustering algorithms are applied: k-means and hierarchical clustering (average agglomerative method). The results show that using the extended stop-words list and an adjective-noun document representation improves both the performance and the speed of k-means clustering, whereas for the average agglomerative method a decline in quality may be observed. Using adjective-noun sequences for document representation lowers the clustering quality for both algorithms and can be justified only when a considerable reduction of feature space dimensionality is necessary.

    Automatic extraction of agendas for action from news coverage of violent conflict

    Words can make people act. Indeed, a simple phrase such as ‘Will you, please, open the window?’ can cause a person to do so. However, does this still hold if the request is communicated indirectly via mass media and addresses a large group of people? Different disciplines have approached this problem from different angles, showing that there is indeed a connection between what is being called for in the media and what people do. This dissertation, an interdisciplinary work, bridges different perspectives on the problem and explains how collective mobilisation happens, using the novel term ‘agenda for action’. It also shows how agendas for action can be extracted from text in an automated fashion using computational linguistics and machine learning. To demonstrate the potential of agendas for action, it analyses the coverage by The NYT and The Guardian of the chemical weapons crises in Syria in 2013. Katsiaryna Stalpouskaya has always been interested in applied and computational linguistics. Pursuing this interest, she joined the FP7 EU-INFOCORE project in 2014, where she was responsible for automated content analysis. Katsiaryna’s work on the project resulted in a PhD thesis, which she successfully defended at Ludwig-Maximilians-Universität München in 2019. Currently, she is working as a product owner in the field of text and data analysis.

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    On Clustering and Evaluation of Narrow Domain Short-Test Corpora

    This doctoral thesis investigates the problem of clustering a special kind of document collection: narrow-domain short texts. To carry out this task, several corpora and clustering methods were analysed. In addition, corpus evaluation measures, term selection techniques, and clustering validity measures were introduced in order to study the following problems:
    - Determining the relative difficulty of clustering a corpus and studying some of its characteristics, such as text length, domain broadness, stylometry, class imbalance, and structure.
    - Contributing to the state of the art in clustering corpora composed of narrow-domain short texts.
    The research is partially focused on "short-text clustering". This topic is considered relevant given the current and future tendency of people to use a "reduced language" made up of short texts (for example, blogs, snippets, news, and text messages such as e-mail and chat). The domain broadness of corpora is also studied. In this sense, a corpus may be considered narrow or wide if the degree of vocabulary overlap is high or low, respectively. In the categorisation task, it is quite complex to deal with narrow-domain corpora such as scientific articles, technical reports, patents, etc. The main objective of this work is to study possible strategies for dealing with the following two problems: a) the low frequencies of vocabulary terms in short texts, and b) the high vocabulary overlap associated with narrow domains.
    Although each of the above problems is challenging enough on its own, when dealing with narrow-domain short texts the complexity of the problem increases [...]
    Pinto Avendaño, DE. (2008). On Clustering and Evaluation of Narrow Domain Short-Test Corpora [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2641
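
The abstract above characterises a corpus as narrow or wide according to its degree of vocabulary overlap, without stating the measure used. The sketch below is one possible way to quantify that idea, assuming mean pairwise Jaccard overlap between document vocabularies and naive whitespace tokenisation; the function names and the choice of measure are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def vocabulary(text):
    """Return the set of lower-cased whitespace-separated tokens in text."""
    return set(text.lower().split())


def mean_vocabulary_overlap(docs):
    """Average Jaccard overlap over all pairs of document vocabularies.

    Assumed measure for illustration; higher values suggest a narrower domain.
    """
    vocabs = [vocabulary(d) for d in docs]
    scores = []
    for a, b in combinations(vocabs, 2):
        union = a | b
        # Two empty documents share nothing, so score them as 0.0.
        scores.append(len(a & b) / len(union) if union else 0.0)
    # Fewer than two documents means there are no pairs to compare.
    return sum(scores) / len(scores) if scores else 0.0


narrow = ["clustering short texts", "clustering short abstracts"]
wide = ["clustering short texts", "football match results"]
print(mean_vocabulary_overlap(narrow), mean_vocabulary_overlap(wide))
```

On these toy inputs the first pair shares two of four distinct words and the second pair shares none, so the first corpus scores higher and would count as the narrower domain under this assumed measure.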