10 research outputs found

    Effect of Tuned Parameters on a LSA MCQ Answering Model

    This paper presents the current state of a work in progress whose objective is to better understand the factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, answering (French) biology multiple-choice questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. Dedicated software has been designed to fine-tune the LSA semantic space for the multiple-choice question task. With optimal parameters, the performance of our simple model is, quite surprisingly, equal or superior to that of 7th- and 8th-grade students. This indicates that the semantic spaces were quite good despite their low dimensionality and the small sizes of the training data sets. In addition, we present an original global entropy weighting of the terms in each question's answers, which was necessary for the model's success. Comment: 9 pages
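    The pipeline this abstract describes (a weighted term-document matrix, a truncated singular space, and comparison of question and answer vectors) can be sketched as follows. This is a generic LSA sketch, not the paper's tuned configuration: the log-entropy weighting and fold-in shown here are standard LSA choices standing in for the paper's own weighting.

```python
import numpy as np

def log_entropy_weight(counts):
    """Log-entropy weighting: local log(1 + tf) scaled by a global
    entropy weight in [0, 1]; terms spread evenly get weight 0."""
    counts = np.asarray(counts, dtype=float)      # terms x documents
    n_docs = counts.shape[1]
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    ent = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log(n_docs)
    return np.log1p(counts) * g[:, None]

def lsa_space(term_doc, k):
    """Truncate the SVD to the k largest singular triplets."""
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k]

def fold_in(term_vec, u, s):
    """Project a weighted term vector (a question or a candidate
    answer) into the k-dimensional semantic space."""
    return term_vec @ u / s

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```

    An answer is then chosen by folding the question and each candidate answer into the space and keeping the candidate with the highest cosine to the question.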

    Causal Latent Semantic Analysis (cLSA): An Illustration

    Article discussing an illustration of causal latent semantic analysis (cLSA)

    ARGUMENTS FOR THE VOTING DECISIONS OF DEPUTIES DURING THE IMPEACHMENT VOTE

    Advances in techniques for analyzing unstructured data can help us better understand the positions and votes of the politicians who represent a population. This article analyzes the underlying semantic relationship between the themes present in the arguments that parliamentarians from different political parties gave for their voting decisions. For this, it uses the speech data of all the deputies during the impeachment vote, which took place in 2015. Weiss's (1983) perspective on the decision-making of politicians and Festinger's (1957) theory of cognitive dissonance were used as the theoretical basis for the analysis. Additionally, using LSA (Latent Semantic Analysis), a text mining technique based on matrix decomposition, it aims to contribute to the analysis by reporting the main associated terms and the use of certain words in the political context. It was found that, for the case presented, the deputies' discourse is not an element that distinguishes the different voting groups, which indicates that, in order to understand a politician's position and better choose their representative, citizens need to go beyond the politicians' discourse.
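    The kind of term-association query LSA supports here can be illustrated with a minimal sketch: given a term-document count matrix, represent each term by its scaled left singular vectors and return its nearest neighbors by cosine similarity. The vocabulary and counts below are made up for illustration and do not come from the deputies' speeches.

```python
import numpy as np

def term_vectors(term_doc, k):
    """k-dimensional LSA representation of each term (one row per term)."""
    u, s, _ = np.linalg.svd(np.asarray(term_doc, float), full_matrices=False)
    return u[:, :k] * s[:k]

def top_associates(term_doc, vocab, term, k=2, n=3):
    """The n terms whose latent vectors are closest (by cosine) to `term`."""
    tv = term_vectors(term_doc, k)
    i = vocab.index(term)
    norms = np.linalg.norm(tv, axis=1) + 1e-12
    sims = tv @ tv[i] / (norms * norms[i])
    order = [j for j in np.argsort(-sims) if j != i]
    return [vocab[j] for j in order[:n]]
```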

    Bibliographic section


    Text mining for social sciences: new approaches

    The rise of the Internet has brought an important change in the way we look at the world, and hence in the way we measure it. In June 2018, more than 55% of the world's population had Internet access. It follows that every day we are able to quantify what more than four billion people do, and how and when they do it. This means data. The availability of all these data raises more than one question: How do we manage them? How do we process them? How do we extract information from them? Now more than ever we need to think about new rules, new methods and new procedures for handling this huge amount of data, which is unstructured, raw and messy. One of the most interesting challenges in this field concerns the implementation of processes for deriving information from textual sources, a process known as Text Mining. Born in the mid-90s, Text Mining is a prolific field which has evolved, thanks to technological progress, from Automatic Text Analysis, a set of methods for the description and analysis of documents. Textual data, even when transformed into a structured format, present several critical issues, as they are characterized by high dimensionality and noise. Moreover, online texts, such as social media posts or blog comments, are most of the time very short, which makes the encoded matrices even sparser. All these issues call for new and advanced solutions for treating Web data, able to overcome these difficulties while still returning the information contained in the texts. The objective is to propose a fast and scalable method able to deal with the characteristics of online texts, and thus with big and sparse matrices. To that end, we propose a procedure that runs from the collection of texts to the interpretation of the results.
    The innovative parts of this procedure are the choice of the weighting scheme for the term-document matrix and the co-clustering approach for data classification. To verify the validity of the procedure, we test it in two real applications: one concerning safety and health at work, and another regarding the Brexit vote. It will be shown how the technique works on different types of texts, allowing us to obtain meaningful results. For the reasons described above, in this research work we implement and test on real datasets a new procedure for content analysis of textual data, using a two-way approach in the Text Clustering field. As will be shown in the following pages, Text Clustering is an unsupervised classification process that reproduces the internal structure of the data by dividing the texts into groups on the basis of lexical similarities. Text Clustering is mostly used for content analysis, and it may be applied to the classification of words, of documents, or of both; in the latter case we speak of two-way clustering, the specific approach implemented in this research work. The research work is divided into two parts: a first part on theory and a second on application. The first part contains a preliminary chapter reviewing the literature on Automatic Text Analysis in the context of the data revolution, and a second chapter where the new procedure for text co-clustering is proposed. The second part regards the application of the proposed techniques to two different sets of texts, one composed of news items and the other of tweets. The idea is to test the same procedure on different types of texts, in order to verify the validity and robustness of the method
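    The two-way (co-)clustering idea can be illustrated with Dhillon's spectral co-clustering of a term-document matrix, shown here for the two-cluster case: normalize the matrix by row and column sums, take the second singular vectors, and read the partition of documents and terms off their signs. This is a textbook sketch, not the thesis's own weighting scheme or procedure.

```python
import numpy as np

def spectral_copartition(A):
    """Jointly bipartition rows (documents) and columns (terms) of a
    nonnegative matrix via spectral co-clustering (2-cluster case)."""
    A = np.asarray(A, float)
    d1 = np.maximum(A.sum(axis=1), 1e-12)      # row sums
    d2 = np.maximum(A.sum(axis=0), 1e-12)      # column sums
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    u, s, vt = np.linalg.svd(An, full_matrices=False)
    # The second left/right singular vectors carry the partition signal;
    # the first pair is the trivial (all-positive) solution.
    z_rows = u[:, 1] / np.sqrt(d1)
    z_cols = vt[1] / np.sqrt(d2)
    return z_rows >= 0, z_cols >= 0
```

    With more than two clusters one would instead run k-means on several singular vectors of the normalized matrix.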

    Clustering of scientific fields by integrating text mining and bibliometrics.

    The increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, create enormous opportunities for mapping science and technology. The continuing growth of available computing power and the development of new algorithms contribute to this as well. Important challenges remain, however. This dissertation confirms the hypothesis that the accuracy of both the clustering of scientific fields and the classification of publications can still be improved by integrating text mining and bibliometrics. Both the textual and the bibliometric approach have advantages and drawbacks, and each offers a different view of a corpus of scientific publications or patents. On the one hand, a wealth of textual information is present in such documents; on the other hand, the citations between them form large networks that provide additional information. We integrate both viewpoints and show how existing textual and bibliometric methods can be improved. The dissertation is organized in three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge contained in texts. We introduce and demonstrate a text mining framework, as well as the use of agglomerative hierarchical clustering. We also investigate the relation between clustering performance on the one hand and the desired number of clusters and the number of factors in latent semantic indexing on the other, and we describe a composite, semi-automatic strategy for determining the number of clusters in a collection of documents. Second, we treat networks consisting of citations between scientific documents, and networks arising from collaboration between authors.
    Such networks can be analyzed with techniques from bibliometrics and graph theory, with the aim of ranking relevant entities, clustering, and community detection. Third, we demonstrate the complementarity of text mining and bibliometrics, and we propose ways to integrate both worlds properly. The performance of unsupervised clustering and of classification improves significantly when the textual content of scientific publications is combined with the structure of citation networks. A method based on statistical meta-analysis achieves the best results, outperforming methods based solely on text or on citations. Our integrated or hybrid strategies for information retrieval and clustering are demonstrated in two domain studies. The goal of the first study is to unravel and visualize the concept structure of the information sciences and to assess the added value of the hybrid method. The second study covers the cognitive structure, bibliometric properties and dynamics of bioinformatics; here we develop a method for dynamic, integrated clustering of evolving bibliographic corpora, which compares and tracks clusters over time. In summary, for the complementary text and network worlds we design a hybrid clustering method that takes both paradigms into account simultaneously, and we show that the integrated view yields a better understanding of the structure and evolution of scientific fields.
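    One simple way to combine the two views is a convex combination of textual similarity and citation-graph similarity, followed by a grouping step over the combined matrix. The sketch below is only an illustration of that idea; the meta-analysis-based integration the abstract describes is more sophisticated, and the threshold-based grouping stands in for the hierarchical clustering actually used.

```python
import numpy as np

def cosine_sim(X):
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn @ Xn.T

def hybrid_similarity(text_vecs, cite_adj, lam=0.5):
    """Convex combination of textual cosine similarity and a
    symmetrized, max-normalized citation graph."""
    C = np.asarray(cite_adj, float)
    C = C + C.T
    C = C / np.maximum(C.max(), 1e-12)
    return lam * cosine_sim(np.asarray(text_vecs, float)) + (1.0 - lam) * C

def threshold_clusters(S, tau):
    """Group documents whose combined similarity reaches tau
    (union-find over the thresholded similarity graph)."""
    n = len(S)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] >= tau:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```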