10 research outputs found

    Effect of Tuned Parameters on a LSA MCQ Answering Model

    This paper presents the current state of a work in progress whose objective is to better understand the factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, answering (French) biology multiple-choice questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. Dedicated software has been designed to fine-tune the LSA semantic space for the multiple-choice question task. With optimal parameters, the performance of our simple model is, quite surprisingly, equal or superior to that of 7th- and 8th-grade students. This indicates that the semantic spaces were quite good despite their low dimensionality and the small sizes of the training data sets. In addition, we present an original global entropy weighting of the terms in each question's answers, which was necessary for the model's success. Comment: 9 pages
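    The pipeline this abstract describes (a weighted term-document matrix, a truncated singular space, and comparison of question and answer vectors) can be sketched as follows. This is a generic LSA sketch, not the paper's tuned configuration: the log-entropy weighting and fold-in shown here are standard LSA choices standing in for the paper's own weighting.

```python
import numpy as np

def log_entropy_weight(counts):
    """Log-entropy weighting: local log(1 + tf) scaled by a global
    entropy weight in [0, 1]; terms spread evenly get weight 0."""
    counts = np.asarray(counts, dtype=float)      # terms x documents
    n_docs = counts.shape[1]
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    ent = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log(n_docs)
    return np.log1p(counts) * g[:, None]

def lsa_space(term_doc, k):
    """Truncate the SVD to the k largest singular triplets."""
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k]

def fold_in(term_vec, u, s):
    """Project a weighted term vector (a question or a candidate
    answer) into the k-dimensional semantic space."""
    return term_vec @ u / s

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```

    An answer is then chosen by folding the question and each candidate answer into the space and keeping the candidate with the highest cosine to the question.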

    Causal Latent Semantic Analysis (cLSA): An Illustration

    Article discussing an illustration of causal latent semantic analysis (cLSA)

    ARGUMENTS FOR THE VOTING DECISIONS OF DEPUTIES DURING THE IMPEACHMENT VOTE

    Advances in techniques for analyzing unstructured data can help us better understand the positions and votes of the politicians who represent a population. This article analyzes the underlying semantic relationship between the themes present in the arguments that parliamentarians from different political parties gave for their voting decisions. For this, it uses the speech data of all the deputies during the impeachment vote, which took place in 2015. Weiss's (1983) perspective on the decision-making of politicians and Festinger's (1957) theory of cognitive dissonance were used as the theoretical basis for the analysis. Additionally, using LSA (Latent Semantic Analysis), a text mining technique based on matrix decomposition, it aims to contribute to the analysis by reporting the main associated terms and the use of certain words in the political context. It was found that, for the case presented, the deputies' discourse is not an element that distinguishes the different voting groups, which indicates that, in order to understand a politician's position and better choose their representative, citizens need to go beyond the politicians' discourse.
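    The kind of term-association query LSA supports here can be illustrated with a minimal sketch: given a term-document count matrix, represent each term by its scaled left singular vectors and return its nearest neighbors by cosine similarity. The vocabulary and counts below are made up for illustration and do not come from the deputies' speeches.

```python
import numpy as np

def term_vectors(term_doc, k):
    """k-dimensional LSA representation of each term (one row per term)."""
    u, s, _ = np.linalg.svd(np.asarray(term_doc, float), full_matrices=False)
    return u[:, :k] * s[:k]

def top_associates(term_doc, vocab, term, k=2, n=3):
    """The n terms whose latent vectors are closest (by cosine) to `term`."""
    tv = term_vectors(term_doc, k)
    i = vocab.index(term)
    norms = np.linalg.norm(tv, axis=1) + 1e-12
    sims = tv @ tv[i] / (norms * norms[i])
    order = [j for j in np.argsort(-sims) if j != i]
    return [vocab[j] for j in order[:n]]
```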

    Bibliographic section


    Text mining for social sciences: new approaches

    The rise of the Internet has brought an important change in the way we look at the world, and hence in the way we measure it. In June 2018, more than 55% of the world's population had Internet access. It follows that every day we are able to quantify what more than four billion people do, and how and when they do it. This means data. The availability of all these data raises more than one question: How do we manage them? How do we process them? How do we extract information from them? Now more than ever we need to think about new rules, new methods and new procedures for handling this huge amount of data, which is unstructured, raw and messy. One of the most interesting challenges in this field concerns the implementation of processes for deriving information from textual sources, a process known as Text Mining. Born in the mid-90s, Text Mining is a prolific field which has evolved, thanks to technological progress, from Automatic Text Analysis, a set of methods for the description and analysis of documents. Textual data, even when transformed into a structured format, present several critical issues, as they are characterized by high dimensionality and noise. Moreover, online texts, such as social media posts or blog comments, are most of the time very short, which makes the encoded matrices even sparser. All these issues call for new and advanced solutions for treating Web data, able to overcome these difficulties while still returning the information contained in the texts. The objective is to propose a fast and scalable method able to deal with the characteristics of online texts, and thus with big and sparse matrices. To that end, we propose a procedure that runs from the collection of texts to the interpretation of the results.
    The innovative parts of this procedure are the choice of the weighting scheme for the term-document matrix and the co-clustering approach for data classification. To verify the validity of the procedure, we test it in two real applications: one concerning safety and health at work, and another regarding the Brexit vote. It will be shown how the technique works on different types of texts, allowing us to obtain meaningful results. For the reasons described above, in this research work we implement and test on real datasets a new procedure for content analysis of textual data, using a two-way approach in the Text Clustering field. As will be shown in the following pages, Text Clustering is an unsupervised classification process that reproduces the internal structure of the data by dividing the texts into groups on the basis of lexical similarities. Text Clustering is mostly used for content analysis, and it may be applied to the classification of words, of documents, or of both; in the latter case we speak of two-way clustering, the specific approach implemented in this research work. The research work is divided into two parts: a first part on theory and a second on application. The first part contains a preliminary chapter reviewing the literature on Automatic Text Analysis in the context of the data revolution, and a second chapter where the new procedure for text co-clustering is proposed. The second part regards the application of the proposed techniques to two different sets of texts, one composed of news items and the other of tweets. The idea is to test the same procedure on different types of texts, in order to verify the validity and robustness of the method
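    The two-way (co-)clustering idea can be illustrated with Dhillon's spectral co-clustering of a term-document matrix, shown here for the two-cluster case: normalize the matrix by row and column sums, take the second singular vectors, and read the partition of documents and terms off their signs. This is a textbook sketch, not the thesis's own weighting scheme or procedure.

```python
import numpy as np

def spectral_copartition(A):
    """Jointly bipartition rows (documents) and columns (terms) of a
    nonnegative matrix via spectral co-clustering (2-cluster case)."""
    A = np.asarray(A, float)
    d1 = np.maximum(A.sum(axis=1), 1e-12)      # row sums
    d2 = np.maximum(A.sum(axis=0), 1e-12)      # column sums
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    u, s, vt = np.linalg.svd(An, full_matrices=False)
    # The second left/right singular vectors carry the partition signal;
    # the first pair is the trivial (all-positive) solution.
    z_rows = u[:, 1] / np.sqrt(d1)
    z_cols = vt[1] / np.sqrt(d2)
    return z_rows >= 0, z_cols >= 0
```

    With more than two clusters one would instead run k-means on several singular vectors of the normalized matrix.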

    Clustering of scientific fields by integrating text mining and bibliometrics.

    The increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, create enormous opportunities for mapping science and technology. The continuing growth of available computing power and the development of new algorithms contribute to this as well. Important challenges remain, however. This dissertation confirms the hypothesis that the accuracy of both the clustering of scientific fields and the classification of publications can still be improved by integrating text mining and bibliometrics. Both the textual and the bibliometric approach have advantages and drawbacks, and each offers a different view of a corpus of scientific publications or patents. On the one hand, a wealth of textual information is present in such documents; on the other hand, the citations between them form large networks that provide additional information. We integrate both viewpoints and show how existing textual and bibliometric methods can be improved. The dissertation is organized in three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge contained in texts. We introduce and demonstrate a text mining framework, as well as the use of agglomerative hierarchical clustering. We also investigate the relation between clustering performance on the one hand and the desired number of clusters and the number of factors in latent semantic indexing on the other, and we describe a composite, semi-automatic strategy for determining the number of clusters in a collection of documents. Second, we treat networks consisting of citations between scientific documents, and networks arising from collaboration between authors.
    Such networks can be analyzed with techniques from bibliometrics and graph theory, with the aim of ranking relevant entities, clustering, and community detection. Third, we demonstrate the complementarity of text mining and bibliometrics, and we propose ways to integrate both worlds properly. The performance of unsupervised clustering and of classification improves significantly when the textual content of scientific publications is combined with the structure of citation networks. A method based on statistical meta-analysis achieves the best results, outperforming methods based solely on text or on citations. Our integrated or hybrid strategies for information retrieval and clustering are demonstrated in two domain studies. The goal of the first study is to unravel and visualize the concept structure of the information sciences and to assess the added value of the hybrid method. The second study covers the cognitive structure, bibliometric properties and dynamics of bioinformatics; here we develop a method for dynamic, integrated clustering of evolving bibliographic corpora, which compares and tracks clusters over time. In summary, for the complementary text and network worlds we design a hybrid clustering method that takes both paradigms into account simultaneously, and we show that the integrated view yields a better understanding of the structure and evolution of scientific fields.
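    One simple way to combine the two views is a convex combination of textual similarity and citation-graph similarity, followed by a grouping step over the combined matrix. The sketch below is only an illustration of that idea; the meta-analysis-based integration the abstract describes is more sophisticated, and the threshold-based grouping stands in for the hierarchical clustering actually used.

```python
import numpy as np

def cosine_sim(X):
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn @ Xn.T

def hybrid_similarity(text_vecs, cite_adj, lam=0.5):
    """Convex combination of textual cosine similarity and a
    symmetrized, max-normalized citation graph."""
    C = np.asarray(cite_adj, float)
    C = C + C.T
    C = C / np.maximum(C.max(), 1e-12)
    return lam * cosine_sim(np.asarray(text_vecs, float)) + (1.0 - lam) * C

def threshold_clusters(S, tau):
    """Group documents whose combined similarity reaches tau
    (union-find over the thresholded similarity graph)."""
    n = len(S)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] >= tau:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```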