16 research outputs found

    Effect of Tuned Parameters on a LSA MCQ Answering Model

    This paper presents the current state of a work in progress whose objective is to better understand the factors that significantly influence the performance of Latent Semantic Analysis (LSA). A difficult task, answering (French) biology Multiple Choice Questions, is used to test the semantic properties of the truncated singular space and to study the relative influence of the main parameters. Dedicated software has been designed to fine-tune the LSA semantic space for the Multiple Choice Question task. With optimal parameters, the performance of our simple model is, quite surprisingly, equal or superior to that of 7th and 8th grade students. This indicates that the semantic spaces were quite good despite their low dimensionality and the small size of the training data sets. In addition, we present an original entropy-based global weighting of the terms in each question's answers, which was necessary to achieve the model's success.
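
    The entry describes building a low-dimensional LSA space with an entropy-based global weighting and using it to score candidate answers, but gives no implementation details. The following is only a minimal sketch of that general approach, assuming scikit-learn's TruncatedSVD over a log-entropy-weighted term-document matrix; the corpus, question and answers are placeholders, not the paper's data or code.

        # Minimal sketch (not the paper's implementation): an LSA space built from a
        # log-entropy-weighted term-document matrix, used to pick the candidate
        # answer most similar to the question.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.metrics.pairwise import cosine_similarity

        corpus = [
            "placeholder course paragraph about cells and membranes",
            "placeholder course paragraph about photosynthesis and light",
            "placeholder course paragraph about enzymes and reactions",
        ]

        vec = CountVectorizer()
        counts = vec.fit_transform(corpus).toarray().astype(float)   # docs x terms

        # Log-entropy weighting: local log(1 + tf), global 1 + sum(p log p) / log(n).
        p = counts / np.maximum(counts.sum(axis=0), 1e-12)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        global_w = 1.0 + plogp.sum(axis=0) / np.log(len(corpus))
        weighted = np.log1p(counts) * global_w

        svd = TruncatedSVD(n_components=2)     # low-dimensional "semantic space"
        svd.fit(weighted)

        def embed(text):
            """Fold a new text into the truncated singular space."""
            tf = vec.transform([text]).toarray().astype(float)
            return svd.transform(np.log1p(tf) * global_w)

        def pick_answer(question, answers):
            """Return the index of the answer closest to the question by cosine."""
            q = embed(question)
            sims = [cosine_similarity(q, embed(a))[0, 0] for a in answers]
            return int(np.argmax(sims))

        print(pick_answer("what do enzymes do", ["speed up reactions", "absorb light"]))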

    Text Mining Technology

    Text mining, also known as text data mining or knowledge discovery from textual databases, generally refers to the process of extracting information of interest and non-trivial patterns, or discovering knowledge, from unstructured text documents. It can be seen as an extension of data mining or of knowledge discovery in structured databases. Since much information (more than 80%) is stored in text format, text mining techniques are believed to have great commercial value. The objective of this tutorial is to present some text mining techniques, as well as use cases and the results obtained.

    A tree based keyphrase extraction technique for academic literature

    Automatic keyphrase extraction techniques aim to extract quality keyphrases that summarize a document at a higher level. Among the existing techniques, some are domain-specific and require application-domain knowledge, some are based on higher-order statistical methods and are computationally expensive, and some require large training data, which is rare for many applications. To overcome these issues, this thesis proposes a new unsupervised automatic keyphrase extraction technique, named TeKET (Tree-based Keyphrase Extraction Technique), which is domain-independent, employs limited statistical knowledge, and requires no training data. The proposed technique also introduces a new variant of the binary tree, called the KeyPhrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. Depending on the candidate keyphrases, the KePhEx tree is expanded, shrunk, or left unchanged. In addition, a measure called the Cohesiveness Index (CI) is derived, denoting the degree of cohesiveness of a given node with respect to the root; it is used to extract final keyphrases from the resulting tree in a flexible manner and, alongside Term Frequency, to rank them. The effectiveness of the proposed technique is evaluated experimentally on a benchmark corpus, SemEval-2010, with a total of 244 training and test articles, and compared with other relevant unsupervised techniques, taking representatives of both statistical (Term Frequency-Inverse Document Frequency and YAKE) and graph-based techniques (PositionRank, CollabRank (SingleRank), TopicRank, and MultipartiteRank) into account. Three evaluation metrics, namely precision, recall and F1 score, are considered in the experiments. The obtained results demonstrate the improved performance of the proposed technique over similar techniques in terms of precision, recall, and F1 score.
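
    The abstract does not define the KePhEx tree construction or the Cohesiveness Index precisely, so they cannot be reproduced from it. As a loose, generic stand-in, the sketch below only shows the usual shape of an unsupervised pipeline of this kind: stop-word-delimited candidate phrases ranked by a frequency-based score. It is explicitly not TeKET; the stop-word list and scoring rule are invented for illustration.

        # Generic unsupervised keyphrase pipeline (candidate extraction + ranking).
        # This is NOT the TeKET/KePhEx algorithm, only a simplified stand-in that
        # ranks candidate phrases by average word frequency.
        import re
        from collections import Counter

        STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for"}

        def candidate_phrases(text):
            """Split on stop words and punctuation; the chunks left over are candidates."""
            tokens = re.findall(r"[a-z0-9-]+", text.lower())
            phrases, current = [], []
            for tok in tokens:
                if tok in STOPWORDS:
                    if current:
                        phrases.append(" ".join(current))
                        current = []
                else:
                    current.append(tok)
            if current:
                phrases.append(" ".join(current))
            return phrases

        def rank_keyphrases(text, top_n=5):
            """Rank candidates by the average frequency of their words (a crude proxy
            for the frequency-plus-cohesiveness ranking described in the abstract)."""
            phrases = candidate_phrases(text)
            tf = Counter(w for p in phrases for w in p.split())
            scored = {p: sum(tf[w] for w in p.split()) / len(p.split()) for p in set(phrases)}
            return sorted(scored, key=scored.get, reverse=True)[:top_n]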

    Two approaches to extensive reading and their effects on L2 vocabulary development

    One avenue for developing second language (L2) vocabulary knowledge is through Extensive Reading (ER). ER can provide opportunities for incidental learning to occur. Class time is often too restricted for sufficient attention to deliberate learning (Hunt & Beglar, 2005), meaning ER is important for L2 vocabulary development. This article builds on ideas in the recent two-part Reading in a Foreign Language ER discussion forum by investigating two implementations of ER and their effects on L2 vocabulary development: a traditional ER-only approach, and an ER-plus approach which supplements ER with post-reading discussion in small groups. L2 English learners enrolled at a university in Aotearoa New Zealand read five graded readers during normal class time. Latent Semantic Analysis was used to measure the development of word association knowledge of 60 target words. The findings revealed facilitative effects of both ER approaches; supplementing ER with discussion provided opportunities for further development.

    Duplicate Defect Detection

    Discovering and fixing faults is an unavoidable process in Software Engineering. It is always good practice to document and organize fault reports, as this improves the effectiveness of the development and maintenance process. Bug tracking repositories, such as Bugzilla, are designed to provide fault-reporting facilities for developers, testers and users of the system. Allowing anyone to contribute by finding and reporting faults has an immediate impact on software quality. However, this benefit comes with one side effect: users often file reports that describe the same fault. This increases the triaging time spent by the maintainers, and at the same time the important information required to fix the fault is likely to be distributed across different reports. The objective of this thesis is twofold. First, we want to understand the dynamics of bug report filing for a large, long-duration open source project, Firefox. Second, we present a new approach that can reduce the number of duplicate reports. The novel element in the proposed approach is the ability to concentrate the search for duplicates on specific portions of the bug repository, which improves the performance of the Information Retrieval techniques and the classification runtime of our algorithm. Our system can be deployed as a search tool to help reporters query the repository, or it can be adopted to help maintainers detect duplicate reports. In both cases the performance is satisfactory: when tested as a search tool our system is able to detect up to 53% of duplicate reports, and the approach adapted for maintainers has a maximum recall rate of 59%.
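
    The abstract names Information Retrieval techniques and a restricted search space but not the exact method, so the following is only a rough sketch of the general idea: retrieve candidate duplicates with TF-IDF cosine similarity, searching only reports filed against the same component as a stand-in for the "specific portions of the repository". The repository contents and the component field are placeholders.

        # Rough sketch: retrieve likely duplicate bug reports with TF-IDF + cosine
        # similarity, searching only reports filed against the same component.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Placeholder repository: (component, summary + description)
        repository = [
            ("layout", "Page renders blank after resizing the window"),
            ("layout", "Blank page shown when window is resized quickly"),
            ("network", "Download stalls on large files over HTTP/2"),
        ]

        def candidate_duplicates(new_component, new_text, top_n=2):
            pool = [(i, text) for i, (comp, text) in enumerate(repository)
                    if comp == new_component]          # narrow the search space
            if not pool:
                return []
            ids, texts = zip(*pool)
            vec = TfidfVectorizer(stop_words="english")
            matrix = vec.fit_transform(list(texts) + [new_text])
            sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
            ranked = sorted(zip(ids, sims), key=lambda x: x[1], reverse=True)
            return ranked[:top_n]

        print(candidate_duplicates("layout", "Window resize leaves the page blank"))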

    Evaluation of the Effectiveness of Cosine Similarity in Predicting Relevance between Paired Citing and Cited Sentences.

    Citation analysis has a long history in Information Science. We examined the potential of cosine similarity to predict relevance between citing sentences and the articles they cite. An expert evaluated 22,697 pairs of cited and citing sentences and marked 544 as relevant to one another. Cosine similarity gave 8,386 of these pairs a similarity score over zero, which included 339 relevant pairs (4% precision, 65% recall). Under 0.01% of each cited article was relevant to the citing sentence, making precise retrieval challenging. We performed a detailed error analysis: cosine similarity performance was reduced by insufficient window size, affixes, hyphenation, acronyms and abbreviations. The following preprocessing steps would improve retrieval performance: using a stemming algorithm that accounts for prefixes, expanding the window of comparison from sentences to paragraphs, identifying synonyms, and expanding abbreviations. Further investigation of the possibilities of cosine similarity is necessary, but such investigation is worth pursuing.
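
    For reference, the measure under evaluation and the two reported metrics are the standard definitions, written here in LaTeX notation as background only; q and d stand for the term vectors of the citing sentence and the cited text.

        \cos(\mathbf{q},\mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert},
        \qquad
        \mathrm{precision} = \frac{|\mathrm{relevant}\cap\mathrm{retrieved}|}{|\mathrm{retrieved}|},
        \qquad
        \mathrm{recall} = \frac{|\mathrm{relevant}\cap\mathrm{retrieved}|}{|\mathrm{relevant}|}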

    Text Preprocessing in Programmable Logic

    There is a tremendous amount of information being generated and stored every year, and its growth rate is exponential. From 2008 to 2009, the growth rate was estimated at 62%. In 2010, the amount of generated information is expected to grow by 50% to 1.2 Zettabytes, and by 2020 this amount is expected to reach 35 Zettabytes. By preprocessing text in programmable logic, high data processing rates can be achieved with greater power efficiency than with an equivalent software solution, leading to a smaller carbon footprint. This thesis presents an overview of the fields of Information Retrieval and Natural Language Processing, and the design and implementation of four text preprocessing modules in programmable logic: UTF-8 decoding, stop-word filtering, and stemming with both Lovins' and Porter's techniques. These extensively pipelined circuits were implemented in a high-performance FPGA and found to sustain maximum operating frequencies of 704 MHz, data throughputs in excess of 5 Gbps, and efficiencies in the range of 4.332-6.765 mW/Gbps and 34.66-108.2 µW/MHz. These circuits can be incorporated into larger systems, such as document classifiers and information extraction engines.
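
    The modules themselves are hardware circuits, but the pipeline they implement (UTF-8 decoding, stop-word filtering, stemming) can be outlined with a small software reference model. The sketch below uses a toy stop-word list and a crude suffix stripper as a stand-in for the Lovins/Porter stemmers, so it is illustrative only and not the thesis' design.

        # Software reference model for the text-preprocessing pipeline described
        # above: UTF-8 decode -> tokenize -> stop-word filter -> (crude) stemming.
        # The suffix stripper is a toy stand-in for the Lovins/Porter algorithms.

        STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "it"}
        SUFFIXES = ("ations", "ation", "ingly", "ings", "ing", "edly", "ed", "ly", "s")

        def crude_stem(word):
            for suf in SUFFIXES:              # longest-match suffix stripping
                if word.endswith(suf) and len(word) - len(suf) >= 3:
                    return word[: -len(suf)]
            return word

        def preprocess(raw_bytes):
            text = raw_bytes.decode("utf-8")                      # UTF-8 decoding stage
            tokens = [t for t in text.lower().split() if t.isalpha()]
            tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word filter
            return [crude_stem(t) for t in tokens]                # stemming stage

        print(preprocess("The circuits are sustaining high throughputs".encode("utf-8")))
        # -> ['circuit', 'are', 'sustain', 'high', 'throughput']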

    Social Sentiment: a platform for monitoring social networks supported by sentiment analysis.

    Bachelor's thesis (Trabajo de Fin de Grado), Grado en Ingeniería Informática, academic year 2020-2021. The platform consists, first of all, of a web page where users can log in to see different statistics about their social networks. A user system is necessary because the platform's own functionalities already require a certain amount of resources; opening the platform completely would not allow it to offer everything it intends to with acceptable performance. The client connects to the server through an API. For the frontend we intend to use the popular Bootstrap framework, along with node.js and express.js in the backend. It will also be necessary to build a database, preferably non-relational (given the document structure of the data the platform will work with), such as mongodb. Thus, the user system would be based on the node.js/mongodb combination. Finally, the server will have several modules written in Python whose objective is the collection, processing and analysis of data. For this purpose, a crawler module will be built to obtain data from social networks using the available APIs. Modules that process and analyze the data will also be needed. These modules will use both more traditional, formal techniques and machine learning techniques, specifically sentiment analysis (NLP, natural language processing). A sentiment dataset will therefore be needed, so the creation of such a dataset for the Spanish language is also proposed, given the notable absence of this type of dataset in Spanish.
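
    The analysis modules are only described at a high level. As a rough illustration of the kind of Python sentiment-analysis module the entry envisages, the following trains a simple bag-of-words classifier with scikit-learn on a handful of in-line Spanish examples; the texts, labels and model choice are all placeholders, not part of the project.

        # Rough illustration of a Python sentiment-analysis module: a bag-of-words
        # logistic-regression classifier over a tiny in-line Spanish dataset.
        # Texts, labels and model choice are placeholders, not the project's own.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "me encanta esta plataforma",      # positive
            "el servicio es excelente",        # positive
            "esto es terrible y muy lento",    # negative
            "odio la nueva interfaz",          # negative
        ]
        train_labels = ["pos", "pos", "neg", "neg"]

        model = make_pipeline(TfidfVectorizer(), LogisticRegression())
        model.fit(train_texts, train_labels)

        print(model.predict(["me encanta el servicio"]))   # -> ['pos']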

    An OpenEHR-based electronic health record repository

    Master's dissertation, Mestrado em Engenharia de Computadores e Telemática. An Electronic Health Record (EHR) aggregates all relevant medical information about a single patient, allowing a patient-centric storage approach. In this way the complete medical history of a patient is stored together in one record, making it possible to save time and work by allowing the sharing of information between health care institutions. To make this sharing possible, the format in which the information is saved has to be agreed upon. There are many standards that define the way health information is stored, exchanged and retrieved; one of these standards is the Open Electronic Health Record (OpenEHR). The goal of this thesis is to create a repository which allows patient records that follow the OpenEHR standard to be stored and managed. The result of the implementation consists of three software parts: an Extensible Markup Language (XML) repository to store health information, a set of services for managing and querying the stored information, and a web interface to demonstrate the implemented functionalities.
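
    The entry names the three components but gives no implementation details. As a loose illustration of the query-service idea, the sketch below stores a toy OpenEHR-flavoured XML fragment and reads a value out of it with Python's standard ElementTree; the element names are invented for the example and are not taken from the OpenEHR specification.

        # Loose illustration of querying an XML health-record store with the
        # standard library. Element names are invented, not real OpenEHR paths.
        import xml.etree.ElementTree as ET

        record_xml = """
        <composition>
          <patient id="12345"/>
          <observation name="body_temperature">
            <magnitude units="Cel">37.2</magnitude>
          </observation>
        </composition>
        """

        root = ET.fromstring(record_xml)

        # Find every observation and read its magnitude and units.
        for obs in root.findall("observation"):
            mag = obs.find("magnitude")
            print(obs.get("name"), mag.text, mag.get("units"))
        # -> body_temperature 37.2 Cel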