36 research outputs found

    News Reliability Evaluation using Latent Semantic Analysis

    The rapid rise and wide spread of ‘Fake News’ has severe implications for society today. Much effort has been directed in recent years towards developing methods to verify the reliability of news on the Internet. In this paper, an automated news reliability evaluation system is proposed. The system utilizes several Natural Language Processing (NLP) techniques, such as Term Frequency-Inverse Document Frequency (TF-IDF), phrase detection, and cosine similarity, in tandem with Latent Semantic Analysis (LSA). A collection of 9,203 labelled articles from both reliable and unreliable sources was assembled, and a random train-test split was applied to create the training and testing datasets. The final results show a precision of 81.87%, a recall of 86.95%, and an accuracy of 73.33%.
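    The TF-IDF + LSA + cosine-similarity pipeline the abstract describes can be sketched with scikit-learn; the four toy documents and the two-dimensional latent space below are invented stand-ins for the paper's 9,203-article corpus, not the authors' exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the labelled articles (invented examples).
docs = [
    "government confirms new vaccine trial results",
    "shocking secret cure doctors do not want you to know",
    "central bank raises interest rates by a quarter point",
    "miracle pill melts fat overnight claim experts",
]

# TF-IDF term-document representation.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# LSA: reduce the TF-IDF matrix to a low-rank latent space.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# Cosine similarity between documents in the latent space.
sims = cosine_similarity(X_lsa)
```

    Comparing documents in the latent space rather than in the raw TF-IDF space lets articles score as similar even when they share few exact terms.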

    Classificação automática de texto buscando similaridade de palavras e significados ocultos

    We adopt the Latent Semantic Indexing (LSI) method to classify documents that are related by some means not restricted to the terms they contain, seeking other forms of similarity. Dimensionality reduction of the term-document matrix is not new; between 200 and 300 dimensions are usually adopted. In this work, we turn LSI into a semi-supervised algorithm and determine the ideal number of dimensions during the training phase. The algorithm uses a space isometric to the one defined by the term-document matrix to speed up the computations. Eje: Workshop Bases de datos y minería de datos (WBDDM). Red de Universidades con Carreras en Informática (RedUNCI).
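    The idea of choosing the LSI dimensionality during training can be sketched as follows; the toy term-document matrix, the labels, and the nearest-centroid scoring rule are invented for illustration and are not the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy term-document matrix: two topical clusters of documents (invented data).
# Each topic activates its own random subset of the 40 terms.
topic_a = rng.poisson(3.0, size=(40, 10)) * (rng.random((40, 1)) < 0.4)
topic_b = rng.poisson(3.0, size=(40, 10)) * (rng.random((40, 1)) < 0.4)
A = np.hstack([topic_a, topic_b]).astype(float)   # 40 terms x 20 docs
labels = np.array([0] * 10 + [1] * 10)

# One SVD; every rank-k LSI space is a prefix of these factors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def accuracy_at(k):
    """Nearest-centroid accuracy of the labelled docs in the rank-k space."""
    docs = (np.diag(s[:k]) @ Vt[:k]).T            # (n_docs, k) coordinates
    c0 = docs[labels == 0].mean(axis=0)
    c1 = docs[labels == 1].mean(axis=0)
    pred = np.where(np.linalg.norm(docs - c0, axis=1)
                    <= np.linalg.norm(docs - c1, axis=1), 0, 1)
    return (pred == labels).mean()

# Semi-supervised selection: keep the dimensionality that scores best.
best_k = max(range(1, len(s) + 1), key=accuracy_at)
```

    The point is that the number of dimensions becomes a parameter tuned from labelled data rather than a fixed convention such as 200-300.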

    Inferior parietal lobule and early visual areas support elicitation of individualized meanings during narrative listening

    Introduction: When listening to a narrative, verbal expressions translate into meanings and a flow of mental imagery. However, the same narrative can be heard quite differently depending on listeners' previous experiences and knowledge. We capitalized on such differences to disclose brain regions that support the transformation of a narrative into individualized propositional meanings and associated mental imagery, by analyzing brain activity associated with behaviorally assessed individual meanings elicited by a narrative. Methods: Sixteen right-handed female subjects were instructed to list the words that best described what had come to their minds while listening to an eight-minute narrative during functional magnetic resonance imaging (fMRI). The fMRI data were analyzed by calculating voxel-wise intersubject correlation (ISC) values. We used latent semantic analysis (LSA) enhanced with WordNet knowledge to measure the semantic similarity of the produced words between subjects. Finally, we predicted the ISC from the semantic similarity using representational similarity analysis. Results: We found that semantic similarity in these word listings between subjects, estimated using LSA combined with WordNet knowledge, predicted similarities in brain hemodynamic activity. Subject pairs whose individual semantics were similar also exhibited similar brain activity in the bilateral supramarginal and angular gyri of the inferior parietal lobe, and in the occipital pole. Conclusions: Our results demonstrate, using a novel method to measure interindividual differences in semantics, brain mechanisms that give rise to semantics and associated imagery during narrative listening. While listening to a captivating narrative, the inferior parietal lobe and early visual cortical areas thus seem to support the elicitation of individual meanings and the flow of mental imagery. Peer reviewed.
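    Representational similarity analysis of this kind can be sketched by rank-correlating two pairwise similarity matrices; the matrices below are random stand-ins for the semantic and ISC data, not the study's measurements:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 16  # sixteen subjects, as in the study

# Invented stand-ins: pairwise semantic similarity of subjects' word
# listings, and pairwise inter-subject correlation (ISC) of brain activity.
semantic = rng.random((n, n))
semantic = (semantic + semantic.T) / 2            # symmetric matrix
noise = rng.random((n, n))
isc = 0.8 * semantic + 0.2 * (noise + noise.T) / 2

# Representational similarity analysis: rank-correlate the upper triangles
# (each off-diagonal entry is one subject pair).
iu = np.triu_indices(n, k=1)
rho, p = spearmanr(semantic[iu], isc[iu])
```

    A significantly positive rank correlation is the signature the study reports: subject pairs with similar word-listing semantics also show similar brain activity.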

    Matching Possible Mitigations to Cyber Threats: A Document-Driven Decision Support Systems Approach

    Despite more than a decade of heightened focus on cybersecurity, the threat continues. To limit their possible impacts, cyber threats must be mitigated. Mitigation catalogs exist in practice today, but they do not map mitigations to the specific threats they counter. Currently, mitigations are manually selected by cybersecurity experts (CSEs), who are in short supply. To reduce labor and improve repeatability, an automated approach is needed for matching mitigations to cyber threats. This research explores the application of supervised machine learning and text retrieval techniques to automate the matching of relevant mitigations to cyber threats where both are expressed as text, resulting in a novel method that combines two techniques: support vector machine classification and latent semantic analysis. In five test cases, the approach demonstrates high recall for known relevant mitigation documents, bolstering confidence that potentially relevant mitigations will not be overlooked. It automatically excludes 97% of non-relevant mitigations, greatly reducing the CSE's workload compared with purely manual matching.
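    The combination of the two techniques can be sketched as a scikit-learn pipeline, with TruncatedSVD playing the LSA role and LinearSVC the SVM role; the mitigation texts and relevance labels below are invented, and the paper's actual feature engineering will differ:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# Invented toy data: mitigation texts labelled relevant (1) or not (0)
# for one cyber threat; real catalogs are far larger.
mitigations = [
    "apply vendor patches to the affected service promptly",
    "segment the network to limit lateral movement",
    "provide ergonomic keyboards to staff",
    "schedule quarterly team-building workshops",
    "enable multi-factor authentication for remote access",
    "repaint the data center walls",
]
relevant = [1, 1, 0, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),  # the LSA step
    LinearSVC(),                                   # the SVM step
)
model.fit(mitigations, relevant)

# Screen new candidate mitigations for the same threat.
preds = model.predict(["patch the server", "buy new office chairs"])
```

    Filtering candidates with a classifier trained in the latent space is what lets the method discard the bulk of non-relevant mitigations while keeping recall high.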

    Research infrastructures in the LHC era : a scientometric approach

    When a research infrastructure is funded and implemented, new information and new publications are created. This new information is the measurable output of the discovery process. In this paper, we describe the impact of infrastructure for physics experiments in terms of publications and citations. In particular, we consider the Large Hadron Collider (LHC) experiments (ATLAS, CMS, ALICE, LHCb) and compare them to the Large Electron Positron Collider (LEP) experiments (ALEPH, DELPHI, L3, OPAL) and the Tevatron experiments (CDF, D0). We provide an overview of the scientific output of these projects over time and highlight the role played by remarkable project results in the publication-citation distribution trends. The methodological and technical contributions of this work provide a starting point for the development of a theoretical model of how modern scientific knowledge propagates over time.

    Observing LOD: Its Knowledge Domains and the Varying Behavior of Ontologies Across Them

    Linked Open Data (LOD) is the largest collaborative, distributed, and publicly accessible Knowledge Graph (KG), uniformly encoded in the Resource Description Framework (RDF) and formally represented according to the semantics of the Web Ontology Language (OWL). LOD provides researchers with a unique opportunity to study knowledge engineering as an empirical science: to observe existing modelling practices and possibly to understand how to improve knowledge engineering methodologies and knowledge representation formalisms. Following this perspective, several studies have analysed LOD to identify (mis-)use of OWL constructs or other modelling phenomena, e.g., class or property usage, their alignment, or the average depth of taxonomies. A question that remains open is whether there is a relation between observed modelling practices and knowledge domains (natural science, linguistics, etc.): do certain practices or phenomena change as the knowledge domain varies? Answering this question requires an assessment of the domains covered by LOD as well as a classification of its datasets. Existing approaches to classifying LOD datasets provide partial and unaligned views, posing additional challenges. In this paper, we introduce a classification of knowledge domains and a method for classifying LOD datasets and ontologies based on it. We classify a large portion of LOD and investigate whether a set of observed phenomena have a domain-specific character.

    A data-driven analysis of workers' earnings on Amazon Mechanical Turk

    A growing number of people are working as part of online crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low or high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.
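    The sensitivity of the wage estimate to how unpaid time is counted can be illustrated with a toy task log (all figures invented):

```python
# Invented task log for one worker: (reward_usd, minutes_worked, was_paid).
tasks = [
    (0.50, 10, True),
    (0.05, 2,  True),
    (1.00, 25, True),
    (0.20, 6,  False),  # rejected task: time spent, no pay received
]
search_minutes = 12     # unpaid time spent searching for tasks

paid = sum(reward for reward, _, ok in tasks if ok)
paid_minutes = sum(mins for _, mins, ok in tasks if ok)
all_minutes = sum(mins for _, mins, _ in tasks) + search_minutes

# Hourly wage over paid task time only vs. over all time worked.
wage_paid_time = paid / (paid_minutes / 60)
wage_all_time = paid / (all_minutes / 60)
```

    Counting rejected work and search time pushes the estimate down, which is why the abstract flags the accounting choice as a driver of the reported figures.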

    Patent Thickets Identification

    Patent thickets have been identified by various citation-based techniques, such as those of Graevenitz et al. (2011) and Clarkson (2005). An alternative, direct measurement is based on expert opinion. We use natural language processing techniques to measure the pairwise semantic similarity of patents identified as thicket members by experts, creating a semantic network. We compare the semantic similarity scores for patents in different expert-identified thickets: those within the same thicket, those in different thickets, and those not in thickets. We show that patents within the same thicket are significantly more semantically similar than other pairs of patents. We then present a statistical model to assess the probability of a newly added patent belonging to a thicket, based on semantic networks as well as other measures from the existing thicket literature (the triples of Graevenitz and Clarkson's density ratio). We conclude that combining information from semantic distance with other sources can help isolate the patents that are likely to be members of thickets.
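    The within-thicket vs. cross-thicket comparison can be sketched as follows; the patent texts and thicket labels are invented, and plain TF-IDF cosine similarity stands in for the paper's semantic-network measure:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented patent abstracts with expert-assigned thicket labels.
patents = [
    "antenna array beamforming for cellular handsets",   # thicket A
    "beam steering antenna module for mobile devices",   # thicket A
    "lithium battery electrode coating process",         # thicket B
    "electrode slurry coating for lithium cells",        # thicket B
]
thicket = ["A", "A", "B", "B"]

# Pairwise semantic similarity matrix over the patent texts.
sims = cosine_similarity(TfidfVectorizer().fit_transform(patents))

# Mean similarity within the same thicket vs. across thickets.
pairs = [(i, j) for i in range(len(patents)) for j in range(i + 1, len(patents))]
within = np.mean([sims[i, j] for i, j in pairs if thicket[i] == thicket[j]])
across = np.mean([sims[i, j] for i, j in pairs if thicket[i] != thicket[j]])
```

    The paper's finding corresponds to `within` exceeding `across` by a statistically significant margin on the expert-labelled data.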