
    Text classification supervised algorithms with term frequency inverse document frequency and global vectors for word representation: a comparative study

    Over the previous two decades, the quantity of digitally stored text documents has risen sharply. Text categorization is the automated organization and classification of those documents into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers, because the concept has a significant impact on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to model complex, nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machines (SVM), and ensemble learning algorithms, assessed with various evaluation techniques. The constraints of each technique, and how the techniques can be applied to real-life situations, are then evaluated.
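The setup this study compares can be sketched as a standard TF-IDF plus linear-SVM pipeline. The toy corpus, labels, and queries below are invented for illustration, and scikit-learn stands in for whatever implementation the study actually used.

```python
# A minimal TF-IDF + supervised classifier sketch; corpus and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "cheap meds buy now limited offer",
    "meeting rescheduled to monday morning",
    "win a free prize click the link",
    "quarterly report attached for review",
]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF turns each document into a sparse weighted term vector;
# the linear SVM then learns a separating hyperplane over those vectors.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
```

The same vectorizer output could equally feed a KNN or ensemble model, which is exactly the kind of swap the comparative study evaluates.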

    Bag of Textual Graphs: an accurate, efficient, and general-purpose graph-based text representation model

    Advisor: Ricardo da Silva Torres. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Text representation models are the fundamental basis for Information Retrieval and Text Mining tasks. Although different text representation models have been proposed, none is at the same time efficient, accurate, and flexible enough to be used in varied applications. Here we present Bag of Textual Graphs, a text representation model that satisfies these three requirements by combining a graph-based representation model with a generic framework for graph-to-vector synthesis. We evaluate our method in experiments considering four well-known text collections: Reuters-21578, 20-newsgroups, 4-universities, and K-series. Experimental results demonstrate that our model is generic enough to handle different collections, and is more efficient than widely used state-of-the-art methods in text classification and retrieval tasks, without losing accuracy.
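The abstract does not detail the graph construction, but the graph-of-words family it builds on can be illustrated with a minimal co-occurrence sketch: terms become nodes, and terms that co-occur within a sliding window are connected by weighted edges. The window size and whitespace tokenization below are arbitrary assumptions, not the dissertation's actual method.

```python
# Toy graph-of-words: nodes are terms, edge weights count co-occurrences
# within a sliding window. A graph-to-vector step (not shown) would then
# synthesize such graphs into fixed-length vectors.
from collections import defaultdict

def graph_of_words(text, window=2):
    tokens = text.lower().split()
    edges = defaultdict(int)
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + 1 + window]:
            if u != t:
                edges[tuple(sorted((t, u)))] += 1  # undirected, weighted
    return dict(edges)

g = graph_of_words("text mining needs good text representation")
```

Repeated co-occurrences accumulate weight: here "needs" and "text" co-occur twice, so their edge weight is 2.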

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.
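Of the survey's three problems, classifier evaluation is the most self-contained to illustrate: per-category precision, recall, and F1 computed from predicted versus true labels. The categories and label sequences below are invented for illustration.

```python
# Per-category precision/recall/F1, the standard evaluation measures for
# text categorization; true/pred labels are invented toy data.
def prf(true, pred, cat):
    tp = sum(1 for t, p in zip(true, pred) if p == cat and t == cat)
    fp = sum(1 for t, p in zip(true, pred) if p == cat and t != cat)
    fn = sum(1 for t, p in zip(true, pred) if p != cat and t == cat)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true = ["sports", "politics", "sports", "tech", "politics"]
pred = ["sports", "sports", "sports", "tech", "politics"]
p, r, f = prf(true, pred, "sports")
```

Averaging these per-category scores (macro-averaging) or pooling the counts first (micro-averaging) gives the collection-level figures usually reported in this literature.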

    Transfer Learning using Computational Intelligence: A Survey

    Transfer learning aims to provide a framework that uses previously acquired knowledge to solve new but similar problems much more quickly and effectively. In contrast to classical machine learning methods, transfer learning methods exploit the knowledge accumulated from data in auxiliary domains to facilitate predictive modeling of different data patterns in the current domain. To improve the performance of existing transfer learning methods and handle the knowledge transfer process in real-world systems, …

    A New Web Search Engine with Learning Hierarchy

    Most of the existing web search engines (such as Google and Bing) are in the form of keyword-based search. Typically, after the user issues a query with the keywords, the search engine will return a flat list of results. When the query issued by the user is related to a topic, keyword matching alone may not accurately retrieve the whole set of webpages in that topic. On the other hand, there exists another type of search system, particularly in e-Commerce websites, where the user can search in the categories of different faceted hierarchies (e.g., product types and price ranges). Is it possible to integrate the two types of search systems and build a web search engine with a topic hierarchy? The main difficulty is how to classify the vast number of webpages on the Internet into the topic hierarchy. In this thesis, we leverage machine learning techniques to automatically classify webpages into the categories in our hierarchy, and then utilize the classification results to build the new search engine SEE. The experimental results demonstrate that SEE can achieve better search results than the traditional keyword-based search engine in most of the queries, particularly when the query is related to a topic. We also conduct a small-scale usability study which further verifies that SEE is a promising search engine. To further improve SEE, we also propose a new active learning framework with several novel strategies for hierarchical classification.
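Hierarchical classification of this kind is often done top-down: one decision at the root picks a top-level topic, then a per-topic model refines it. The sketch below uses keyword rules as a stand-in for the learned classifiers; all category names and keywords are hypothetical, not taken from the thesis.

```python
# Toy top-down hierarchical classifier: a root-level decision followed by
# a per-topic refinement. Keyword tables stand in for trained models.
ROOT_KEYWORDS = {"laptop": "electronics", "phone": "electronics",
                 "novel": "books", "cookbook": "books"}
SUBTOPIC = {"electronics": {"laptop": "computers", "phone": "mobile"},
            "books": {"novel": "fiction", "cookbook": "non-fiction"}}

def classify(query):
    for word in query.lower().split():
        topic = ROOT_KEYWORDS.get(word)
        if topic:                           # root-level decision
            sub = SUBTOPIC[topic].get(word, "other")
            return topic, sub               # path down the hierarchy
    return "unknown", "unknown"

path = classify("cheap laptop deals")
```

An active learning framework such as the one proposed would decide which unlabeled webpages to route to a human annotator to improve these per-node decisions.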

    Linking social media, medical literature, and clinical notes using deep learning.

    Researchers analyze data, information, and knowledge through many sources, formats, and methods. The dominant data formats include text and images. In the healthcare industry, professionals generate a large quantity of unstructured data. The complexity of this data and the lack of computational power cause delays in analysis. However, with emerging deep learning algorithms and access to computational power such as graphics processing units (GPUs) and tensor processing units (TPUs), processing text and images is becoming more accessible. Deep learning algorithms achieve remarkable results in natural language processing (NLP) and computer vision. In this study, we focus on NLP in the healthcare industry and collect data not only from electronic medical records (EMRs) but also from medical literature and social media. We propose a framework for linking social media, medical literature, and EMRs' clinical notes using deep learning algorithms. Connecting data sources requires defining a link between them, and our key is finding concepts in the medical text. The National Library of Medicine (NLM) maintains the Unified Medical Language System (UMLS), and we use this system as the foundation of our own. We recognize social media's dynamic nature and apply supervised and semi-supervised methodologies to generate concepts. Named entity recognition (NER) allows efficient extraction of information, or entities, from medical literature, and we extend the model to process the EMRs' clinical notes via transfer learning. The results include an integrated, end-to-end, web-based system solution that unifies social media, literature, and clinical notes, and improves access to medical knowledge for the public and experts.
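The concept-linking idea can be illustrated with a toy dictionary lookup standing in for UMLS-based concept recognition and the NER models described above. The terms, concept IDs, and example texts below are all invented.

```python
# Toy concept linker: a dictionary lookup plays the role of UMLS concept
# recognition; the shared concept set is the link between two sources.
CONCEPTS = {"headache": "C001", "aspirin": "C002", "fever": "C003"}  # invented IDs

def extract_concepts(text):
    # Return the set of concept IDs mentioned in a text span.
    return {cid for term, cid in CONCEPTS.items() if term in text.lower()}

tweet = "Took aspirin for this headache, still no relief"
note = "Patient reports headache and low-grade fever."
shared = extract_concepts(tweet) & extract_concepts(note)  # the cross-source link
```

In the actual framework, trained NER models replace the dictionary so that variant spellings and unseen mentions still resolve to the right concepts.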

    Govwise procurement vocabulary (GPV) - An alternative to the Common Procurement Vocabulary (CPV)

    Internship report presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. In recent years, the world has witnessed emerging legislation on open data. Some of the main goals include stimulating economic growth through the re-use of the data, addressing societal challenges, enhancing evidence-based policymaking, increasing efficiency in public administrations, fostering the development of new technologies such as AI, and enhancing citizens' participation in political decisions and their transparency (European Commission, Open Data, 2021). Govwise is an advanced analytics platform developed to provide a wide range of data analytics to governmental organizations via a SaaS model. The goal is to make use of open data policies by producing valuable information, tackling the challenges that such a data deluge raises. These challenges constitute the scope of the internship reported here. The whole process is described, starting with the data sources and the respective ETL process at a high level, through to the production of the analyses and dashboards that constitute the Govwise product. The focus is, however, on the classification model developed to address a major need of the company: clustering Portuguese public procurement contracts. These are initially classified with a CPV code (Common Procurement Vocabulary code), which does not satisfy Govwise's needs. The end goal of the model is therefore to generate an alternative classification, the GPV (Govwise Procurement Vocabulary), for contracts and tenders of Portuguese public procurement.

    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
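The core idea of inference over inverted indices can be sketched as follows: score only the documents whose postings contain a query term, so cost scales with sparsity rather than collection size. The toy collection and the IDF-style weighting below are illustrative assumptions, not the thesis's actual generative model.

```python
# Minimal inverted-index scoring sketch: only postings of query terms are
# touched, never the full collection. Weighting is a simple IDF stand-in.
import math
from collections import Counter, defaultdict

docs = {0: "sparse models scale well", 1: "dense scans are slow",
        2: "sparse inference with an index"}

index = defaultdict(list)  # term -> [(doc_id, term_count)]
for doc_id, text in docs.items():
    for term, count in Counter(text.split()).items():
        index[term].append((doc_id, count))

def best_match(query):
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, [])
        for doc_id, count in postings:          # sparse: matched docs only
            scores[doc_id] += count * math.log(len(docs) / len(postings))
    return max(scores, key=scores.get) if scores else None

best = best_match("sparse index")
```

With a probabilistic model whose per-term weights are precomputed into such postings, the same traversal yields model inference at search-engine cost, which is the combination the thesis develops.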