2,105 research outputs found

    Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

    DMNLP, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without an absolute need for exact categorization. In particular, experiments performed on two real case studies with a vocabulary based on bigram features yield readable topics that cover most of the documents. Precision at 10 reaches up to 74% for a dataset of scientific abstracts with 10,000 features, which is 4% less than when using unigrams only but provides more interpretable topics.

    Novel Application of Neutrosophic Logic in Classifiers Evaluated under Region-Based Image Categorization System

    Neutrosophic logic is a relatively new logic that generalizes fuzzy logic. In this dissertation, neutrosophic logic is applied for the first time to the field of classifiers, with a support vector machine (SVM) adopted as the example to validate its feasibility and effectiveness. The proposed neutrosophic set is integrated into a reformulated SVM, and the performance of the resulting classifier, N-SVM, is evaluated under an image categorization system. Image categorization is an important yet challenging research topic in computer vision. In this dissertation, images are first segmented by a hierarchical two-stage self-organizing map (HSOM), using color and texture features. A novel approach is proposed to select the training samples of HSOM based on homogeneity properties. A diverse density support vector machine (DD-SVM) framework that extends the multiple-instance learning (MIL) technique is then applied to the image categorization problem by viewing an image as a bag of instances corresponding to the regions obtained from the image segmentation. Using the instance prototype, every bag is mapped to a point in the new bag space, and the categorization is transformed into a classification problem. Then, the proposed N-SVM based on the neutrosophic set is used as the classifier in the new bag space. N-SVM treats samples differently according to the weighting function, which helps reduce the effects of outliers. Experimental results on a COREL dataset of 1000 general purpose images and a Caltech 101 dataset of 9000 images demonstrate the validity and effectiveness of the proposed method.

    Weak signal identification with semantic web mining

    We investigate the automated identification of weak signals, in the sense of Ansoff, to improve strategic planning and technological forecasting. The literature shows that weak signals can be found in an organization's environment and that they appear in different contexts. We use internet information to represent the organization's environment, and we select those websites that are related to a given hypothesis. In contrast to related research, a methodology is provided that uses latent semantic indexing (LSI) for the identification of weak signals. This improves on existing knowledge-based approaches because LSI considers aspects of meaning and is thus able to identify similar textual patterns in different contexts. A new weak signal maximization approach is introduced that replaces the commonly used prediction modeling approach in LSI. It makes it possible to calculate the largest number of relevant weak signals represented by singular value decomposition (SVD) dimensions. A case study identifies and analyses weak signals to predict trends in the field of on-site medical oxygen production. This supports the planning of research and development (R&D) for a medical oxygen supplier. As a result, it is shown that the proposed methodology enables organizations to identify weak signals from the internet for a given hypothesis. This helps strategic planners to react ahead of time.
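
The LSI step mentioned in the abstract above can be illustrated with a minimal sketch. It shows only generic LSI (a truncated SVD of a term-document matrix followed by cosine similarity in the latent space); it is not the authors' weak signal maximization approach, which the abstract does not specify. The function names, the toy matrix, and the choice of `k` are all assumptions made for illustration.

```python
import numpy as np

def lsi_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc has shape (n_terms, n_docs); the result has shape (n_docs, k).
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(s))  # cannot keep more dimensions than singular values
    # Scale the right singular vectors by the singular values, one row per document.
    return (s[:k, None] * vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["oxygen", "medical", "generator", "football"]
term_doc = np.array([
    [2, 1, 0],
    [1, 2, 0],
    [1, 1, 0],
    [0, 0, 3],
], dtype=float)

docs = lsi_document_vectors(term_doc, k=2)
sim_related = cosine(docs[0], docs[1])    # two documents sharing vocabulary
sim_unrelated = cosine(docs[0], docs[2])  # documents with disjoint vocabulary
```

In this toy example the first two documents end up close together in the latent space while the third stays apart, which is the property the abstract relies on when it says LSI can identify similar textual patterns in different contexts.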

    No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

    Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works

    Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.

    The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and such systems cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches have started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component of our system consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first one is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second membership represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation. Our approach is based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes rule; and (iv) a probabilistic model that is based on possibilistic membership degrees to annotate an image. 
The third component of the proposed system consists of an image retrieval framework based on multi-modal similarity propagation. The proposed framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our proposed image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested with a standard benchmark dataset. We show the effectiveness of our clustering algorithm to handle high dimensional and noisy data. We compare our proposed image annotation approach to three state-of-the-art methods and demonstrate the effectiveness of the proposed image retrieval system

    Text Clumping for Technical Intelligence

    Automatic Text Summarization

    Writing text was one of the first methods humans used to represent their knowledge. Text can be of different types and serve different purposes. Due to the evolution of information systems and the Internet, the amount of textual information available has increased exponentially on a worldwide scale, and many documents tend to contain a percentage of unnecessary information. As a result, most readers have difficulty digesting all the extensive information contained in the many documents produced on a daily basis. A simple solution to the excess of irrelevant information in texts is to create summaries, in which we keep the parts related to the subject and remove the unnecessary ones. In Natural Language Processing, the goal of automatic text summarization is to create systems that process text and keep only the most important data. Since its creation, several approaches have been designed to produce better text summaries, which can be divided into two separate groups: extractive approaches and abstractive approaches. In the first group, the summarizers decide which text elements should be in the summary. The criteria by which they are selected are diverse. After they are selected, they are combined into the summary. In the second group, the text elements are generated from scratch. Abstractive summarizers are much more complex, so they still need a lot of research to produce good results. During this thesis, we investigated the state-of-the-art approaches, implemented our own versions, and tested them on conventional datasets, such as the DUC dataset. Our first approach was frequency-based, since it analyses the frequency with which the text's words/sentences appear in the text. Higher-frequency words/sentences automatically receive higher scores, which are then filtered with a compression rate and combined into a summary. 
Moving on to our second approach, we improved the original TextRank algorithm by combining it with word embedding vectors. The goal was to represent the text's sentences as nodes of a graph and, with the help of word embeddings, determine how similar pairs of sentences are and rank them by their similarity scores. The highest-ranking sentences were filtered with a compression rate and picked for the summary. In the third approach, we combined feature analysis with deep learning. By analysing certain characteristics of the text's sentences, one can assign scores that represent the importance of a given sentence for the summary. With these computed values, we created a dataset for training a deep neural network that is capable of deciding whether or not a given sentence should be in the summary. An abstractive encoder-decoder summarizer was created with the purpose of generating words related to the document subject and combining them into a summary. Finally, every summarizer was combined into a full system. Each of our approaches was evaluated with several evaluation metrics, such as ROUGE. We used the DUC dataset for this purpose, and the results were fairly similar to those in the scientific community. As for our encoder-decoder, we got promising results. Text is one of the most important tools for transmitting ideas between human beings. It can be of several types, and its content can be more or less easy to interpret depending on the amount of relevant information about the main subject. To make processing easier for the reader, there is a mechanism purposely created to reduce the irrelevant information in a text, called text summarization. Summarization creates reduced versions of the original text while keeping the information about the main subject. 
Due to the creation and evolution of the Internet and other means of communication, there has been an exponential increase in textual documents, an event known as information overload, and most of these documents contain unnecessary information about the subject they address. To solve this global problem, automatic text summarization emerged within the scientific area of Natural Language Processing; it allows automatic summaries to be created for any type of text in any language by means of computational algorithms. Since its creation, numerous text summarization techniques have been devised, which can be classified into two types: extractive and abstractive. In extractive techniques, elements of the original text, such as the words or whole sentences most illustrative of the subject of the text, are transcribed and combined into a document. In abstractive techniques, the algorithms generate new elements. In this dissertation, some of the techniques with the best results were researched, implemented, and combined to create a complete system for producing summaries. Of the implemented techniques, the first three are extractive while the last is abstractive. The first relies on computing the frequencies of the text elements, assigning values to the most frequent sentences, which are then chosen for the summary according to a compression rate. Another technique represents the textual elements as nodes of a graph, assigns similarity values between them, and then selects the sentences with the highest values according to a compression rate. A further approach was created to combine a mechanism for analysing the characteristics of the text with methods based on artificial intelligence. In it, each sentence has a set of features that are used to train a neural network model. 
The model evaluates and decides which sentences should belong to the summary and filters them according to a compression rate. An abstractive summarizer was created to generate words about the subject of the text and combine them into a summary. Each of these summarizers was combined into a single system. Finally, each technique can be evaluated with several evaluation metrics, such as ROUGE. According to the evaluation results with the DUC dataset, our summarizers obtained results fairly similar to those reported in the scientific community, with special attention to the encoder-decoder, which in some cases showed promising results.
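
The frequency-based extractive approach described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `summarize`, the regex-based sentence splitting and tokenization, the scoring rule (sum of word frequencies per sentence), and the reading of "compression rate" as the fraction of sentences to keep are all hypothetical choices.

```python
import re
from collections import Counter

def summarize(text, compression_rate=0.5):
    """Keep the highest-scoring sentences, preserving their original order.

    compression_rate is ASSUMED here to mean the fraction of sentences kept.
    """
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    # Lowercased word tokens for each sentence
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # Word frequencies over the whole text
    freq = Counter(word for words in tokenized for word in words)
    # Sentence score: sum of the frequencies of its words
    scores = [sum(freq[word] for word in words) for words in tokenized]
    # Number of sentences to keep, at least one
    keep = max(1, int(len(sentences) * compression_rate))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Emit the selected sentences in document order
    return " ".join(sentences[i] for i in sorted(top))

example = "Cats sleep. Cats eat fish and cats chase mice. Dogs bark."
short_summary = summarize(example, compression_rate=0.34)
longer_summary = summarize(example, compression_rate=0.67)
```

Summing raw frequencies favours long sentences and common function words; a fuller version would typically remove stop words and normalize by sentence length, but those refinements are left out to keep the sketch short.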