Techniques for improving the performance of unsupervised approach to sentiment analysis
In this work, a few techniques were proposed to improve the performance of an unsupervised sentiment analysis method that categorizes reviews by sentiment orientation (positive or negative). In reviews, negations can change the polarity of other terms in a sentence, so a new technique for handling negations was proposed. The position of a term in a review also matters: the same term appearing at different positions may convey a different amount of sentiment. A second technique was therefore proposed to assign weights to terms according to where they occur within a review. A third technique uses the presence of exclamation marks in the reviews, since their effect is equally important when categorizing reviews. These concepts were incorporated in the first phase of the proposed method; in the second phase, sentiment orientations were analyzed using a cluster ensemble method. The proposed method was tested on a state-of-the-art movie review dataset and achieved 91.75% accuracy. With the new techniques incorporated, it showed a significant accuracy improvement over some unsupervised and supervised methods
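The three first-phase ideas above (negation handling, position weights, exclamation marks) can be illustrated with a small lexicon-based scorer. This is only a sketch under stated assumptions: the abstract does not give the actual rules, so the toy lexicon, the negation window, the linear position weight, and the exclamation boost are all hypothetical choices, and the sign threshold in `classify` is a stand-in for the cluster ensemble used in the second phase.

```python
import re

# Hypothetical toy lexicon; the lexicon used in the work is not given in the abstract.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "boring": -1.0}
NEGATIONS = {"not", "no", "never"}


def review_score(text, neg_window=2, exclaim_boost=0.5):
    """Score a review; positive values suggest positive sentiment."""
    tokens = re.findall(r"[a-z']+|!", text.lower())
    words = [t for t in tokens if t != "!"]
    n = len(words)
    score = 0.0
    negate_left = 0  # number of upcoming words still covered by a negation
    for i, word in enumerate(words):
        if word in NEGATIONS:
            negate_left = neg_window
            continue
        polarity = LEXICON.get(word, 0.0)
        if negate_left > 0:
            polarity = -polarity  # assumed rule: a negation flips polarity
            negate_left -= 1
        # Assumed position weight: grows linearly from 1.0 (first word) to 2.0 (last word).
        weight = 1.0 + (i / (n - 1) if n > 1 else 0.0)
        score += weight * polarity
    # Assumed exclamation rule: each "!" (up to three) amplifies the score.
    boost = 1.0 + exclaim_boost * min(tokens.count("!"), 3)
    return score * boost


def classify(text):
    # Stand-in for the second phase; the work uses a cluster ensemble instead.
    return "positive" if review_score(text) >= 0 else "negative"


print(classify("The plot was not good"))  # negation flips "good"
print(classify("A great film!"))
```

With these assumed rules, the exclamation mark raises the score of "A great film!" above that of "A great film", and "not good" is scored as negative.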
Study on Different Sentence Level Clustering Algorithms for Text Mining (International Journal of Pure and Applied Research in Engineering and Technology)
Abstract: Clustering is the process of grouping data items. Sentence clustering is used in a variety of applications, such as document classification and categorization and automatic summary generation. In text mining, sentence clustering plays a vital role in text-processing activities. The size of clusters can vary from one cluster to another. In existing systems, many clustering methods and algorithms are used for clustering documents at the sentence level. In this paper, we study the different sentence clustering algorithms. The main aim of this study is to present an overview of sentence level clustering techniques, to find the drawbacks of the existing work, and to consider how these drawbacks could be overcome. From this, we can obtain a more efficient technique or propose a new method to overcome the problems in existing methods, such as time redundancy and data accuracy
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, the distortions can be misleading when regarding cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works
Application of Clustering Techniques to the Context of Group Decision-Making (Aplicação de técnicas de Clustering ao contexto da Tomada de Decisão em Grupo)
Nowadays, decisions made by executives and managers are primarily made in groups. Group decision-making is a process in which a group of people, called participants, work together to analyze a set of variables, considering and evaluating a set of alternatives in order to select one or more solutions. There are many problems associated with group decision-making, notably when the participants cannot meet for any reason, ranging from schedule incompatibility to being in different countries with different time zones. To support this process, Group Decision Support Systems (GDSS) evolved into what we today call web-based GDSS. In GDSS, argumentation is ideal since it makes it easier to use justifications and explanations in interactions between decision-makers so they can sustain their opinions. Aspect Based Sentiment Analysis (ABSA) is a subfield of Argument Mining closely related to Natural Language Processing. It aims to classify opinions at the aspect level and identify the elements of an opinion. Applying ABSA techniques to the group decision-making context results in, for example, the automatic identification of alternatives and criteria. This automatic identification is essential to reduce the time decision-makers take to set themselves up on Group Decision Support Systems and to offer them insights and knowledge about the discussion in which they participate. One of these insights can be the arguments used by the decision-makers about an alternative. Therefore, this dissertation proposes a methodology that uses an unsupervised technique, Clustering, to segment the participants of a discussion based on the arguments they use, so that it can produce knowledge from the current information in the GDSS. This methodology can be hosted in a web service that follows a micro-service architecture and uses Data Preprocessing and Intra-sentence Segmentation in addition to Clustering to achieve the objectives of the dissertation.
Word Embedding is needed when applying clustering techniques to natural language text, in order to transform the text into vectors usable by those techniques. In addition to Word Embedding, Dimensionality Reduction techniques were tested to improve the results. The best approach was found by keeping the same Preprocessing steps and varying the Clustering techniques, Word Embedders, and Dimensionality Reduction techniques. This approach consisted of the KMeans++ clustering technique, using SBERT as the word embedder with UMAP dimensionality reduction, reducing the number of dimensions to 2. This experiment achieved a Silhouette Score of 0.63 with 8 clusters on the baseball dataset, which yielded good clusters according to manual review and word clouds. The same approach obtained a Silhouette Score of 0.59 with 16 clusters on the car brand dataset, which was used as a validation dataset for the approach
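The clustering and evaluation step described in this abstract (k-means++ seeding, k-means, Silhouette Score) can be sketched in plain Python. This is a minimal illustration and not the dissertation's pipeline: it assumes the texts have already been embedded and reduced to 2-D points (the SBERT and UMAP steps are not reproduced here), and the toy points and function names are hypothetical.

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick later centers with probability proportional to D^2."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm started from k-means++ seeds; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's old center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 2-D points standing in for reduced sentence embeddings.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centers = kmeans(points, k=2)
score = silhouette_score(points, labels)
print(labels, round(score, 2))
```

On real data, the same loop would be repeated for several values of k and the value with the highest Silhouette Score kept. The abstract reports the resulting scores (0.63 and 0.59) but does not say how k was selected.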
Domain-Focused Summarization of Polarized Debates
Due to the exponential growth of Internet use, textual content is increasingly published in online media. Every day, more and more news content, blog posts, and scientific articles are published online, opening doors for the text summarization research community to conduct research in those areas. Whilst there are freely accessible repositories for such content, online debates, which have recently become popular, have remained largely unexplored. This thesis addresses the challenge of applying text summarization to online debates. We hold that the task of summarizing online debates should not only focus on summarization techniques but should also consider presenting the summaries in the formats favored by users.
In this thesis, we present how a summarization system is developed to generate online debate summaries in accordance with a designed output called Combination 2, which combines two summaries. The primary objective of the first summary, Chart Summary, is to visualize the debate summary as a bar chart in a high-level view. The chart consists of bars conveying clusters of the salient sentences, labels showing short descriptions of the bars, and the numbers of salient sentences on each of the two opposing sides. The other part, Side-By-Side Summary, is linked to the Chart Summary and shows a more detailed summary of an online debate for the bar clicked by a user. The development of the summarization system is divided into three processes.
In the first process, we create a gold standard dataset of online debates. The dataset contains a collection of debate comments that have been subjectively annotated by 5 judgments. We develop a summarization system with key features to help identify salient sentences in the comments. The sentences selected by the system are evaluated against the annotation results. We found that the system outperforms the baseline.
The second process begins with the generation of the Chart Summary from the salient sentences selected by the system. We propose a framework with two branches: one uses term-based clustering with a term-based labeling method, and the other uses X-means clustering with the MI labeling strategy. Our evaluation results indicate that the X-means clustering approach is the better alternative for clustering.
In the last process, we view the generation of the Side-By-Side Summary as a contradiction detection task. We create two debate entailment datasets derived from the two clustering approaches and annotate them with the Contradiction and Non-Contradiction relations. We develop a classifier and investigate combinations of features that maximize the F1 scores. Based on the proposed features, we found that combinations of between two and eight features yield good results
Measuring the influence of greenwashing practices on the public opinion: a Twitter sentiment analysis
A comparison of statistical machine learning methods in heartbeat detection and classification
In health care, patients with heart problems require a quick response in a clinical setting or in the operating theatre. To that end, automated classification of heartbeats is vital, as some heartbeat irregularities are time consuming to detect. Analysis of electrocardiogram (ECG) signals is therefore an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval and amplitude based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms, focusing especially on a type of arrhythmia known as the ventricular ectopic beat (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and choice of classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, the recommendation of a detection methodology to employ in a practical setting, and the extension of the notion of a mixture of experts to a larger class of algorithms
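As an illustration of what a feature vector built from "interval and amplitude based features together with a few samples" might look like, the sketch below computes the RR intervals around one beat, the amplitude at the R peak, and a short window of raw samples. The exact feature set, the window size, and the function name are assumptions; the abstract does not specify them, and R-peak detection is taken as given.

```python
def beat_features(signal, r_peaks, fs, k, half_window=2):
    """Feature vector for the k-th beat (the beat needs a neighbour on each side).

    signal      : list of ECG samples
    r_peaks     : sorted sample indices of detected R peaks (detection not shown)
    fs          : sampling rate in Hz
    half_window : assumed number of raw samples kept on each side of the R peak
    """
    r = r_peaks[k]
    pre_rr = (r - r_peaks[k - 1]) / fs    # seconds since the previous beat
    post_rr = (r_peaks[k + 1] - r) / fs   # seconds until the next beat
    amplitude = signal[r]                 # amplitude at the R peak
    samples = signal[r - half_window : r + half_window + 1]
    return [pre_rr, post_rr, amplitude, *samples]


# Toy signal: flat baseline with three hypothetical R peaks.
signal = [0.0] * 30
signal[5], signal[15], signal[20] = 1.0, 1.2, 0.8
features = beat_features(signal, r_peaks=[5, 15, 20], fs=10, k=1)
print(features)
```

A vector like this would be computed for every beat and passed to the classifier under comparison. Changing `fs` by resampling the signal is one way to study the effect of the sampling rate on such features.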
Methods for constructing an opinion network for politically controversial topics
The US presidential race, the re-election of President Hugo Chavez, and the economic crisis in Greece and other European countries are some of the controversial topics being played out in the news every day. To understand the landscape of opinions on political controversies, it would be helpful to know which politician or other stakeholder takes which position (support or opposition) on specific aspects of these topics. The work described in this thesis aims to automatically derive a map of the opinions-people network from news and other Web documents. The focus is on acquiring opinions held by various stakeholders on politically controversial topics. This opinions-people network serves as a knowledge base of opinions in the form of (opinion holder) (opinion) (topic) triples. Our system for building this knowledge base makes use of online news sources in order to extract opinions from text snippets. These sources come with a set of unique challenges. For example, processing text snippets involves not just identifying the topic and the opinion, but also attributing that opinion to a specific opinion holder. This requires deep parsing and analysis of the parse tree. Moreover, to ensure uniformity, both the topic and the opinion holder should be mapped to canonical strings, and the topics should also be organized into a hierarchy. Our system relies on two main components: i) acquiring opinions, which uses a combination of techniques to extract opinions from online news sources, and ii) organizing topics, which crawls and extracts debates from online sources and organizes these debates in a hierarchy of politically controversial topics. We present systematic evaluations of the different components of our system and show their high accuracies. We also present some of the applications that require political analysis, such as identifying flip-floppers, political bias, and dissenters. Such applications can make use of the knowledge base of opinions
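The (opinion holder) (opinion) (topic) triples lend themselves to a simple record type, and one of the named applications, identifying flip-floppers, can be sketched as a query over such records. This is a hypothetical illustration only: the thesis's actual schema and detection method are not given in the abstract, opinions are reduced here to a "support"/"oppose" stance, and the holders and topics are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class Opinion(NamedTuple):
    holder: str  # canonical name of the opinion holder
    stance: str  # assumed simplification: "support" or "oppose"
    topic: str   # canonical topic string


def find_flip_floppers(triples):
    """Return holders who have taken more than one stance on the same topic."""
    stances = defaultdict(set)
    for t in triples:
        stances[(t.holder, t.topic)].add(t.stance)
    return sorted({holder for (holder, _), seen in stances.items() if len(seen) > 1})


# Made-up knowledge base entries.
knowledge_base = [
    Opinion("Politician A", "support", "bailout"),
    Opinion("Politician A", "oppose", "bailout"),
    Opinion("Politician B", "oppose", "bailout"),
    Opinion("Politician B", "support", "tax reform"),
]
flip_floppers = find_flip_floppers(knowledge_base)
print(flip_floppers)
```

Mapping holders and topics to canonical strings, as the abstract describes, is what makes this kind of grouping by (holder, topic) reliable. Without it, the same politician under two spellings would never be matched.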
Knowledge management and Discovery for advanced Enterprise Knowledge Engineering
2012 - 2013. The research work mainly addresses issues related to the adoption of models, methodologies, and knowledge management tools that make pervasive use of the latest Semantic Web technologies to improve business processes and Enterprise 2.0 applications.
The first phase of the research focused on the study and analysis of the state of the art and the problems of Knowledge Discovery in Databases, paying particular attention to data mining systems. The most innovative approaches investigated for "Enterprise Knowledge Engineering" are listed below.
In detail, the problems analyzed are those relating to architectural aspects and the integration of systems, legacy or otherwise. The intended contribution of the research is the identification and definition of a uniform and general model, a "Knowledge Enterprise Model", which is original with respect to the canonical approaches to enterprise architecture (for example, the Object Management Group (OMG) standard).
The introduction of the tools and principles of Enterprise 2.0 in the company has been investigated and, at the same time, appropriate Semantic Enterprise based solutions have been defined for the problem of information fragmentation and for improving knowledge discovery and functional knowledge sharing.
All studies and analyses are finalized and validated by defining a methodology and related supporting software tools for improving the processes related to the life cycles of best practices across the enterprise. Collaborative tools, knowledge modeling, algorithms, and knowledge discovery and extraction are applied synergistically to support these processes. [edited by author] XII n.s.