75 research outputs found

    QMOS: Query-based multi-documents opinion-oriented summarization

    Get PDF
    Sentiment analysis concerns the study of opinions expressed in a text. This paper presents the QMOS method, which employs a combination of sentiment analysis and summarization approaches. It is a lexicon-based method to query-based multi-documents summarization of opinion expressed in reviews. QMOS combines multiple sentiment dictionaries to improve word coverage limit of the individual lexicon. A major problem for a dictionary-based approach is the semantic gap between the prior polarity of a word presented by a lexicon and the word polarity in a specific context. This is due to the fact that, the polarity of a word depends on the context in which it is being used. Furthermore, the type of a sentence can also affect the performance of a sentiment analysis approach. Therefore, to tackle the aforementioned challenges, QMOS integrates multiple strategies to adjust word prior sentiment orientation while also considers the type of sentence. QMOS also employs the Semantic Sentiment Approach to determine the sentiment score of a word if it is not included in a sentiment lexicon. On the other hand, the most of the existing methods fail to distinguish the meaning of a review sentence and user's query when both of them share the similar bag-of-words; hence there is often a conflict between the extracted opinionated sentences and users’ needs. However, the summarization phase of QMOS is able to avoid extracting a review sentence whose similarity with the user's query is high but whose meaning is different. The method also employs the greedy algorithm and query expansion approach to reduce redundancy and bridge the lexical gaps for similar contexts that are expressed using different wording, respectively. Our experiment shows that the QMOS method can significantly improve the performance and make QMOS comparable to other existing methods

    Document analysis by means of data mining techniques

    Get PDF
    The huge amount of textual data produced everyday by scientists, journalists and Web users, allows investigating many different aspects of information stored in the published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amount of unstructured textual data. Text mining also known as text data mining is the processing of extracting high quality information (focusing relevance, novelty and interestingness) from text by identifying patterns etc. Text mining typically involves the process of structuring input text by means of parsing and other linguistic features or sometimes by removing extra data and then finding patterns from structured data. Patterns are then evaluated at last and interpretation of output is performed to accomplish the desired task. Recently, text mining has got attention in several fields such as in security (involves analysis of Internet news), for commercial (for search and indexing purposes) and in academic departments (such as answering query). Beyond searching the documents consisting the words given in a user query, text mining may provide direct answer to user by semantic web for content based (content meaning and its context). It can also act as intelligence analyst and can also be used in some email spam filters for filtering out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge existed in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, the problem of extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics have been to study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches, firmly established in the subject of transactional data (e.g., frequent itemset mining), to textual documents. Frequent itemset mining is a widely exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well-established, the usage of frequent itemsets in textual document summarization has never been investigated so far. A work is carried on exploiting frequent itemsets for the purpose of multi-document summarization so a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer) is presented, that is based on an itemset-based model, i.e., a framework comprise of frequent itemsets, taken out from the document collection. Highly representative and not redundant sentences are selected for generating summary by considering both sentence coverage, with respect to a sentence relevance score, based on tf-idf statistics, and a concise and highly informative itemset-based model. To evaluate the ItemSum performance a suite of experiments on a collection of news articles has been performed. Obtained results show that ItemSum significantly outperforms mostly used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance are investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover in an effective way all the semantically relevant data facets. A step beyond towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries could not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named Entity Recognition from Yago ontology is being used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned. These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named as SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content coming from Twitter has been exploited to discover current social trends and improve the appealing of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed

    Studying, developing, and experimenting contextual advertising systems

    Get PDF
    The World Wide Web has grown so fast in the last decade and it is today a vital daily part of people. The Internet is used for many purposes by an ever growing number of users, mostly for daily activities, tasks, and services. To face the needs of users, an efficient and effective access to information is required. To deal with this task, the adoption of Information Retrieval and Information Filtering techniques is continuously growing. Information Re-trieval (IR) is the field concerned with searching for documents, information within documents, and metadata about documents, as well as searching for structured storage, relational databases, and the World Wide Web. Infor- mation Filtering deals with the problem of selecting relevant information for a given user, according to her/his preferences and interest. Nowadays, Web advertising is one of the major sources of income for a large number of websites. Its main goal is to suggest products and services to the still ever growing population of Internet users. Web advertising is aimed at suggesting products and services to the users. A significant part of Web ad-vertising consists of textual ads, the ubiquitous short text messages usually marked as sponsored links. There are two primary channels for distributing ads: Sponsored Search (or Paid Search Advertising) and Contextual Ad-vertising (or Content Match). Sponsored Search advertising is the task of displaying ads on the page returned from a Web search engine following a query. Contextual Advertising (CA) displays ads within the content of a generic, third party, webpage. In this thesis I study, develop, and evaluated novel solutions in the field of Contextual Advertising. In particular, I studied and developed novel text summarization techniques, I adopted a novel semantic approach, I studied and adopted collaborative approaches, I started a conjunct study of Contex-tual Advertising and Geo-Localization, and I study the task of advertising in the field of Multi-Modal Aggregation. The thesis is organized as follows. In Chapter 1, we briefly describe the main aspects of Information Retrieval. Following, the Chapter 2 shows the problem of Contextual Advertising and describes the main contributes of the literature. Chapter 3 sketches a typical adopted approach and the eval-uation metrics of a Contextual Advertising system. Chapter 4 is related to the syntactic aspects, and its focus is on text summarization. In Chapter 5 the semantic aspects are taken into account, and a novel approach based on ConceptNet is proposed. Chapter 6 proposes a novel view of CA by the adoption of a collaborative filtering approach. Chapter 7 shows a prelim-inary study of Geo Location, performed in collaboration with the Yahoo! Research center in Barcelona. The target is to study several techniques of suggesting localized advertising in the field of mobile applications and search engines. In Chapter 8 is shown a joint work with the RAI Centre for Research and Technological Innovation. The main goal is to study and propose a system of advertising for Multimodal Aggregation data. Chapter 9 ends this work with conclusions and future directions

    Multi Domain Semantic Information Retrieval Based on Topic Model

    Get PDF
    Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amount of information is increasingly accumulated on the Web. The gigantic information explosion increases the need for discovering new tools that retrieve meaningful knowledge from various complex information sources. Thus, techniques primarily used to search and extract important information from numerous database sources have been a key challenge in current IR systems. Topic modeling is one of the most recent techniquesthat discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model that generates topics from large corpus of resources, such as text, images, and audio.It has been widely used in many areas in information retrieval and data mining, providing efficient way of identifying latent topics among document collections. However, LDA has a drawback that topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDAseems not to consider the meaning of words, but rather to infer hidden topics based on a statisticalapproach. However, LDA can cause either reduction in the quality of topic words or increase in loose relations between topics. In order to solve the previous problems, we propose a domain specific topic model that combines domain concepts with LDA. Two domain specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge to a specific one which solves the unknown domain problem. Our proposed model is extensively tested on various applications, query expansion, classification, and summarization, to demonstrate the effectiveness of the model. Experimental results show that the proposed model significantly increasesthe performance of applications

    Neural Representations of Concepts and Texts for Biomedical Information Retrieval

    Get PDF
    Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user\u27s query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations? ). This refined setup is often termed as question answering (QA) in the IR and natural language processing (NLP) communities. In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized given the types of queries and the variations in the texts are different from that of general Web documents. For example, scientific articles are more formal with longer sentences but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms ) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations ). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR. Representations of documents and queries are at the core of IR methods and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword based indexing of documents (the so called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels) and this has lead to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods. This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g, factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts

    Leveraging Formulae and Text for Improved Math Retrieval

    Get PDF
    Large collections containing millions of math formulas are available online. Retrieving math expressions from these collections is challenging. Users can use formula, formula+text, or math questions to express their math information needs. The structural complexity of formulas requires specialized processing. Despite the existence of math search systems and online community question-answering websites for math, little is known about mathematical information needs. This research first explores the characteristics of math searches using a general search engine. The findings show how math searches are different from general searches. Then, test collections for math-aware search are introduced. The ARQMath test collections have two main tasks: 1) finding answers for math questions and 2) contextual formula search. In each test collection (ARQMath-1 to -3) the same collection is used, Math Stack Exchange posts from 2010 to 2018, introducing different topics for each task. Compared to the previous test collections, ARQMath has a much larger number of diverse topics, and improved evaluation protocol. Another key role of this research is to leverage text and math information for improved math information retrieval. Three formula search models that only use the formula, with no context are introduced. The first model is an n-gram embedding model using both symbol layout tree and operator tree representations. The second model uses tree-edit distance to re-rank the results from the first model. Finally, a learning-to-rank model that leverages full-tree, sub-tree, and vector similarity scores is introduced. To use context, Math Abstract Meaning Representation (MathAMR) is introduced, which generalizes AMR trees to include math formula operations and arguments. This MathAMR is then used for contextualized formula search using a fine-tuned Sentence-BERT model. The experiments show tree-edit distance ranking achieves the current state-of-the-art results on contextual formula search task, and the MathAMR model can be beneficial for re-ranking. This research also addresses the answer retrieval task, introducing a two-step retrieval model in which similar questions are first found and then answers previously given to those similar questions are ranked. The proposed model, fine-tunes two Sentence-BERT models, one for finding similar questions and another one for ranking the answers. For Sentence-BERT model, raw text as well as MathAMR are used

    Towards Automated Related Work Summarization

    Get PDF
    Master'sMASTER OF SCIENC
    corecore