498 research outputs found
Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings
Keyword extraction is a fundamental task in natural language processing that
facilitates mapping of documents to a concise set of representative single and
multi-word phrases. Keywords from text documents are primarily extracted using
supervised and unsupervised approaches. In this paper, we present an
unsupervised technique that uses a combination of theme-weighted personalized
PageRank algorithm and neural phrase embeddings for extracting and ranking
keywords. We also introduce an efficient way of processing text documents and
training phrase embeddings using existing techniques. We share an evaluation
dataset derived from an existing dataset that is used for choosing the
underlying embedding model. The evaluations for ranked keyword extraction are
performed on two benchmark datasets comprising of short abstracts (Inspec), and
long scientific papers (SemEval 2010), and is shown to produce results better
than the state-of-the-art systems.Comment: preprint for paper accepted in Proceedings of 1st IEEE International
Conference on Multimedia Information Processing and Retrieva
Keywords at Work: Investigating Keyword Extraction in Social Media Applications
This dissertation examines a long-standing problem in Natural Language Processing (NLP) -- keyword extraction -- from a new angle. We investigate how keyword extraction can be formulated on social media data, such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and use the resulting system with success to uncover patterns in a new dataset -- student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. The system while used on student discussions, uncover new and exciting patterns.
While each of the above problems is conceptually distinct, they share two key common elements -- keywords and social data. Social data can be messy, hard-to-interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd
A User-Centered Concept Mining System for Query and Document Understanding at Tencent
Concepts embody the knowledge of the world and facilitate the cognitive
processes of human beings. Mining concepts from web documents and constructing
the corresponding taxonomy are core research problems in text understanding and
support many downstream tasks such as query analysis, knowledge base
construction, recommendation, and search. However, we argue that most prior
studies extract formal and overly general concepts from Wikipedia or static web
pages, which are not representing the user perspective. In this paper, we
describe our experience of implementing and deploying ConcepT in Tencent QQ
Browser. It discovers user-centered concepts at the right granularity
conforming to user interests, by mining a large amount of user queries and
interactive search click logs. The extracted concepts have the proper
granularity, are consistent with user language styles and are dynamically
updated. We further present our techniques to tag documents with user-centered
concepts and to construct a topic-concept-instance taxonomy, which has helped
to improve search as well as news feeds recommendation in Tencent QQ Browser.
We performed extensive offline evaluation to demonstrate that our approach
could extract concepts of higher quality compared to several other existing
methods. Our system has been deployed in Tencent QQ Browser. Results from
online A/B testing involving a large number of real users suggest that the
Impression Efficiency of feeds users increased by 6.01% after incorporating the
user-centered concepts into the recommendation framework of Tencent QQ Browser.Comment: Accepted by KDD 201
AI-assisted patent prior art searching - feasibility study
This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF). The RPF fund was set up to help address barriers to innovation in the UK economy
Neural Representations of Concepts and Texts for Biomedical Information Retrieval
Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user\u27s query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the input query. Furthermore, current systems may also maintain preprocessed structured knowledge derived from textual data as so called knowledge graphs, so certain types of queries that are posed as questions can be parsed as such; a response can be an output of one or more named entities instead of a ranked list of documents (e.g., what diseases are associated with EGFR mutations? ). This refined setup is often termed as question answering (QA) in the IR and natural language processing (NLP) communities.
In biomedicine and healthcare, specialized corpora are often at play including research articles by scientists, clinical notes generated by healthcare professionals, consumer forums for specific conditions (e.g., cancer survivors network), and clinical trial protocols (e.g., www.clinicaltrials.gov). Biomedical IR is specialized given the types of queries and the variations in the texts are different from that of general Web documents. For example, scientific articles are more formal with longer sentences but clinical notes tend to have less grammatical conformity and are rife with abbreviations. There is also a mismatch between the vocabulary of consumers and the lingo of domain experts and professionals. Queries are also different and can range from simple phrases (e.g., COVID-19 symptoms ) to more complex implicitly fielded queries (e.g., chemotherapy regimens for stage IV lung cancer patients with ALK mutations ). Hence, developing methods for different configurations (corpus, query type, user type) needs more deliberate attention in biomedical IR.
Representations of documents and queries are at the core of IR methods and retrieval methodology involves coming up with these representations and matching queries with documents based on them. Traditional IR systems follow the approach of keyword based indexing of documents (the so called inverted index) and matching query phrases against the document index. It is not difficult to see that this keyword based matching ignores the semantics of texts (synonymy at the lexeme level and entailment at phrase/clause/sentence levels) and this has lead to dimensionality reduction methods such as latent semantic indexing that generally have scale-related concerns; such methods also do not address similarity at the sentence level. Since the resurgence of neural network methods in NLP, the IR field has also moved to incorporate advances in neural networks into current IR methods.
This dissertation presents four specific methodological efforts toward improving biomedical IR. Neural methods always begin with dense embeddings for words and concepts to overcome the limitations of one-hot encoding in traditional NLP/IR. In the first effort, we present a new neural pre-training approach to jointly learn word and concept embeddings for downstream use in applications. In the second study, we present a joint neural model for two essential subtasks of information extraction (IE): named entity recognition (NER) and entity normalization (EN). Our method detects biomedical concept phrases in texts and links them to the corresponding semantic types and entity codes. These first two studies provide essential tools to model textual representations as compositions of both surface forms (lexical units) and high level concepts with potential downstream use in QA. In the third effort, we present a document reranking model that can help surface documents that are likely to contain answers (e.g, factoids, lists) to a question in a QA task. The model is essentially a sentence matching neural network that learns the relevance of a candidate answer sentence to the given question parametrized with a bilinear map. In the fourth effort, we present another document reranking approach that is tailored for precision medicine use-cases. It combines neural query-document matching and faceted text summarization. The main distinction of this effort from previous efforts is to pivot from a query manipulation setup to transforming candidate documents into pseudo-queries via neural text summarization. Overall, our contributions constitute nontrivial advances in biomedical IR using neural representations of concepts and texts
EntropyRank: Unsupervised Keyphrase Extraction via Side-Information Optimization for Language Model-based Text Compression
We propose an unsupervised method to extract keywords and keyphrases from
texts based on a pre-trained language model (LM) and Shannon's information
maximization. Specifically, our method extracts phrases having the highest
conditional entropy under the LM. The resulting set of keyphrases turns out to
solve a relevant information-theoretic problem: if provided as side
information, it leads to the expected minimal binary code length in compressing
the text using the LM and an entropy encoder. Alternately, the resulting set is
an approximation via a causal LM to the set of phrases that minimize the
entropy of the text when conditioned upon it. Empirically, the method provides
results comparable to the most commonly used methods in various keyphrase
extraction benchmark challenges
Probabilistic latent variable models for knowledge discovery and optimization
I conduct a systematic study of probabilistic latent variable models (PLVMs) with applications to knowledge discovery and optimization. Probabilistic modeling is a principled means to gain insight of data. By assuming that the observed data are generated from a distribution, we can estimate its density, or the statistics of our interest, by either Maximum Likelihood Estimation or Bayesian inference, depending on whether there is a prior distribution for the parameters of the assumed data distribution.
One of the primary goals of various machine learning/data mining models is to reveal the underlying knowledge of observed data. A common practice is to introduce latent variables, which are modeled together with the observations. Such latent variables compute, for example, the class assignments (labels), the cluster membership, as well as other unobserved measurements of the data. Besides, proper exploitation of latent variables facilities the optimization itself, which leads to computationally efficient inference algorithms.
In this thesis, I describe a range of applications where latent variables can be leveraged for knowledge discovery and efficient optimization. Works in this thesis demonstrate that PLVMs are a powerful tool for modeling incomplete observations. Through incorporating latent variables and assuming that the observations such as citations, pairwise preferences as well as text are generated following tractable distributions parametrized by the latent variables, PLVMs are flexible and effective to discover knowledge in data mining problems, where the knowledge is mathematically modelled as continuous or discrete values, distributions or uncertainty. In addition, I also explore PLVMs for deriving efficient algorithms. It has been shown that latent variables can be employed as a means for model reduction and facilitates the computation/sampling of intractable distributions.
Our results lead to algorithms which take advantage of latent variables in probabilistic models. We conduct experiments against state-of-the-art models and empirical evaluation shows that our proposed approaches improve both learning performance and computational efficiency
- …