Deformable Prototypes for Encoding Shape Categories in Image Databases
We describe a method for shape-based image database search that uses deformable prototypes to represent categories. Rather than directly comparing a candidate shape with all shape entries in the database, shapes are compared in terms of the types of nonrigid deformations (differences) that relate them to a small subset of representative prototypes. To solve the shape correspondence and alignment problem, we employ the technique of modal matching, an information-preserving shape decomposition for matching, describing, and comparing shapes despite sensor variations and nonrigid deformations. In modal matching, shape is decomposed into an ordered basis of orthogonal principal components. We demonstrate the utility of this approach for shape comparison in 2-D image databases. (Office of Naval Research, Young Investigator Award N00014-06-1-0661.)
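A minimal sketch of the modal idea: shapes are described by the coefficients of their deformation from a prototype, projected onto an ordered orthogonal basis. Here a low-frequency cosine basis along the contour stands in for the physical vibration modes of true modal matching; the function names and choice of basis are illustrative, not the paper's.

```python
import numpy as np

def cosine_modes(n_points, n_modes):
    """Ordered orthonormal basis of low-frequency cosine modes along the
    contour parameter (a simple stand-in for physical vibration modes)."""
    t = np.arange(n_points)
    basis = np.stack([np.cos(np.pi * k * (t + 0.5) / n_points)
                      for k in range(n_modes)])
    return basis / np.linalg.norm(basis, axis=1, keepdims=True)

def deformation_coefficients(prototype, candidate, n_modes=6):
    """Describe a candidate shape by the low-order modal coefficients of
    its deformation from a prototype; both are (N, 2) arrays of
    corresponding, already-aligned boundary points."""
    d = (candidate - candidate.mean(0)) - (prototype - prototype.mean(0))
    basis = cosine_modes(len(d), n_modes)   # (n_modes, N)
    return basis @ d                        # (n_modes, 2) coefficients

def shape_distance(prototype, candidate, n_modes=6):
    """Compare shapes by the magnitude of their deformation coefficients."""
    return float(np.linalg.norm(
        deformation_coefficients(prototype, candidate, n_modes)))
```

Because translation is removed before projection, only genuine deformations (e.g., stretching) contribute to the distance.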
Enhancing Biomedical Text Summarization Using Semantic Relation Extraction
Automatic text summarization for a biomedical concept can help researchers efficiently grasp the key points of a topic from a large amount of biomedical literature. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., the H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract the semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that interpret them from the document collection to generate the text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves performance, and that our results are better than those of MEAD, a well-known text summarization system.
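A toy sketch of the three-stage pipeline above. The `extract` callable stands in for SemRep (whose actual API is not shown here), and the relevance scoring is a deliberately crude substring count, not the paper's retrieval method:

```python
def rank_relations(relations, concept):
    """Stage 2 sketch: score (subject, predicate, object) triples by how
    many of their arguments mention the query concept, a crude stand-in
    for relation-level retrieval."""
    scored = [(r, (concept in r[0]) + (concept in r[2])) for r in relations]
    return [r for r, s in sorted(scored, key=lambda x: -x[1]) if s > 0]

def summarize(sentences, extract, concept, k=2):
    """Stages 1 and 3 sketch: `extract` maps a sentence to its relation
    triples (standing in for SemRep); sentences that support the
    top-ranked relations become the summary."""
    pairs = [(s, tuple(r)) for s in sentences for r in extract(s)]
    top = {r for r in rank_relations([r for _, r in pairs], concept)[:k]}
    summary = []
    for s, r in pairs:
        if r in top and s not in summary:
            summary.append(s)
    return summary[:k]
```

For example, with a toy extractor that emits `("H1N1", "CAUSES", "fever")` for sentences mentioning fever, only the sentence supporting that relation survives into the summary.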
Towards an Architecture for Efficient Distributed Search of Multimodal Information
The creation of very large-scale multimedia search engines, with more than one billion images and videos, is a pressing need of digital societies where data is generated by multiple connected devices. Distributing search indexes in cloud environments is the inevitable solution for dealing with the increasing scale of image and video collections. The distribution of such indexes in this setting raises multiple challenges, such as the even partitioning of the data space, load balancing across index nodes, and the fusion of results computed over multiple nodes. The main question behind this thesis is: how can the computational complexity of multimedia retrieval be reduced and distributed?
This thesis studies the extension of sparse hash inverted indexing to distributed settings. The main goal is to ensure that indexes are uniformly distributed across computing nodes while keeping similar documents on the same nodes. Load balancing is performed at both the node and index level, to guarantee that the retrieval process is not delayed by nodes that have to inspect larger subsets of the index.
Multimodal search requires the combination of search results from individual modalities and document features. This thesis studies rank fusion techniques focused on reducing complexity by automatically selecting only the features that improve retrieval effectiveness.
The achievements of this thesis span both distributed indexing and rank fusion research. Experiments across multiple datasets show that sparse hashes can be used to distribute documents and queries across index entries in a balanced and redundant manner across nodes. Rank fusion results show that it is possible to reduce retrieval complexity and improve efficiency by searching only a subset of the feature indexes.
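One way the bucket-to-node idea could look, with a toy sparse hash built from term hashes. The bucket count, hash choice, and round-robin placement are illustrative assumptions, not the thesis design:

```python
import hashlib

def sparse_hash(text, n_buckets=64, k=3):
    """Toy sparse hash: map a document to (at most) its k smallest
    term-hash buckets, so documents sharing terms share buckets."""
    h = sorted({int(hashlib.md5(w.encode()).hexdigest(), 16) % n_buckets
                for w in text.lower().split()})
    return h[:k]

def build_index(docs, n_nodes=4, n_buckets=64):
    """Assign each bucket to a node (round-robin over bucket ids) and
    post every document under each of its buckets: placement is both
    balanced and redundant across nodes."""
    nodes = [dict() for _ in range(n_nodes)]
    for doc_id, text in docs:
        for b in sparse_hash(text, n_buckets):
            nodes[b % n_nodes].setdefault(b, []).append(doc_id)
    return nodes

def search(nodes, query, n_buckets=64):
    """Probe only the nodes owning the query's buckets, not every node."""
    hits = []
    for b in sparse_hash(query, n_buckets):
        hits += nodes[b % len(nodes)].get(b, [])
    return hits
```

Because a query only touches the nodes owning its buckets, retrieval cost stays sublinear in the number of nodes.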
Non-Parametric Graph-Based Methods for Large-Scale Problems
The notion of similarity between observations plays a fundamental role in many Machine Learning and Data Mining algorithms. In many of these methods, the fundamental problem of prediction, i.e., making assessments and/or inferences about future observations from past ones, boils down to how ``similar'' the future cases are to the already observed ones. However, similarity is not always obtained through the traditional distance metrics. Data-driven similarity metrics, in particular, come into play where the traditional absolute metrics are not sufficient for the task at hand due to the special structure of the observed data. A common approach for computing data-driven similarity is to aggregate the local absolute similarities (which are not data-driven and can be computed in closed form) to infer a global data-driven similarity value between any pair of observations. Graph-based methods offer a natural framework to do so. By incorporating these methods, many Machine Learning algorithms that are designed to work with absolute distances can be applied to problems with data-driven distances. This makes graph-based methods very effective tools for many real-world problems.
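The aggregation idea above can be sketched by building a k-nearest-neighbour graph from local Euclidean distances and taking shortest-path length as the global, data-driven distance (in the spirit of geodesic methods such as Isomap; the specifics below are illustrative, not the thesis's algorithms):

```python
import heapq
import math

def knn_graph(points, k=3):
    """Local absolute similarities: connect each point to its k nearest
    neighbours under plain Euclidean distance."""
    adj = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        near = sorted((math.dist(p, q), j)
                      for j, q in enumerate(points) if j != i)[:k]
        for d, j in near:
            adj[i].append((j, d))
            adj[j].append((i, d))
    return adj

def graph_distance(adj, src, dst):
    """Global data-driven distance: shortest-path length through the
    graph (Dijkstra), so distance follows the shape of the data."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue
        for v, w in adj[u]:
            if d + w < best.get(v, math.inf):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return math.inf
```

On points along a line, the graph distance between the endpoints equals the path length through the intermediate points, rather than any single absolute metric evaluated in isolation.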
In this thesis, the major problem that I want to address is the scalability of graph-based methods. With the rise of large-scale, high-dimensional datasets in many real-world applications, many Machine Learning algorithms do not scale well when applied to these problems, and graph-based methods are no exception. Both the large number of observations and the high dimensionality hurt graph-based methods, computationally and statistically. While the large number of observations poses more of a computational problem, the high dimensionality problem has more of a statistical nature. In this thesis, I address both of these issues in depth and review the common solutions proposed for them in the literature. Moreover, for each of these problems, I propose novel solutions, with experimental results depicting the merits of the proposed algorithms. Finally, I discuss the contribution of the proposed work from a broader viewpoint and draw some future directions for the current work.
A framework for clustering and adaptive topic tracking on evolving text and social media data streams
Recent advances in and widespread usage of online web services and social media platforms, coupled with ubiquitous low-cost devices, mobile technologies, and the increasing capacity of lower-cost storage, have led to a proliferation of Big Data, ranging from news, e-commerce clickstreams, and online business transactions to continuous event logs and social media expressions. These large amounts of online data, often referred to as data streams because they are generated at extremely high throughput or velocity, can make conventional and classical data analytics methodologies obsolete. For these reasons, the issues of managing and analyzing data streams have been researched extensively in recent years. The special case of social media Big Data brings additional challenges, particularly because of the unstructured nature of the data, specifically free text. One classical approach to mining text data has been Topic Modeling. Topic Models are statistical models that can be used to discover the abstract ``topics'' that may occur in a corpus of documents. Topic models have emerged as a powerful technique in machine learning and data science, providing a good balance between simplicity and complexity. They also provide sophisticated insight without the need for real natural language understanding. However, they have not been designed to cope with the type of text data that is abundant on social media platforms, but rather for traditional medium-size corpora consisting of longer documents, adhering to a specific language and typically spanning a stable set of topics. Unlike traditional document corpora, social media messages tend to be very short, sparse, and noisy, and do not adhere to a standard vocabulary, linguistic patterns, or stable topic distributions.
They are also generated at a high velocity that imposes heavy demands on topic modeling, and their evolving, dynamic nature quickly makes any set of topic modeling results stale in the face of changes in the textual content and topics discussed within social media streams. In this dissertation, we propose an integrated topic modeling framework built on top of an existing stream-clustering framework called Stream-Dashboard, which can extract, isolate, and track topics over any given time period. In this new framework, Stream-Dashboard first clusters the data stream points into homogeneous groups. Data from each group is then ushered to the topic modeling framework, which extracts finer topics from the group. The proposed framework tracks the evolution of the clusters over time to detect milestones corresponding to changes in topic evolution, and to trigger an adaptation of the learned groups and topics at each milestone. The proposed approach to topic modeling differs from a generic Topic Modeling approach in that it works in a compartmentalized fashion, where the input document stream is split into distinct compartments and Topic Modeling is applied to each compartment separately. Furthermore, we propose extensions to existing topic modeling and stream clustering methods, including: an adaptive query reformulation approach to help focus topic discovery over time; a topic modeling extension with adaptive hyper-parameters and an infinite vocabulary; and an adaptive stream clustering algorithm incorporating the automated estimation of dynamic, cluster-specific temporal scales for adaptive forgetting, to facilitate clustering in a fast-evolving data stream.
Our experimental results show that the proposed adaptive-forgetting clustering algorithm mines better-quality clusters; that the proposed compartmentalized framework mines topics of better quality than competitive baselines; and that the proposed framework can automatically adapt to focus on changing topics using the proposed query reformulation strategy.
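A toy sketch of the compartmentalized idea described above: stream documents are first routed to a compartment (cluster), then each compartment maintains its own topic, with exponential forgetting so stale topics fade. The token-overlap routing, decay factor, and threshold are illustrative stand-ins for Stream-Dashboard's actual clustering and forgetting mechanisms:

```python
from collections import Counter

class CompartmentalizedTopics:
    """Route each stream document to a compartment by token overlap;
    each compartment keeps a decayed term-weight profile as its topic."""

    def __init__(self, threshold=0.2, decay=0.9):
        self.clusters = []          # one Counter of term weights per compartment
        self.threshold = threshold  # minimum overlap to join a compartment
        self.decay = decay          # exponential forgetting factor

    def observe(self, doc):
        tokens = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for c in self.clusters:
            overlap = sum((tokens & c).values())
            sim = overlap / max(sum(tokens.values()), 1)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or best_sim < self.threshold:
            best = Counter()        # changing content spawns a new compartment
            self.clusters.append(best)
        for c in self.clusters:     # adaptive forgetting: old terms fade
            for t in c:
                c[t] *= self.decay
        best.update(tokens)

    def topics(self, k=3):
        """Top-k terms of each compartment, i.e., its current topic."""
        return [[t for t, _ in c.most_common(k)] for c in self.clusters]
```

Feeding it two sports messages and one politics message yields two compartments, each with its own top terms, without any batch reprocessing of the stream.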
Pretrained Transformers for Text Ranking: BERT and Beyond
The goal of text ranking is to generate an ordered list of texts retrieved
from a corpus in response to a query. Although the most common formulation of
text ranking is search, instances of the task can also be found in many natural
language processing applications. This survey provides an overview of text
ranking with neural network architectures known as transformers, of which BERT
is the best-known example. The combination of transformers and self-supervised
pretraining has been responsible for a paradigm shift in natural language
processing (NLP), information retrieval (IR), and beyond. In this survey, we
provide a synthesis of existing work as a single point of entry for
practitioners who wish to gain a better understanding of how to apply
transformers to text ranking problems and researchers who wish to pursue work
in this area. We cover a wide range of modern techniques, grouped into two
high-level categories: transformer models that perform reranking in multi-stage
architectures and dense retrieval techniques that perform ranking directly.
There are two themes that pervade our survey: techniques for handling long
documents, beyond typical sentence-by-sentence processing in NLP, and
techniques for addressing the tradeoff between effectiveness (i.e., result
quality) and efficiency (e.g., query latency, model and index size). Although
transformer architectures and pretraining techniques are recent innovations,
many aspects of how they are applied to text ranking are relatively well
understood and represent mature techniques. However, there remain many open
research questions, and thus in addition to laying out the foundations of
pretrained transformers for text ranking, this survey also attempts to
prognosticate where the field is heading.
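The multi-stage reranking pattern the survey describes can be sketched as follows; `scorer` stands in for an expensive cross-encoder such as BERT, and the first stage is a toy term-overlap retriever rather than a real BM25 index:

```python
def first_stage(query, corpus, k=10):
    """Cheap candidate generation: term-overlap scoring over the whole
    corpus (standing in for BM25 over an inverted index)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def rerank(query, candidates, scorer, k=3):
    """Expensive second stage: apply `scorer(query, doc)` (standing in
    for a neural cross-encoder) only to the shortlist, never the full
    corpus, which is what makes the pipeline tractable."""
    return sorted(candidates, key=lambda d: -scorer(query, d))[:k]
```

The effectiveness/efficiency tradeoff the survey highlights lives in the shortlist size `k`: a larger shortlist gives the expensive scorer more chances to recover relevant documents, at higher query latency.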
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity. (Comment: manuscript, 43 pages with 3 tables; supplemental material, 43 pages with 3 tables.)
Supporting the Discoverability of Open Educational Resources: on the Scent of a Hidden Treasury
Open Educational Resources (OERs), now available in large numbers, have a considerable potential to improve many aspects of society, yet one of the factors limiting this positive impact is the difficulty to discover them. This thesis investigates and proposes strategies to better support educators in discovering OERs.
The literature suggests that the effectiveness of existing search systems, including for OER discovery, could be improved by supporting users, such as teachers, in carrying out more exploratory search activities closer to their existing methods of working. Hence, a preliminary taxonomy of OER-related search tasks was produced, based on an analysis of the literature, interpreted through Information Foraging Theory. This taxonomy was empirically evaluated to preliminarily identify a set of search tasks that involve finding other OERs similar to one that has already been identified, a process generally referred to as Query By Example (QBE). Following the Design Science Research methodology, three prototypes to support as well as refine those tasks were iteratively designed, implemented, and evaluated, involving an increasing number of educators in usability-oriented studies. The resulting high-level, domain-oriented blended search/recommendation strategy transparently replicates Google searches in specialized networks and identifies similar resources with a QBE strategy. It makes use of a domain-oriented similarity metric based on shared alignments to educational standards, and clusters results into expandable classes of comparable degrees of similarity. The summative evaluation shows that educators do appreciate this strategy because it is exploratory and, by balancing similarity and diversity, supports their high-level tasks, such as lesson planning and the personalization of education. Finally, potential barriers and opportunities for the uptake of OER discovery tools were investigated in a structured interview study with experts from the OER field. Identified issues included how to work across multiple OER portals, variability in the use of metadata, and how to align with the working practices of teachers.
The findings of the thesis can be used to inform the research and development of methods and tools for OER discovery, as well as their deployment to serve the needs of educators.
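As an illustration of a QBE similarity built on shared alignments to educational standards (the metric and the standard identifiers below are hypothetical, not the thesis's actual formulation):

```python
def alignment_similarity(a, b):
    """Domain-oriented similarity sketch: Jaccard overlap of the sets of
    educational standards two OERs are aligned to."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def query_by_example(seed_alignments, catalogue, k=3):
    """Rank catalogue OERs by alignment overlap with the seed resource;
    `catalogue` maps resource id -> set of standard identifiers."""
    ranked = sorted(catalogue.items(),
                    key=lambda kv: -alignment_similarity(seed_alignments, kv[1]))
    return [rid for rid, _ in ranked[:k]]
```

Clustering the ranked results into bands of comparable similarity scores would then yield the expandable classes the evaluation describes.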