612 research outputs found
Doctor of Philosophy
Dissertation. The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques when searching large bibliographic databases are often overwhelmed by excessive, irrelevant data. Scientists have applied natural language processing technologies to improve retrieval. Text summarization, a natural language processing approach, simplifies bibliographic data while filtering it to address a user's need. Traditional text summarization can necessitate the use of multiple software applications to accommodate diverse processing refinements known as "points-of-view." A new, statistical approach to text summarization can transform this process. Combo, a statistical algorithm composed of three individual metrics, determines which elements within input data are relevant to a user's specified information need, thus enabling a single software application to summarize text for many points-of-view. In this dissertation, I describe this algorithm and the research process used in developing and testing it. The research process comprised four studies. The goal of the first study was to create a conventional schema accommodating a genetic disease etiology point-of-view, and an evaluative reference standard. This was accomplished through simulating the task of secondary genetic database curation. The second study addressed the development and initial evaluation of the algorithm, comparing its performance to the conventional schema using the previously established reference standard, again within the task of secondary genetic database curation. The third and fourth studies evaluated the algorithm's performance in accommodating additional points-of-view in a simulated clinical decision support task.
The third study explored prevention, while the fourth evaluated performance for prevention and drug treatment, comparing results to a conventional treatment schema's output. Both summarization methods identified data that were salient to their tasks. The conventional genetic disease etiology and treatment schemas located salient information for database curation and decision support, respectively. The Combo algorithm located salient genetic disease etiology, treatment, and prevention data for the associated tasks. Dynamic text summarization could potentially serve additional purposes, such as consumer health information delivery, systematic review creation, and primary research. This technology may benefit many user groups.
Genetic graph-based clustering applied to static and streaming data analysis
Unpublished thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: December 2014.
Unsupervised Learning Techniques have been widely used in Data Mining over the
last few years. These techniques try to identify patterns in a dataset blindly. Clustering
is one of the most promising fields in Unsupervised Learning. It consists of
grouping the data by similarity. This field has generated several research works
which have tried to deal with different problems related to the pattern extraction
and data grouping processes. One of the most innovative clustering methodologies
is shape-based or continuity-based clustering, which tries to group data according to
the form they define in the space.
This dissertation focuses on how to apply Genetic Algorithms to continuity-based
clustering problems. Genetic Algorithms have traditionally been used in optimization
problems. They are characterized by an encoding, which represents the solution
space; a population or set of chromosomes, which are the potential solutions; and some
genetic operations, which are used to evolve the solutions in order to find the best
chromosome or solution. The main idea is to take advantage of their potential by generating
new algorithms that can improve the performance of classical clustering
algorithms, and to apply them to static and streaming data.
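The three ingredients named above (an encoding, a population of chromosomes, and genetic operations) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the bit-string encoding, the toy `one_max` fitness, the tournament selection, and every parameter value are assumptions chosen only to keep the example short.

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over bit strings (illustrative assumptions only)."""
    rng = random.Random(seed)
    # Encoding: each chromosome is a list of 0/1 genes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: carry the current best chromosome over unchanged.
        next_population = [max(population, key=fitness)[:]]
        while len(next_population) < pop_size:
            # Selection: best of three randomly drawn chromosomes.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Crossover: single cut point.
            cut = rng.randrange(1, n_genes)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)

def one_max(chromosome):
    """Hypothetical fitness: the number of ones in the chromosome."""
    return sum(chromosome)

best = evolve(one_max, n_genes=16)
```

In a clustering setting the chromosome would instead encode a candidate partition or graph cut, and the fitness would score cluster quality.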
To design these algorithms, this dissertation builds on the Spectral
Clustering algorithm. This algorithm studies the spectrum of a Similarity Graph
in order to define the clusters. The clusters defined by Spectral Clustering usually
respect the data continuity. Using this idea as a starting point, different graph-based
genetic algorithms have been designed to deal with the continuity-based clustering
problem. The different algorithms developed have been divided into three generations:
The first generation is based on genetic graph-based clustering algorithms. In
this generation we combined graph-based clustering and genetic algorithms to
generate a graph topology among the data, in order to find the best way to cut
the graph. This cutting process is used to discriminate the final clusters. The
main idea is to use hybrid algorithms which combine different metrics extracted
from graph theory. In order to evaluate the performance on real-world problems,
these algorithms have also been applied to text summarization.
The second generation is based on multi-objective genetic graph-based clustering
algorithms. This generation introduces the Pareto Front generated by the
different fitness functions used in the genetic search. The Pareto Front is used
to study the solution space and provides more robust and accurate solutions.
During this generation we also used co-evolutionary algorithms to include the
number of clusters in the search space.
Finally, the last generation is focused on large and streaming data analysis.
During this generation the previous algorithms have been adapted to deal with
large data, combining different methodologies such as online clustering and
MapReduce. The main idea is to study their performance compared with other
algorithms.
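The Pareto Front mentioned for the second generation can be sketched as follows. The two-objective scores are made-up values and both objectives are assumed to be maximized; the dissertation's actual fitness functions are not given in this abstract.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(scores):
    """Keep only the score vectors that no other vector dominates."""
    return [s for s in scores
            if not any(dominates(other, s) for other in scores)]

# Hypothetical (objective_1, objective_2) scores for five candidate clusterings.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(candidates)
# front == [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9)]
```

A multi-objective genetic search returns this whole set of trade-off solutions rather than a single best chromosome.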
The dissertation also describes other graph-based bio-inspired algorithms,
in this case Ant Colony Optimization Clustering algorithms, which were
designed during the dissertation in order to extend the range of study to other
bio-inspired areas.
Finally, to evaluate the algorithms of the different generations,
we have compared them with relevant and well-known clustering algorithms using
synthetic and real-world datasets extracted from the literature and the UCI Machine
Learning Repository.
Unsupervised learning techniques have been widely used in data mining in recent years. These techniques try to extract patterns from a dataset blindly. Among them, clustering is one of the most promising fields. It consists of grouping data by similarity. This field has generated several research works that have tried to address different problems related to pattern extraction and data grouping processes. One of the most innovative clustering methodologies groups data by continuity, respecting the shape the data define in the space where they lie. This thesis focuses on how to apply genetic algorithms to continuity-based clustering problems. Genetic algorithms have traditionally been used in optimization problems. They are characterized by an encoding, which represents the solution space; a population or set of chromosomes, which are the potential solutions within that space; and some genetic operations, which are used to evolve the solutions in order to find the best chromosome or solution. The main idea is to exploit the potential of genetic algorithms by generating new algorithms that can improve the performance of classical algorithms applied to both static data and continuous data streams. To design these algorithms, this doctoral thesis uses the Spectral Clustering algorithm as a starting point. This algorithm studies the spectrum of a similarity graph in order to define the groupings or clusters. The groups defined by Spectral Clustering usually respect the continuity of the data. Using this idea, different graph-based genetic algorithms have been designed to address the continuity-based clustering problem.
The different algorithms developed have been divided into three generations. The first generation is based on genetic graph-based clustering algorithms. In this generation, Graph Clustering techniques and genetic algorithms were combined to generate a graph topology among the data, in order to find the best way to cut the graph. This cutting process is used to discriminate the final groups. The main idea is to use hybrid algorithms that combine different metrics extracted from graph theory. To evaluate the behavior of the algorithms on real-world problems, they have been applied to the problem of generating automatic summaries. The second generation is based on multi-objective genetic graph-based clustering algorithms. This generation introduces the Pareto Front, generated by the different fitness functions used in the genetic search. The Pareto Front is used to study the solution space and provide more robust and accurate solutions. During this generation we also used co-evolutionary algorithms to include the number of clusters in the search space. Finally, the last generation focuses on the analysis of large amounts of data and data streams. During this generation the previously mentioned algorithms were adapted to deal with large volumes of data, combining different methodologies such as online clustering and MapReduce. The main idea is to study their performance in comparison with other algorithms. The thesis also includes contributions from other graph-based bio-inspired algorithms, in this case clustering algorithms using ant colony optimization. These algorithms were designed during the development of the thesis to extend the range of study to other bio-inspired settings.
Finally, to evaluate the algorithms of the different generations, they have been compared with well-known clustering algorithms. The performance of these algorithms has been measured using synthetic and real datasets extracted from the literature and from the UCI Machine Learning repository.
Designing large quantum key distribution networks via medoid-based algorithms
The current development of quantum mechanics and its applications poses a threat to modern cryptography as it was conceived. The ability of quantum computers to solve complex mathematical problems, a major computational novelty, is the root of that risk. However, quantum technologies can also prevent this threat by leveraging quantum methods to distribute keys. This field, called Quantum Key Distribution (QKD), is growing, although it still needs a firmer physical foundation before it can become as widespread as the Internet. This work proposes a novel methodology that leverages medoid-based clustering techniques to design quantum key distribution networks on commercial fiber optics systems. Our methodology focuses on the current limitations of these communication systems, their error loss, and how trusted repeaters can enable proper communication with current technology. We adapt our model to current data on a wide territory covering an area of almost 100,000 km², and show that, considering physical limitations of around 45 km with 3.1 error loss, our design can provide service to the whole area. This technique is the first to extend state-of-the-art network design, which focuses on up to 10 nodes, to networks with more than 200 nodes.
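As a rough illustration of medoid-based clustering in this setting, the sketch below groups node coordinates around medoids and checks whether every node lies within a maximum link length of its medoid. It is not the authors' methodology: the alternating assign/update scheme, the naive initialization, the coordinates, and the use of straight-line distance are all assumptions; only the 45 km figure comes from the abstract.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns {medoid_index: [member_indices]}."""
    medoids = list(range(k))  # assumption: start from the first k points
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member that minimizes
        # the total distance to all other members.
        new_medoids = []
        for medoid, members in clusters.items():
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoid)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j])
                                  for j in members)))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters

# Hypothetical node coordinates in km.
nodes = [(0, 0), (10, 0), (0, 10), (100, 100), (110, 100), (100, 110)]
clusters = k_medoids(nodes, k=2)

MAX_LINK_KM = 45  # distance limit quoted in the abstract
covered = all(math.dist(nodes[i], nodes[m]) <= MAX_LINK_KM
              for m, members in clusters.items() for i in members)
```

In this reading, each medoid is a candidate site for a trusted repeater, because a medoid is always one of the existing nodes rather than an arbitrary point in space.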
Towards Personalized and Human-in-the-Loop Document Summarization
The ubiquitous availability of computing devices and the widespread use of
the internet continuously generate a large amount of data. Therefore, the
amount of available information on any given topic is far beyond humans'
capacity to process properly, causing what is known as information
overload. To cope efficiently with large amounts of information and generate
content with significant value to users, we need to identify, merge and
summarise information. Data summaries can help gather related information and
collect it into a shorter format that enables answering complicated questions,
gaining new insight and discovering conceptual boundaries.
This thesis focuses on three main challenges to alleviate information
overload using novel summarisation techniques. It further intends to facilitate
the analysis of documents to support personalised information extraction. This
thesis separates the research issues into four areas, covering (i) feature
engineering in document summarisation, (ii) traditional static and inflexible
summaries, (iii) traditional generic summarisation approaches, and (iv) the
need for reference summaries. We propose novel approaches to tackle these
challenges by (i) enabling automatic intelligent feature engineering, (ii)
enabling flexible and interactive summarisation, and (iii) utilising intelligent and
personalised summarisation approaches. The experimental results demonstrate the
efficiency of the proposed approaches compared to other state-of-the-art
models. We further propose solutions to the information overload problem in
different domains through summarisation, covering network traffic data, health
data and business process data.
Comment: PhD thesis
COMMUNITY DETECTION IN GRAPHS
Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020
Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to explore high-order closeness between nodes by leveraging graph topological structure. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, graph structure can be revealed at a coarser resolution. This can benefit numerous applications such as user shopping recommendation and advertisement in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholar collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard representing a perfect community partition nor a universal method able to detect satisfactory communities for all tasks across various types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them based on graph types, research tasks and methodology frameworks. As academic exploration of community detection has grown rapidly in recent years, I focus particularly on state-of-the-art works published in the last decade, which may leave out some classic models published decades ago. Meanwhile, three subtle community detection tasks are proposed and assessed in this dissertation as well. First, apart from general models which consider only graph structures, personalized community detection considers user need as auxiliary information to guide community detection. The result is fine-grained communities for the nodes that better match user needs and coarser-resolution communities for the remaining, less relevant nodes. Second, graphs often suffer from sparse connectivity. Applying conventional models directly to such graphs may severely distort the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate external graph information as support for community detection on the target graph. Third, graph community structure supports a natural language processing (NLP) task that depicts the intrinsic characteristics of nodes by generating node summarizations via a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy. Existing works on methods, evaluation and applications are all addressed in the literature review.
Second, three novel community detection tasks are demonstrated, and associated models are proposed and evaluated by comparison with state-of-the-art baselines on various datasets. Third, the limitations of current works are pointed out and promising future research tracks are discussed.
Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data
Abstract
Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be 'team science'.
http://deepblue.lib.umich.edu/bitstream/2027.42/134522/1/13742_2016_Article_117.pd