612 research outputs found

    Doctor of Philosophy

    The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers who use traditional information retrieval techniques to search large bibliographic databases are often overwhelmed by excessive, irrelevant data. Scientists have applied natural language processing technologies to improve retrieval. Text summarization, a natural language processing approach, simplifies bibliographic data while filtering it to address a user's need. Traditional text summarization can require multiple software applications to accommodate diverse processing refinements known as "points-of-view." A new, statistical approach to text summarization can transform this process. Combo, a statistical algorithm composed of three individual metrics, determines which elements within the input data are relevant to a user's specified information need, thus enabling a single software application to summarize text for many points-of-view. In this dissertation, I describe this algorithm and the research process used in developing and testing it. The research process comprised four studies. The goal of the first study was to create a conventional schema accommodating a genetic disease etiology point-of-view, and an evaluative reference standard. This was accomplished by simulating the task of secondary genetic database curation. The second study addressed the development and initial evaluation of the algorithm, comparing its performance to the conventional schema using the previously established reference standard, again within the task of secondary genetic database curation. The third and fourth studies evaluated the algorithm's performance in accommodating additional points-of-view in a simulated clinical decision support task. The third study explored prevention, while the fourth evaluated performance for prevention and drug treatment, comparing results to a conventional treatment schema's output. Both summarization methods identified data that were salient to their tasks. The conventional genetic disease etiology and treatment schemas located salient information for database curation and decision support, respectively. The Combo algorithm located salient genetic disease etiology, treatment, and prevention data for the associated tasks. Dynamic text summarization could potentially serve additional purposes, such as consumer health information delivery, systematic review creation, and primary research. This technology may benefit many user groups.

    Genetic graph-based clustering applied to static and streaming data analysis

    Unpublished thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: December 2014.
    Unsupervised learning techniques have been widely used in data mining over the last few years. These techniques try to identify patterns in a dataset blindly. Clustering, one of the most promising fields in unsupervised learning, consists of grouping the data by similarity. This field has generated several research works that have tried to deal with different problems related to the pattern extraction and data grouping processes. One of the most innovative clustering methodologies is shape-based or continuity-based clustering, which tries to group data according to the form they define in the space. This dissertation focuses on how to apply Genetic Algorithms to continuity-based clustering problems. Genetic Algorithms have traditionally been used in optimization problems. They are characterized by an encoding, which represents the solution space; a population, or set of chromosomes, which are the potential solutions; and some genetic operations, which are used to evolve the solutions in order to find the best chromosome or solution. The main idea is to take advantage of their potential, generating new algorithms that can improve the performance of classical clustering algorithms, and to apply them to static and streaming data. To design these algorithms, this dissertation takes the Spectral Clustering algorithm as its starting point. This algorithm studies the spectrum of a Similarity Graph in order to define the clusters. The clusters defined by Spectral Clustering usually respect the data continuity. Using this idea, different graph-based genetic algorithms have been designed to deal with the continuity-based clustering problem. The algorithms developed have been divided into three generations. The first generation is based on genetic graph-based clustering algorithms. In this generation we combined graph-based clustering and genetic algorithms to generate a graph topology among the data, in order to find the best way to cut the graph. This cutting process is used to discriminate the final clusters. The main idea is to use hybrid algorithms that combine different metrics extracted from graph theory. To evaluate their performance on real-world problems, these algorithms have also been applied to text summarization. The second generation is based on multi-objective genetic graph-based clustering algorithms. This generation introduces the Pareto Front generated by the different fitness functions used in the genetic search. The Pareto Front is used to study the solution space and provides more robust and accurate solutions. During this generation we also used co-evolutionary algorithms to include the number of clusters in the search space. Finally, the last generation focuses on large and streaming data analysis. During this generation the previous algorithms have been adapted to deal with large data, combining different methodologies such as online clustering and MapReduce. The main idea is to study their performance compared with other algorithms. The dissertation also includes a description of other graph-based bio-inspired algorithms, in this case Ant Colony Optimization Clustering algorithms, which were designed during the dissertation in order to extend the range of study to other bio-inspired areas. Finally, to evaluate the algorithms of the different generations, we have compared them with relevant and well-known clustering algorithms using synthetic and real-world datasets extracted from the literature and the UCI Machine Learning Repository.
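    The Pareto Front mentioned in the second generation can be illustrated with a minimal sketch. It assumes that each candidate clustering has already been scored by two fitness functions and that both are to be maximized; the function names and the score values are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Tuple[float, ...]]) -> List[int]:
    """Return the indices of the score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (fitness_1, fitness_2) values for five candidate clusterings.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9), (0.1, 0.1)]
front = pareto_front(candidates)
print(front)
```

    A multi-objective genetic search would typically keep the non-dominated candidates as the set of trade-off solutions instead of collapsing the fitness functions into a single score.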

    Designing large quantum key distribution networks via medoid-based algorithms

    The current development of quantum mechanics and its applications poses a threat to modern cryptography as it was originally conceived. The root of that risk is the ability of quantum computers to solve complex mathematical problems. However, quantum technologies can also counter this threat by leveraging quantum methods to distribute keys. This field, called Quantum Key Distribution (QKD), is growing, although it still needs a firmer physical foundation to become as widespread as the Internet. This work proposes a novel methodology that leverages medoid-based clustering techniques to design quantum key distribution networks on commercial fiber optics systems. Our methodology focuses on the current limitations of these communication systems, their error loss, and how trusted repeaters can enable proper communication with current technology. We adapt our model to current data on a wide territory covering almost 100,000 km², and show that, given physical limitations of around 45 km with 3.1 error loss, our design can provide service to the whole area. This technique is the first to extend state-of-the-art network design, which focuses on up to 10 nodes, to networks with more than 200 nodes.
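    As an illustration of medoid-based clustering in general, the sketch below alternates between assigning points to their nearest medoid and re-selecting each cluster's medoid. The abstract does not say which variant the authors use, so the Euclidean distance, the deterministic initialization, the function names, and the toy coordinates are all assumptions. The final check mirrors the idea of a maximum usable link length, such as the roughly 45 km limit quoted above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def assign(points: List[Point], medoids: List[int]) -> List[int]:
    """Label each point with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points: List[Point], k: int, max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternating k-medoids; returns (medoid indices, cluster label per point)."""
    medoids = list(range(k))  # assumption: the first k points seed the medoids
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda i: sum(dist(points[i], points[j]) for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


def max_link_length(points: List[Point], medoids: List[int], labels: List[int]) -> float:
    """Longest distance between any point and the medoid that serves it."""
    return max(dist(p, points[medoids[lab]]) for p, lab in zip(points, labels))


# Toy node coordinates (hypothetical, in km): two well-separated groups.
nodes = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
medoids, labels = k_medoids(nodes, k=2)
print(medoids, labels, max_link_length(nodes, medoids, labels) <= 45.0)
```

    Unlike k-means centroids, medoids are always actual data points, which matters when a cluster center has to be placed at an existing site.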

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic is far beyond what humans can properly process, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results prove the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.
    Comment: PhD thesis

    COMMUNITY DETECTION IN GRAPHS

    Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020
    Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to explore high-order closeness between nodes by leveraging the topological structure of the graph. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, graph structure can be revealed at a coarser resolution. This can benefit numerous applications, such as shopping recommendation and advertising in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholar collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard that represents a perfect community partition nor a universal method that can detect satisfactory communities for all tasks on all types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them based on graph types, research tasks and methodology frameworks. Because academic exploration of community detection has grown rapidly in recent years, I focus particularly on state-of-the-art works published in the last decade, which may leave out some classic models published decades ago. Three subtle community detection tasks are also proposed and assessed in this dissertation. First, unlike general models that consider only graph structure, personalized community detection uses user need as auxiliary information to guide community detection. The result is fine-grained communities for the nodes that best match user needs and coarser-resolution communities for the remaining, less relevant nodes. Second, graphs often suffer from sparse connectivity. Applying conventional models directly to such graphs may severely degrade the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate external graph information as support for community detection on the target graph. Third, graph community structure supports a natural language processing (NLP) task that depicts the intrinsic characteristics of nodes by generating node summaries via a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy. Existing works on methods, evaluation and applications are all addressed in the literature review. Second, three novel community detection tasks are presented, and associated models are proposed and evaluated against state-of-the-art baselines on various datasets. Third, the limitations of current works are pointed out and promising future research directions are discussed.

    Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data

    Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be ‘team science’.
    http://deepblue.lib.umich.edu/bitstream/2027.42/134522/1/13742_2016_Article_117.pd