
    Investigating Randomised Sphere Covers in Supervised Learning

    In this thesis, we thoroughly investigate a simple Instance Based Learning (IBL) classifier known as the sphere cover. We propose a simple Randomized Sphere Cover Classifier (αRSC) and use several datasets to evaluate its classification performance. In addition, we analyse the generalization error of the proposed classifier using bias/variance decomposition. A sphere cover classifier can be described within the sample compression scheme, which attributes high generalization performance to data compression. We investigate the compression capacity of αRSC using a sample compression bound. The compression scheme prompted us to search for new compressibility methods for αRSC; to this end, we used a Gaussian kernel to investigate further data compression.
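    The sphere cover idea is simple enough to sketch. The following Python fragment is a minimal, hypothetical illustration of a randomised sphere cover classifier (it is not the αRSC implementation studied in the thesis, and the function names are made up for this sketch): class-pure spheres are grown around randomly chosen instances, each radius bounded by the distance to the nearest instance of a different class, and a query is labelled by the closest covering sphere.

```python
import numpy as np

def build_sphere_cover(X, y, rng=None):
    """Greedily cover the training set with class-pure spheres.

    Each sphere is centred on a randomly chosen uncovered instance; its radius
    is the distance to the nearest instance of a different class (the nearest
    'enemy'), shrunk slightly so every sphere contains points of one class only.
    """
    rng = np.random.default_rng(rng)
    uncovered = np.arange(len(X))
    spheres = []  # list of (centre, radius, label)
    while len(uncovered) > 0:
        i = rng.choice(uncovered)
        enemies = X[y != y[i]]
        radius = np.linalg.norm(enemies - X[i], axis=1).min() * 0.999
        spheres.append((X[i], radius, y[i]))
        dists = np.linalg.norm(X[uncovered] - X[i], axis=1)
        uncovered = uncovered[dists > radius]  # drop newly covered points
    return spheres

def predict(spheres, x):
    """Label a query by the sphere it is inside of, or closest to."""
    scores = [np.linalg.norm(x - c) - r for c, r, _ in spheres]
    return spheres[int(np.argmin(scores))][2]

# Tiny usage example on synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
cover = build_sphere_cover(X, y, rng=0)
print(len(cover), predict(cover, np.array([3.5, 3.5])))
```

    The random choice of sphere centres is the source of variability that a bias/variance decomposition of such a classifier would quantify.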

    Finding Attribute-Aware Similar Region for Data Analysis


    Searching and mining in enriched geo-spatial data

    The emergence of new data collection mechanisms in geo-spatial applications, paired with a heightened tendency of users to volunteer information, provides an ever-increasing flow of data that is high in volume, complex in nature, and often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Data bearing additional information from multiple sources, such as probability distributions, text or numerical attributes, social context, or multimedia content, can be called multi-enriched. Searching and mining this abundance of information poses many challenges if the data's full potential is to be realised. This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in, or been accepted to, renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks, where traditional methods optimise results based on absolute and often singular metrics, e.g., the shortest path by distance or the best trade-off between distance and travel time. Integrating additional aspects such as qualitative or social data, by enriching the data model with knowledge derived from sources like those mentioned above, allows for queries that fit a broader scope of needs or preferences. This thesis presents two approaches to incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences, which is subsequently matched with locations represented in a road network and integrated into its components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data. In a second case, a framework is described for resource distribution with reappearance in road networks to serve one or more clients, resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources; applications include finding parking spots. Social media trends are an emerging research area giving insight into user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the topic's average appearance frequency. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trends into archetypes and to predict their future dissemination. Processing and querying uncertain data is particularly demanding given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available, and queries do not easily scale to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model in order to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be captured by a deterministic representation. This thesis presents techniques for clustering and querying uncertain data that take the additional information from uncertainty models into account while preserving scalability through a sampling-based approach, whereas previous approaches could provide only one of the two. The given solutions enable the application of various existing clustering techniques and query types within a framework that manages the uncertainty.
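    To make the sampling-based idea concrete, the following hypothetical sketch (the names, the choice of k-means, and all parameters are illustrative, not taken from the thesis) clusters uncertain objects by drawing possible worlds from their uncertainty models, clustering each world with an off-the-shelf algorithm, and aggregating how often pairs of objects land in the same cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_clustering_probabilities(samples, k=2, n_worlds=50, seed=0):
    """Cluster uncertain objects via possible-world sampling.

    samples: array of shape (n_objects, n_samples, dim); each row holds draws
    from one object's uncertainty model (e.g. a per-object error distribution).
    Returns an (n_objects, n_objects) matrix whose entry [i, j] estimates the
    probability that objects i and j end up in the same cluster.
    """
    rng = np.random.default_rng(seed)
    n_obj, n_samp, _ = samples.shape
    counts = np.zeros((n_obj, n_obj))
    for w in range(n_worlds):
        # One possible world: pick one sample per uncertain object.
        idx = rng.integers(0, n_samp, size=n_obj)
        world = samples[np.arange(n_obj), idx]
        labels = KMeans(n_clusters=k, n_init=10, random_state=w).fit_predict(world)
        counts += (labels[:, None] == labels[None, :])
    return counts / n_worlds

# Usage: six uncertain points (two noisy groups), 30 samples each.
rng = np.random.default_rng(1)
centres = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
samples = centres[:, None, :] + rng.normal(0, 0.3, (6, 30, 2))
print(np.round(co_clustering_probabilities(samples), 2))
```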

    Maximal Closed Set and Half-Space Separations in Finite Closure Systems

    Several problems of artificial intelligence, such as predictive learning, formal concept analysis or inductive logic programming, can be viewed as a special case of half-space separation in abstract closure systems over finite ground sets. For the typical scenario that the closure system is given via a closure operator, we show that the half-space separation problem is NP-complete. As a first approach to overcome this negative result, we relax the problem to maximal closed set separation, give a greedy algorithm solving this problem with a linear number of closure operator calls, and show that this bound is sharp. For a second direction, we consider Kakutani closure systems and prove that they are algorithmically characterized by the greedy algorithm. As a first special case of the general problem setting, we consider Kakutani closure systems over graphs, generalize a fundamental characterization result based on the Pasch axiom to graph-structured partitioning of finite sets, and give a sufficient condition for this kind of closure system in terms of graph minors. For a second case, we then focus on closure systems over finite lattices, give an improved adaptation of the greedy algorithm for this special case, and present two applications concerning formal concept and subsumption lattices. We also report some experimental results to demonstrate the practical usefulness of our algorithm. Comment: An early version of this paper was presented at ECML/PKDD 2019 and has appeared in the Lecture Notes in Computer Science, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019.
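    The greedy idea is easy to illustrate. The sketch below shows one plausible reading of such a greedy separation in Python (it is not claimed to be the paper's exact algorithm, and the toy interval closure system is invented for the example): starting from two sets whose closures are disjoint, each remaining element is added to whichever side keeps the closures disjoint, at the cost of at most two closure operator calls per element.

```python
def greedy_separation(ground_set, closure, A, B):
    """Greedily extend A and B to disjoint closed sets.

    closure: function mapping a set of elements to its closure (a frozenset).
    A, B: seed sets; returns None if their closures already intersect.
    """
    cA, cB = closure(A), closure(B)
    if cA & cB:
        return None  # seed closures intersect: not separable
    for e in ground_set:
        if e in cA or e in cB:
            continue
        c = closure(cA | {e})
        if not (c & cB):
            cA = c                      # e fits on the A side
            continue
        c = closure(cB | {e})
        if not (c & cA):
            cB = c                      # e fits on the B side
        # otherwise e is left out: the element cannot be added to either
        # side without making the two closed sets intersect.
    return cA, cB

# Toy closure system: intervals on the finite chain {0, ..., 9}.
def interval_closure(S):
    S = set(S)
    return frozenset(range(min(S), max(S) + 1)) if S else frozenset()

print(greedy_separation(range(10), interval_closure, {1}, {7}))
```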

    Partial Replica Location And Selection For Spatial Datasets

    As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial spatial replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica location and selection problems. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect with a query region. Integrating a relational database, a spatial data structure and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we have improved performance by designing an R-tree structure in the backend database, and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas. The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is on average at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
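    As background for the space-filling-curve step, the following sketch shows how replica bounding boxes could be ordered by the Morton (Z-order) code of their centroids before R-tree construction, which is what improves spatial locality; the grid resolution and helper names are illustrative and not taken from the SRLS implementation.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of two grid coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_order(boxes, bits=16):
    """Sort bounding boxes (xmin, ymin, xmax, ymax) lying in [0,1]^2 by the
    Morton code of their centroids, grouping spatially close replicas."""
    scale = (1 << bits) - 1
    def key(b):
        cx = int((b[0] + b[2]) / 2 * scale)
        cy = int((b[1] + b[3]) / 2 * scale)
        return interleave_bits(cx, cy, bits)
    return sorted(boxes, key=key)

# Usage: three replica extents in a unit square.
boxes = [(0.8, 0.8, 0.9, 0.9), (0.1, 0.1, 0.2, 0.2), (0.1, 0.6, 0.3, 0.8)]
print(morton_order(boxes))
```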

    A Pre-Programming Approach to Algorithmic Thinking in High School Mathematics

    Given the impact of computers and computing on almost every aspect of society, the ability to develop, analyze, and implement algorithms is gaining more focus. Algorithms are increasingly important in theoretical mathematics, in applications of mathematics, in computer science, as well as in many areas outside of mathematics. In high school, however, algorithms are usually restricted to computer science courses, and as a result the important relationship between mathematics and computer science is often overlooked (Henderson, 1997). The mathematical ideas behind the design, construction and analysis of algorithms are important for students' mathematical education. In addition, exploring algorithms can help students see mathematics as a meaningful and creative subject. This study provides a review of the history of algorithms and algorithmic complexity, as well as a technical monograph that illustrates the mathematical aspects of algorithmic complexity in a form that is accessible to mathematics instructors at the high school level. The historical component of this study is broken down into two parts. The first part covers the history of algorithms, with an emphasis on how the concept has evolved from 3000 BC through the Middle Ages to the present day. The second part focuses on the history of algorithmic complexity, dating back to the text of Ibn al-Majdi, a fourteenth-century Egyptian astronomer, and continuing through the 20th century. In particular, it highlights the contributions of a group of mathematicians including Alan Turing, Michael Rabin, Juris Hartmanis, Richard Stearns and Alan Cobham, whose work in computability theory and complexity measures was critical to the development of the field of algorithmic complexity. The technical monograph which follows describes how the complexity of an algorithm can be measured and analyzes different types of algorithms, including divide-and-conquer algorithms, search and sort algorithms, greedy algorithms, algorithms for matching, and geometric algorithms. The complexity of these algorithms is analyzed without the use of a programming language, in order to focus on the mathematical aspects of the algorithms and to provide knowledge and skills of value that are independent of specific computers or programming languages. In addition, the study assesses the appropriateness of these topics for use by high school teachers by submitting the monograph for independent review to a panel of experts. The panel, which consists of mathematics and computer science faculty at high schools and colleges around the United States, found the material to be interesting and felt that using a pre-programming approach to teaching algorithmic complexity has a great deal of merit. There was some concern, however, that portions of the material may be too advanced for high school mathematics instructors. Additionally, they thought that the material would only appeal to the strongest students. As per the reviewers' suggestions, the monograph was revised to its current form.

    Rigorous cubical approximation and persistent homology of continuous functions

    The interaction between discrete and continuous mathematics lies at the heart of many fundamental problems in applied mathematics and computational sciences. In this paper we discuss the problem of discretizing vector-valued functions defined on finite-dimensional Euclidean spaces in such a way that the discretization error is bounded by a pre-specified small constant. While the approximation scheme has a number of potential applications, we consider its usefulness in the context of computational homology. More precisely, we demonstrate that our approximation procedure can be used to rigorously compute the persistent homology of the original continuous function on a compact domain, up to small explicitly known and verified errors. In contrast to other work in this area, our approach requires minimal smoothness assumptions on the underlying function.
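    As a simplified illustration of the kind of rigorous discretization involved (using a global Lipschitz bound, which is a stronger assumption than the minimal smoothness the paper requires, and with all names invented for the sketch), one can cover the domain with cubes small enough that a verified interval enclosure of f holds on each cube:

```python
import itertools
import math

def cubical_enclosures(f, L, lo, hi, eps):
    """Cover the box [lo, hi]^d with cubes carrying rigorous enclosures of f.

    Assumes |f(x) - f(y)| <= L * max_i |x_i - y_i| (sup-norm Lipschitz bound).
    On a cube of side s centred at c, f stays within L*s/2 of f(c), so choosing
    s <= eps / L makes [f(c) - eps/2, f(c) + eps/2] a verified enclosure.
    """
    s = eps / L
    counts = [max(1, int(-(-(h - l) // s))) for l, h in zip(lo, hi)]  # ceiling
    enclosures = []
    for idx in itertools.product(*[range(n) for n in counts]):
        corner = [l + i * s for l, i in zip(lo, idx)]
        centre = [x + s / 2 for x in corner]
        fc = f(centre)
        enclosures.append((tuple(corner), s, (fc - eps / 2, fc + eps / 2)))
    return enclosures

# Usage: f(x, y) = x + sin(y) on [0,1]^2, with sup-norm Lipschitz constant <= 2.
cells = cubical_enclosures(lambda p: p[0] + math.sin(p[1]),
                           L=2.0, lo=(0, 0), hi=(1, 1), eps=0.5)
print(len(cells), cells[0])
```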

    Problems Related to Extensions of Polytopes (Polütoopide laienditega seotud ülesanded)

    Linear programming (LP) is a method for finding an optimal solution, such as a minimum cost or a maximum profit, for a linear objective function subject to linear constraints. Many everyday challenges, such as finding a minimum price or a maximum revenue, can be cast in linear programming form. Interior point methods perform well both in theory and in practice, with running time polynomial in the number of linear constraints; consequently, an exponential number of inequalities leads to exponential running time when solving the linear program. A polytope, say P, represents the space of feasible solutions, and each required inequality corresponds to a facet of P. One idea for decreasing the running time is to lift the polytope P to a higher dimension with the goal of decreasing the number of facets. The polytope in the higher dimension, say Q, is an extension of the original polytope P, and the minimum number of facets that Q can have is the extension complexity of P; the optimal solution over Q then yields the optimal solution over P. The natural question that arises is: when is it possible to have an extension with a polynomial number of inequalities? Nondeterministic communication complexity is a powerful tool for proving lower bounds on the extension complexity of a polytope. Finding a suitable communication problem corresponding to a polytope P and proving a linear lower bound on its nondeterministic communication complexity rules out any attempt to find a sub-exponential-size extension Q of P. In this thesis, we focus on random Boolean functions f with density p = p(n). We give tight upper and lower bounds for the nondeterministic communication complexity and related parameters. We also study the rank of the fooling set matrix, which provides an important lower bound for nondeterministic communication complexity. Finally, we investigate the graph of the pedigree polytope.
    The pedigree polytope is an extension of the TSP polytope (the travelling salesman problem being the most extensively studied problem in combinatorial optimization) with a nice combinatorial structure. The graph of a polytope can be regarded as an abstract graph and reveals meaningful information about the properties of the polytope.
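    For context, the standard chain of bounds behind this approach can be stated as follows; these are textbook facts (Yannakakis's factorization theorem and the fooling set bound for rectangle covers), not results specific to this thesis. Writing S for the slack matrix of P, supp(S) for its support, rc for the rectangle cover number (whose logarithm is, up to rounding, the nondeterministic communication complexity), and F for a fooling set of supp(S):

```latex
% Textbook lower-bound chain for extension complexity (stated for context):
%   xc(P)       = rank_+(S)      -- Yannakakis' factorization theorem
%   rank_+(S)  >= rc(supp(S))    -- a nonnegative factorization induces a rectangle cover
%   rc(supp(S)) >= |F|           -- no monochromatic rectangle holds two fooling-set entries
\[
  \mathrm{xc}(P) \;=\; \operatorname{rank}_+(S)
  \;\ge\; \mathrm{rc}\bigl(\operatorname{supp}(S)\bigr)
  \;\ge\; |F| .
\]
```

    A lower bound on the nondeterministic communication complexity of a suitable problem therefore translates into a lower bound on rc and hence on xc(P); a linear (in n) communication lower bound rules out sub-exponential extensions.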

    Probabilistic framework for image understanding applications using Bayesian Networks

    Machine learning algorithms have been successfully utilized in various systems and devices. They have the ability to improve the usability and quality of such systems in terms of intelligent user interfaces, fast performance, and, more importantly, high accuracy. In this research, machine learning techniques are used in the field of image understanding, a common research area between image analysis and computer vision that involves a higher processing level of a target image in order to make sense of the scene captured in it. A general probabilistic framework for image understanding is presented, covering (i) the collection of images to generate a comprehensive and valid database, (ii) the generation of an unbiased ground truth for the aforesaid database, (iii) the selection of classification features and the elimination of redundant ones, and (iv) the use of such information to test a new sample set. Two research projects have been developed as examples of the general image understanding framework: identification of region(s) of interest, and image segmentation evaluation. These techniques, in addition to others, are combined in an object-oriented rendering system for printing applications. The discussion included in this doctoral dissertation explores the means for developing such a system from an image understanding/processing aspect. It is worth noting that this work does not aim to develop a printing system; it only proposes adding some essential features to current printing pipelines to achieve better visual quality when printing images and photos. Hence, we assume that image regions have been successfully extracted from the printed document. These images are used as input to the proposed object-oriented rendering algorithm, where methodologies for color image segmentation, region-of-interest identification and semantic feature extraction are employed. Probabilistic approaches based on Bayesian statistics have been utilized to develop the proposed image understanding techniques.
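    As a toy illustration of the Bayesian flavour of such a framework (the features, probabilities and prior below are made up for the example, and a naive independence assumption replaces the Bayesian networks used in the dissertation), a region's posterior probability of being a region of interest can be computed from a few discrete semantic features via Bayes' rule:

```python
# Hypothetical binary features for a segmented region: does it contain skin
# tones, is it near the image centre, is it in sharp focus?
# Each entry maps a feature value to (P(value | ROI), P(value | not ROI)).
likelihoods = {
    "skin":   {True: (0.70, 0.20), False: (0.30, 0.80)},
    "centre": {True: (0.60, 0.35), False: (0.40, 0.65)},
    "sharp":  {True: (0.80, 0.40), False: (0.20, 0.60)},
}
prior_roi = 0.3  # illustrative prior probability that a region is of interest

def posterior_roi(features, prior=prior_roi):
    """Naive-Bayes posterior P(ROI | features) over binary semantic features."""
    p_roi, p_not = prior, 1.0 - prior
    for name, value in features.items():
        l_roi, l_not = likelihoods[name][value]
        p_roi *= l_roi
        p_not *= l_not
    return p_roi / (p_roi + p_not)

# A sharp, centred region containing skin tones is very likely the ROI.
print(round(posterior_roi({"skin": True, "centre": True, "sharp": True}), 3))
print(round(posterior_roi({"skin": False, "centre": False, "sharp": False}), 3))
```

    In the dissertation's setting, a Bayesian network would additionally model dependencies between such features rather than assuming independence.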