
    Contextual Information Retrieval based on Algorithmic Information Theory and Statistical Outlier Detection

    The main contribution of this paper is the design of an Information Retrieval (IR) technique based on Algorithmic Information Theory (using the Normalized Compression Distance, NCD), statistical techniques (outlier detection), and a novel organization of the database structure. The paper shows how these can be integrated to retrieve information from generic databases using long (text-based) queries. Two important problems are analyzed in the paper. On the one hand, how to detect "false positives": cases in which the distance between documents is very low yet there is no actual similarity. On the other hand, we propose a way to structure a document database whose similarity-distance estimation depends on the length of the selected text. Finally, the experimental evaluations carried out to study these problems are presented. Comment: Submitted to the 2008 IEEE Information Theory Workshop (6 pages, 6 figures).
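    The retrieval technique rests on the Normalized Compression Distance; as a concrete point of reference, the following minimal sketch computes NCD with zlib as the compressor. The compressor choice and the helper names are assumptions for illustration, not the authors' implementation.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size of `data` after zlib compression at maximum level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar documents compress well together and score near 0;
# unrelated documents score near 1.
doc_a = b"the quick brown fox jumps over the lazy dog " * 20
doc_b = b"the quick brown fox jumped over the lazy dog " * 20
doc_c = b"suspendisse potenti cras venenatis euismod malesuada " * 20
print(f"similar:   {ncd(doc_a, doc_b):.3f}")
print(f"unrelated: {ncd(doc_a, doc_c):.3f}")
```

    Very low NCD scores between documents that are not actually similar are exactly the "false positives" the paper sets out to detect.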

    Reducing the loss of information through annealing text distortion

    Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F., "Reducing the Loss of Information through Annealing Text Distortion", IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.
    Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps compression-based text clustering and improves its accuracy. In fact, we show how non-distorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent across different data sets and different compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting. This work was supported by the Spanish Ministry of Education and Science under the TIN2010-19872 and TIN2010-19607 projects.
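    To make the distortion idea concrete, here is a hedged sketch of one annealing-style schedule: words are masked with asterisks in order of decreasing corpus frequency, so a document's complexity is reduced gradually. The frequency-based ordering and the masking scheme are illustrative assumptions, not necessarily the exact distortion methods the paper evaluates.

```python
import re
from collections import Counter

def distort(text: str, corpus_freq: Counter, level: float) -> str:
    """Mask the `level` fraction of distinct words, most frequent first,
    preserving word lengths so the document structure is kept."""
    ranked = [word for word, _ in corpus_freq.most_common()]
    masked = set(ranked[: int(level * len(ranked))])
    return re.sub(
        r"[a-z]+",
        lambda m: "*" * len(m.group()) if m.group() in masked else m.group(),
        text.lower(),
    )

corpus = "the cat sat on the mat while the dog slept on the rug"
freq = Counter(re.findall(r"[a-z]+", corpus.lower()))
for level in (0.0, 0.3, 0.6):
    print(f"{level:.1f}: {distort(corpus, freq, level)}")
```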

    Governing others: Anomaly and the algorithmic subject of security


    Computing Competencies for Undergraduate Data Science Curricula: ACM Data Science Task Force

    At the August 2017 ACM Education Council meeting, a task force was formed to explore a process to add to the broad, interdisciplinary conversation on data science, with an articulation of the role of computing discipline-specific contributions to this emerging field. Specifically, the task force would seek to define what the computing/computational contributions are to this new field, and provide guidance on computing-specific competencies in data science for departments offering such programs of study at the undergraduate level. There are many stakeholders in the discussion of data science – these include colleges and universities that (hope to) offer data science programs, employers who hope to hire a workforce with knowledge and experience in data science, as well as individuals and professional societies representing the fields of computing, statistics, machine learning, computational biology, computational social sciences, digital humanities, and others. There is a shared desire to form a broad interdisciplinary definition of data science and to develop curriculum guidance for degree programs in data science. This volume builds upon the important work of other groups who have published guidelines for data science education. There is a need to acknowledge the definition and description of the individual contributions to this interdisciplinary field. For instance, those interested in the business context for these concepts generally use the term “analytics”; in some cases, the abbreviation DSA appears, meaning Data Science and Analytics. This volume is the third draft articulation of computing-focused competencies for data science. It recognizes the inherent interdisciplinarity of data science and situates computing-specific competencies within the broader interdisciplinary space

    A Flexible Outlier Detector Based on a Topology Given by Graph Communities

    Outlier detection is essential for the optimal performance of machine learning methods and statistical predictive models. Detecting outliers is especially determinant in small-sample-size, unbalanced problems, since in such settings outliers become highly influential and significantly bias models. These experimental settings are usual in medical applications, like the diagnosis of rare pathologies, the outcome of experimental personalized treatments, or pandemic emergencies. In contrast to population-based methods, neighborhood-based local approaches compute an outlier score from the neighbors of each sample; they are simple, flexible methods that have the potential to perform well in small-sample-size, unbalanced problems. A main concern of local approaches is the impact that the computation of each sample's neighborhood has on the method's performance. Most approaches use a distance in the feature space to define a single neighborhood, which requires careful selection of several parameters, like the number of neighbors. This work presents a local approach based on a local measure of the heterogeneity of sample labels in the feature space, considered as a topological manifold. The topology is computed using the communities of a weighted graph codifying mutual nearest neighbors in the feature space. This way, we provide a set of multiple neighborhoods able to describe the structure of complex spaces without parameter fine-tuning. Extensive experiments on real-world and synthetic data sets show that our approach outperforms both local and global strategies in multi- and single-view settings.
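    A minimal sketch of the general pipeline described above, assuming an unweighted mutual-kNN graph, greedy modularity communities, and a simple label-heterogeneity score; the paper's method uses a weighted graph and its own scoring, so every concrete choice here is an illustrative stand-in.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors
from networkx.algorithms.community import greedy_modularity_communities

def community_outlier_scores(X: np.ndarray, y: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each sample by the label heterogeneity of its graph community."""
    # kNN indices; column 0 is the point itself, so request k + 1 neighbours.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    knn = [set(row[1:]) for row in idx]

    # Mutual-kNN graph: edge (i, j) only if each point is in the other's kNN.
    g = nx.Graph()
    g.add_nodes_from(range(len(X)))
    g.add_edges_from(
        (i, j) for i, ni in enumerate(knn) for j in ni if i in knn[j]
    )

    # Graph communities play the role of the multiple neighbourhoods.
    scores = np.zeros(len(X))
    for community in greedy_modularity_communities(g):
        members = np.fromiter(community, dtype=int)
        labels = y[members]
        for i in members:
            # Fraction of the community carrying a different label than i.
            scores[i] = np.mean(labels != y[i])
    return scores
```

    Because the communities emerge from the graph structure itself, no single neighborhood size has to be tuned per data set, which is the appeal of this family of methods in small-sample settings.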

    Model-driven and Data-driven Approaches for some Object Recognition Problems

    Recognizing objects from images and videos has been a long-standing problem in computer vision. The recent surge in the prevalence of visual cameras has given rise to two main challenges: (i) it is important to understand different sources of object variations in more unconstrained scenarios, and (ii) rather than describing an object in isolation, efficient learning methods for modeling object-scene 'contextual' relations are required to resolve visual ambiguities. This dissertation addresses some aspects of these challenges, and consists of two parts.

    The first part of the work focuses on obtaining object descriptors that are largely preserved across certain sources of variations, by utilizing models for image formation and local image features. Given a single instance of an object, we investigate the following three problems. (i) Representing a 2D projection of a 3D non-planar shape invariant to articulations, when there are no self-occlusions. We propose an articulation-invariant distance that is preserved across piece-wise affine transformations of a non-rigid object's 'parts' under a weak-perspective imaging model, and then obtain a shape-context-like descriptor to perform recognition. (ii) Understanding the space of 'arbitrary' blurred images of an object, by representing an unknown blur kernel of a known maximum size using a complete set of orthonormal basis functions spanning that space, and showing that the subspaces resulting from convolving a clean object and its blurred versions with these basis functions are equal under some assumptions. We then view the invariant subspaces as points on a Grassmann manifold, and use statistical tools that account for the underlying non-Euclidean nature of the space of these invariants to perform recognition across blur. (iii) Analyzing the robustness of local feature descriptors to different illumination conditions. We perform an empirical study of these descriptors for the problem of face recognition under lighting change, and show that the direction of the image gradient largely preserves object properties across varying lighting conditions.

    The second part of the dissertation utilizes the information conveyed by a large quantity of data to learn contextual information shared by an object (or an entity) with its surroundings. (i) We first consider a supervised two-class problem of detecting lane markings from road video sequences, where we learn relevant feature-level contextual information through a machine learning algorithm based on boosting. We then focus on unsupervised object classification scenarios where (ii) we perform clustering using maximum-margin principles, by deriving some basic properties on the affinity of a pair of points belonging to the same cluster using the information conveyed by all points in the system, and (iii) we consider correspondence-free adaptation of statistical classifiers across domain-shifting transformations, by generating meaningful 'intermediate domains' that incrementally convey potential information about the domain change.
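    As a pointer to the geometry used in problem (ii) of the first part, the sketch below compares two linear subspaces as points on a Grassmann manifold via principal angles. This is standard Grassmannian machinery under the stated setup, not the dissertation's full blur-robust recognition pipeline.

```python
import numpy as np

def grassmann_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Geodesic distance between span(A) and span(B); columns span each subspace."""
    qa, _ = np.linalg.qr(A)              # orthonormal basis for span(A)
    qb, _ = np.linalg.qr(B)              # orthonormal basis for span(B)
    sigma = np.linalg.svd(qa.T @ qb, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))   # principal angles
    return float(np.linalg.norm(theta))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
B = A @ rng.standard_normal((3, 3))      # same subspace, different basis
print(grassmann_distance(A, B))          # ~0: identical subspaces
print(grassmann_distance(A, rng.standard_normal((50, 3))))  # clearly > 0
```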

    A comprehensive insight towards Pre-processing Methodologies applied on GPS data

    Reliability in the utilization of Global Positioning System (GPS) data demands a high degree of accuracy with respect to the time and positional information required by the user. However, various extrinsic and intrinsic parameters disrupt data transmission from the GPS satellite to the GPS receiver, which calls the trustworthiness of such data into question. Therefore, this manuscript offers a comprehensive insight into the data-preprocessing methodologies evolved and adopted by present-day researchers. The discussion covers standard methods of data cleaning as well as a diverse set of existing research-based approaches. The review finds that, despite the good number of works carried out to address the problem of data cleaning, there are critical loopholes in almost all the existing studies. The paper extracts open research problems and offers evidential insight through use cases, showing that there is still a critical need to investigate data cleaning methods.
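    As a concrete example of one standard cleaning step that such reviews cover, the sketch below drops GPS fixes whose implied speed is physically implausible. The haversine distance is standard; the speed threshold and the (time, lat, lon) record layout are illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS-84 fixes."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))

def drop_speed_outliers(track, max_speed=50.0):
    """Keep (t, lat, lon) fixes whose speed from the last kept fix stays
    below max_speed (m/s); time t is in seconds."""
    kept = [track[0]]
    for t, lat, lon in track[1:]:
        t0, lat0, lon0 = kept[-1]
        if t > t0 and haversine_m(lat0, lon0, lat, lon) / (t - t0) <= max_speed:
            kept.append((t, lat, lon))
    return kept

track = [(0, 48.1374, 11.5755), (10, 48.1376, 11.5759),
         (20, 48.9000, 11.5760),   # implausible jump: ~85 km in 10 s
         (30, 48.1380, 11.5765)]
print(drop_speed_outliers(track))  # the jump at t=20 is discarded
```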

    Parsing consumption preferences of music streaming audiences

    As demands for insights on music streaming listeners continue to grow, scientists and industry analysts face the challenge of comprehending a changed consumption behavior, which demands a renewed approach to listener typologies. This study aims to determine how audience segmentation can be performed in a time-relevant and replicable manner. Thus, it interrogates which parameters best serve as indicators of preferences, to ultimately assist in delimiting listener segments. Accordingly, the primary objective of this research is to develop a revised typology that classifies music streaming listeners in light of the progressive phenomenology of music listening. The hypothesis assumes that this can be achieved by positioning listeners, rather than products, at the center of streaming analysis and by supplementing sales-centered with user-centered metrics. The empirical research of this paper was based on grounded theory, enriched by analytical case studies. For this purpose, behavioral and psychological research results were interconnected with market analysis and streaming-platform usage data. Analysis of the results demonstrates that a concatenation of multi-dimensional data streams facilitates the derivation of a typology that is applicable to varying audience pools. The findings indicate that, for the delimitation of listener types, the listening motivation and listening context are essential key constituents. Since these variables demand insights that reach beyond existing metrics, descriptive data points relating to the listening process are subjoined. Ultimately, parameter indexation results in listener profiles that offer novel access points for investigations, making imperceptible, interdisciplinary correlations tangible. The framework of the typology can be consulted in analytical and creational processes. In this respect, the results of the derived analytical approach contribute to better determining and, ultimately, satisfying listener preferences.

    08421 Abstracts Collection -- Uncertainty Management in Information Systems

    From October 12 to 17, 2008, the Dagstuhl Seminar 08421 "Uncertainty Management in Information Systems" was held at Schloss Dagstuhl – Leibniz Center for Informatics. The abstracts of the plenary and session talks given during the seminar, as well as those of the demos shown, are put together in this paper.