226 research outputs found

    Bayesian Bernoulli Mixture Regression Model for Bidikmisi Scholarship Classification

    Get PDF
    Bidikmisi scholarship grantees are determined based on criteria related to the socioeconomic conditions of the parent of the scholarship grantee. Decision process of Bidikmisi acceptance is not easy to do, since there are sufficient big data of prospective applicants and variables of varied criteria. Based on these problems, a new approach is proposed to determine Bidikmisi grantees by using the Bayesian Bernoulli mixture regression model. The modeling procedure is performed by compiling the accepted and unaccepted cluster of applicants which are estimated for each cluster by the Bernoulli mixture regression model. The model parameter estimation process is done by building an algorithm based on Bayesian Markov Chain Monte Carlo (MCMC) method. The accuracy of acceptance process through Bayesian Bernoulli mixture regression model is measured by determining acceptance classification percentage of model which is compared with acceptance classification percentage of  the dummy regression model and the polytomous regression model. The comparative results show that Bayesian Bernoulli mixture regression model approach gives higher percentage of acceptance classification accuracy than dummy regression model and polytomous regression mode

    Statistical Learning Approaches to Information Filtering

    Get PDF
    Enabling computer systems to understand human thinking or behaviors has ever been an exciting challenge to computer scientists. In recent years one such a topic, information filtering, emerges to help users find desired information items (e.g.~movies, books, news) from large amount of available data, and has become crucial in many applications, like product recommendation, image retrieval, spam email filtering, news filtering, and web navigation etc.. An information filtering system must be able to understand users' information needs. Existing approaches either infer a user's profile by exploring his/her connections to other users, i.e.~collaborative filtering (CF), or analyzing the content descriptions of liked or disliked examples annotated by the user, ~i.e.~content-based filtering (CBF). Those methods work well to some extent, but are facing difficulties due to lack of insights into the problem. This thesis intensively studies a wide scope of information filtering technologies. Novel and principled machine learning methods are proposed to model users' information needs. The work demonstrates that the uncertainty of user profiles and the connections between them can be effectively modelled by using probability theory and Bayes rule. As one major contribution of this thesis, the work clarifies the ``structure'' of information filtering and gives rise to principled solutions. In summary, the work of this thesis mainly covers the following three aspects: Collaborative filtering: We develop a probabilistic model for memory-based collaborative filtering (PMCF), which has clear links with classical memory-based CF. Various heuristics to improve memory-based CF have been proposed in the literature. In contrast, extensions based on PMCF can be made in a principled probabilistic way. With PMCF, we describe a CF paradigm that involves interactions with users, instead of passively receiving data from users in conventional CF, and actively chooses the most informative patterns to learn, thereby greatly reduce user efforts and computational costs. Content-based filtering: One major problem for CBF is the deficiency and high dimensionality of content-descriptive features. Information items (e.g.~images or articles) are typically described by high-dimensional features with mixed types of attributes, that seem to be developed independently but intrinsically related. We derive a generalized principle component analysis to merge high-dimensional and heterogenous content features into a low-dimensional continuous latent space. The derived features brings great conveniences to CBF, because most existing algorithms easily cope with low-dimensional and continuous data, and more importantly, the extracted data highlight the intrinsic semantics of original content features. Hybrid filtering: How to combine CF and CBF in an ``smart'' way remains one of the most challenging problems in information filtering. Little principled work exists so far. This thesis reveals that people's information needs can be naturally modelled with a hierarchical Bayesian thinking, where each individual's data are generated based on his/her own profile model, which itself is a sample from a common distribution of the population of user profiles. Users are thus connected to each other via this common distribution. Due to the complexity of such a distribution in real-world applications, usually applied parametric models are too restrictive, and we thus introduce a nonparametric hierarchical Bayesian model using Dirichlet process. We derive effective and efficient algorithms to learn the described model. In particular, the finally achieved hybrid filtering methods are surprisingly simple and intuitively understandable, offering clear insights to previous work on pure CF, pure CBF, and hybrid filtering

    Bayesian nonparametric models for data exploration

    Get PDF
    Mención Internacional en el título de doctorMaking sense out of data is one of the biggest challenges of our time. With the emergence of technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such as self-driven cars or champion-level Go player programs, have demonstrated the potential benefits from exploiting data, mostly in well-defined supervised tasks. However, we have barely started to actually explore and truly understand data. In fact, data holds valuable information for answering most important questions for humanity: How does aging impact our physical capabilities? What are the underlying mechanisms of cancer? Which factors make countries wealthier than others? Most of these questions cannot be stated as well-defined supervised problems, and might benefit enormously from multidisciplinary research efforts involving easy-to-interpret models and rigorous data exploratory analyses. Efficient data exploration might lead to life-changing scientific discoveries, which can later be turned into a more impactful exploitation phase, to put forward more informed policy recommendations, decision-making systems, medical protocols or improved models for highly accurate predictions. This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory tasks across different scientific areas including sport sciences, cancer research, and economics. We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns within data. BNP models place a prior distribution over an infinite-dimensional parameter space, which makes them particularly useful in probabilistic models where the number of hidden parameters is unknown a priori. Under this prior distribution, the posterior distribution of the hidden parameters given the data will assign high probability mass to those configurations that best explain the observations. Hence, inference over the hidden variables can be performed using standard Bayesian inference techniques, therefore avoiding expensive model selection steps. This thesis is application-focused and highly multidisciplinary. More precisely, we propose an automatic grading system for sportive competitions to compare athletic performance regardless of age, gender and environmental aspects; we develop BNP models to perform genetic association and biomarker discovery in cancer research, either using genetic information and Electronic Health Records or clinical trial data; finally, we present a flexible infinite latent factor model of international trade data to understand the underlying economic structure of countries and their evolution over time.Uno de los principales desafíos de nuestro tiempo es encontrar sentido dentro de los datos. Con la aparición de tecnologías como Internet, redes de sensores, o métodos de secuenciación profunda del genoma, una verdadera explosión digital se ha visto desencadenada, afectando todos los campos científicos, así como nuestra vida diaria. Logros recientes como pueden ser los coches auto-dirigidos o programas que ganan a los seres humanos al milenario juego del Go, han demostrado con creces los posibles beneficios que podemos obtener de la explotación de datos, mayoritariamente en tareas supervisadas bien definidas. No obstante, apenas hemos empezado con la exploración de datos y su verdadero entendimiento. En verdad, los datos encierran información muy valiosa para responder a muchas de las preguntas más importantes para la humanidad: ¿Cómo afecta el envejecimiento a nuestras aptitudes físicas? ¿Cuáles son los mecanismos subyacentes del cáncer? ¿Qué factores explican la riqueza de ciertos países frente a otros? Si bien la mayoría de estas preguntas no pueden formularse como problemas supervisados bien definidos, éstas pueden ser abordadas mediante esfuerzos de investigación multidisciplinar que involucren modelos fáciles de interpretar y análisis exploratorios rigurosos. Explorar los datos de manera eficiente abre potencialmente la puerta a un sinnúmero de descubrimientos científicos en diversas áreas con impacto real en nuestras vidas, descubrimientos que a su vez pueden llevarnos a una mejor explotación de los datos, resultando en recomendaciones políticas adecuadas, sistemas precisos de toma de decisión, protocolos médicos optimizados o modelos con mejores capacidades predictivas. Esta tesis propone modelos Bayesianos no-paramétricos (BNP) adecuados para la resolución específica de tareas explorativas de los datos en diversos ámbitos científicos incluyendo ciencias del deporte, investigación contra el cáncer, o economía. Recurrimos a un planteamiento BNP para facilitar el descubrimiento de patrones ocultos inesperados subyacentes en los datos. Los modelos BNP definen una distribución a priori sobre un espacio de parámetros de dimensión infinita, lo cual los hace especialmente atractivos para enfoques probabilísticos donde el número de parámetros latentes es en principio desconocido. Bajo dicha distribución a priori, la distribución a posteriori de los parámetros ocultos dados los datos asignará mayor probabilidad a aquellas configuraciones que mejor explican las observaciones. De esta manera, la inferencia sobre el espacio de variables ocultas puede realizarse mediante técnicas estándar de inferencia Bayesiana, evitando el proceso de selección de modelos. Esta tesis se centra en el ámbito de las aplicaciones, y es de naturaleza multidisciplinar. En concreto, proponemos un sistema de gradación automática para comparar el rendimiento deportivo de atletas independientemente de su edad o género, así como de otros factores del entorno. Desarrollamos modelos BNP para descubrir asociaciones genéticas y biomarcadores dentro de la investigación contra el cáncer, ya sea contrastando información genética con la historia clínica electrónica de los pacientes, o utilizando datos de ensayos clínicos; finalmente, presentamos un modelo flexible de factores latentes infinito para datos de comercio internacional, con el objetivo de entender la estructura económica de los distintos países y su correspondiente evolución a lo largo del tiempo.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Joaquín Míguez Arenas.- Secretario: Daniel Hernández Lobato.- Vocal: Cédric Archambea

    Confluence of Vision and Natural Language Processing for Cross-media Semantic Relations Extraction

    Get PDF
    In this dissertation, we focus on extracting and understanding semantically meaningful relationships between data items of various modalities; especially relations between images and natural language. We explore the ideas and techniques to integrate such cross-media semantic relations for machine understanding of large heterogeneous datasets, made available through the expansion of the World Wide Web. The datasets collected from social media websites, news media outlets and blogging platforms usually contain multiple modalities of data. Intelligent systems are needed to automatically make sense out of these datasets and present them in such a way that humans can find the relevant pieces of information or get a summary of the available material. Such systems have to process multiple modalities of data such as images, text, linguistic features, and structured data in reference to each other. For example, image and video search and retrieval engines are required to understand the relations between visual and textual data so that they can provide relevant answers in the form of images and videos to the users\u27 queries presented in the form of text. We emphasize the automatic extraction of semantic topics or concepts from the data available in any form such as images, free-flowing text or metadata. These semantic concepts/topics become the basis of semantic relations across heterogeneous data types, e.g., visual and textual data. A classic problem involving image-text relations is the automatic generation of textual descriptions of images. This problem is the main focus of our work. In many cases, large amount of text is associated with images. Deep exploration of linguistic features of such text is required to fully utilize the semantic information encoded in it. A news dataset involving images and news articles is an example of this scenario. We devise frameworks for automatic news image description generation based on the semantic relations of images, as well as semantic understanding of linguistic features of the news articles

    Action recognition in depth videos using nonparametric probabilistic graphical models

    Get PDF
    Action recognition involves automatically labelling videos that contain human motion with action classes. It has applications in diverse areas such as smart surveillance, human computer interaction and content retrieval. The recent advent of depth sensing technology that produces depth image sequences has offered opportunities to solve the challenging action recognition problem. The depth images facilitate robust estimation of a human skeleton’s 3D joint positions and a high level action can be inferred from a sequence of these joint positions. A natural way to model a sequence of joint positions is to use a graphical model that describes probabilistic dependencies between the observed joint positions and some hidden state variables. A problem with these models is that the number of hidden states must be fixed a priori even though for many applications this number is not known in advance. This thesis proposes nonparametric variants of graphical models with the number of hidden states automatically inferred from data. The inference is performed in a full Bayesian setting by using the Dirichlet Process as a prior over the model’s infinite dimensional parameter space. This thesis describes three original constructions of nonparametric graphical models that are applied in the classification of actions in depth videos. Firstly, the action classes are represented by a Hidden Markov Model (HMM) with an unbounded number of hidden states. The formulation enables information sharing and discriminative learning of parameters. Secondly, a hierarchical HMM with an unbounded number of actions and poses is used to represent activities. The construction produces a simplified model for activity classification by using logistic regression to capture the relationship between action states and activity labels. Finally, the action classes are modelled by a Hidden Conditional Random Field (HCRF) with the number of intermediate hidden states learned from data. Tractable inference procedures based on Markov Chain Monte Carlo (MCMC) techniques are derived for all these constructions. Experiments with multiple benchmark datasets confirm the efficacy of the proposed approaches for action recognition

    AutoBayes Program Synthesis System Users Manual

    Get PDF
    Program synthesis is the systematic, automatic construction of efficient executable code from high-level declarative specifications. AutoBayes is a fully automatic program synthesis system for the statistical data analysis domain; in particular, it solves parameter estimation problems. It has seen many successful applications at NASA and is currently being used, for example, to analyze simulation results for Orion. The input to AutoBayes is a concise description of a data analysis problem composed of a parameterized statistical model and a goal that is a probability term involving parameters and input data. The output is optimized and fully documented C/C++ code computing the values for those parameters that maximize the probability term. AutoBayes can solve many subproblems symbolically rather than having to rely on numeric approximation algorithms, thus yielding effective, efficient, and compact code. Statistical analysis is faster and more reliable, because effort can be focused on model development and validation rather than manual development of solution algorithms and code

    Learning of classification models from group-based feedback

    Get PDF
    Learning of classification models in practice often relies on a nontrivial amount of human annotation effort. The most widely adopted human labeling process assigns class labels to individual data instances. However, such a process is very rigid and may end up being very time-consuming and costly to conduct in practice. Finding more effective ways to reduce human annotation effort has become critical for building machine learning systems that require human feedback. In this thesis, we propose and investigate a new machine learning approach - Group-Based Active Learning - to learn classification models from limited human feedback. A group is defined by a set of instances represented by conjunctive patterns that are value ranges over the input features. Such conjunctive patterns define hypercubic regions of the input data space. A human annotator assesses the group solely based on its region-based description by providing an estimate of the class proportion for the subpopulation covered by the region. The advantage of this labeling process is that it allows a human to label many instances at the same time, which can, in turn, improve the labeling efficiency. In general, there are infinitely many regions one can define over a real-valued input space. To identify and label groups/regions important for classification learning, we propose and develop a Hierarchical Active Learning framework that actively builds and labels a hierarchy of input regions. Briefly, our framework starts by identifying general regions covering substantial portions of the input data space. After that, it progressively splits the regions into smaller and smaller sub-regions and also acquires class proportion labels for the new regions. The proportion labels for these regions are used to gradually improve and refine a classification model induced by the regions. We develop three versions of the idea. The first two versions aim to build a single hierarchy of regions. One builds it statically using hierarchical clustering, while the other one builds it dynamically, similarly to the decision tree learning process. The third approach builds multiple hierarchies simultaneously, and it offers additional flexibility for identifying more informative and simpler regions. We have conducted comprehensive empirical studies to evaluate our framework. The results show that the methods based on the region-based active learning can learn very good classifiers from a very few and simple region queries, and hence are promising for reducing human annotation effort needed for building a variety of classification models
    corecore