392 research outputs found

    Compiling a domain specific language for dynamic programming

    Steffen P. Compiling a domain specific language for dynamic programming. Bielefeld (Germany): Bielefeld University; 2006

    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success (in particular that of the most widely used variant, Latent Dirichlet Allocation, or LDA) and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
    Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
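The bipartite document-word representation mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the URL given in the abstract); the function names `build_bipartite_edges` and `node_degrees` are hypothetical, and lowercasing plus whitespace tokenization is an assumption made only to keep the example short. The sketch builds the weighted edge list only and does not fit a stochastic block model.

```python
from collections import Counter


def build_bipartite_edges(documents):
    """Map each (document index, word) pair to the number of occurrences.

    Every key is one edge of a bipartite network: one endpoint is a
    document node, the other is a word node, and the value is the weight.
    """
    edges = Counter()
    for doc_index, text in enumerate(documents):
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for word in text.lower().split():
            edges[(doc_index, word)] += 1
    return edges


def node_degrees(edges):
    """Return weighted degrees for document nodes and for word nodes."""
    doc_degree = Counter()
    word_degree = Counter()
    for (doc_index, word), weight in edges.items():
        doc_degree[doc_index] += weight   # document length in tokens
        word_degree[word] += weight       # corpus-wide word frequency
    return doc_degree, word_degree


corpus = [
    "topic models infer topics",
    "networks have communities",
    "topic models and networks",
]
edges = build_bipartite_edges(corpus)
doc_degree, word_degree = node_degrees(edges)
```

A community-detection method would then take an edge list of this kind as input; which library performs that step is left open here.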

    Family-based genetic association models

    The high heritability and recurrence rates observed for several complex diseases justify the search for genetic risk factors. However, despite decades of intense and extensive research, the underlying genetic basis of most complex traits has not been fully deciphered. This unexplained genetic etiology underscores the need to examine etiologic disease mechanisms other than simple genetic effects alone, such as the effect of maternal genes or the effect of parental origin. Additionally, since genome-wide association studies (GWAS) are commonly underpowered due to the large number of single-nucleotide polymorphisms being tested, poorly designed and inadequately powered studies that are unable to capture most of the genetic variants underlying a trait might also contribute to the unexplained genetic etiology. Family-based study designs have been introduced specifically for studies of genetic risk factors. The main study unit is the case-parent triad design, which involves genotyping cases (affected offspring) and both their biological parents. However, a variety of other child-parent configurations and population-based study designs are also amenable to genetic association studies, including (but not limited to) cases in combination with unrelated controls, case-mother dyads, and case-parent triads in combination with unrelated controls or control-parent triads. Large clinical and population-based biobanks and national health registries have created unique opportunities for genetic, epidemiological, and clinical research worldwide. Nonetheless, there is currently a lack of flexible models that accommodate family structure in data. Models that incorporate non-standard genetic effects, such as maternal effects and parent-of-origin effects, are warranted. Moreover, joint models that integrate genetic, environmental, and epigenetic risk factors are needed to elucidate their combined effect on disease. 
This thesis focuses on models for analyzing GWAS data for binary disease traits as well as methods for maximizing the statistical power of such studies, allowing for a broad range of child-parent configurations in the calculations. Using maximum likelihood estimation in a log-linear model, we developed new methodology to detect parent-of-origin-environment interactions, a possible mechanism contributing to disease susceptibility that has not yet been sufficiently explored. The approach has been implemented in our R package Haplin. In the Haplin framework, we also developed an extensive setup for power and sample size calculations, both through analytic approximations and Monte Carlo simulations, which is essential not only in study planning but also in understanding and interpreting statistical findings. Within the power calculation module, we also implemented a relative efficiency calculator. Relative efficiency measures allow a more informative and general design comparison than standard power analyses. We aimed to optimize the study design in genetic association studies given the constraints of available resources, i.e., to maximize the statistical power at the lowest sample collection and genotyping cost.

    Inference and Estimation in Change Point Models for Censored Data

    In general, the change point problem considers inference of a change in distribution for a set of time-ordered observations. This has applications in a large variety of fields and can also apply to survival data. With improvements to medical diagnoses and treatments, incidences and mortality rates have changed. However, the most commonly used analysis methods do not account for such distributional changes. In survival analysis, change point problems can concern a shift in a distribution for a set of time-ordered observations, potentially under censoring or truncation. In this dissertation, we first propose a sequential testing approach for detecting multiple change points in the Weibull accelerated failure time model, since this is sufficiently flexible to accommodate increasing, decreasing, or constant hazard rates and is also the only continuous distribution for which the accelerated failure time model can be reparametrized as a proportional hazards model. Our sequential testing procedure does not require the number of change points to be known; this information is instead inferred from the data. We conduct a simulation study to show that the method accurately detects change points and estimates the model. The numerical results along with a real data application demonstrate that our proposed method can detect change points in the hazard rate. In survival analysis, most existing methods compare two treatment groups for the entirety of the study period. Some treatments may take a length of time to show effects in subjects. This has been called the time-lag effect in the literature, and in cases where time-lag effect is considerable, such methods may not be appropriate to detect significant differences between two groups. In the second part of this dissertation, we propose a novel non-parametric approach for estimating the point of treatment time-lag effect by using an empirical divergence measure. Theoretical properties of the estimator are studied. 
The results from simulated data and a real-data example support the proposed method.
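The abstract's remark that the Weibull accelerated failure time (AFT) model can be rewritten as a proportional hazards model can be made concrete with a short derivation. The notation below ($\mu$, $\gamma$, $\sigma$, $W$, $\beta$) is assumed for illustration and is not taken from the dissertation.

```latex
\begin{align}
\log T &= \mu + x^\top \gamma + \sigma W,
  \qquad \Pr(W > w) = \exp\{-e^{w}\}, \\
S(t \mid x) &= \exp\!\left\{-\exp\!\left(\frac{\log t - \mu - x^\top \gamma}{\sigma}\right)\right\}
  = \exp\!\left\{-t^{1/\sigma}\, e^{-(\mu + x^\top \gamma)/\sigma}\right\}, \\
h(t \mid x) &= -\frac{\partial}{\partial t} \log S(t \mid x)
  = \underbrace{\frac{1}{\sigma}\, t^{1/\sigma - 1} e^{-\mu/\sigma}}_{h_0(t)}
    \exp\!\left(x^\top \beta\right),
  \qquad \beta = -\frac{\gamma}{\sigma}.
\end{align}
```

Under these assumptions the covariates act multiplicatively on a baseline hazard $h_0(t)$, which is the proportional hazards form. The shape $1/\sigma$ gives an increasing hazard when $\sigma < 1$, a constant hazard when $\sigma = 1$, and a decreasing hazard when $\sigma > 1$.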

    2022 SDSU Data Science Symposium Presentation Abstracts

    This document contains abstracts for presentations and posters at the 2022 SDSU Data Science Symposium.

    Hidden Markov Models for Time-Inhomogeneous and Incompletely Observed Point Processes

    Many point processes such as earthquakes or volcanic eruptions usually have incomplete records with the degree of incompleteness varying over time. Consequently, hazard estimation from such time-inhomogeneous incomplete records is complicated and potentially biased. Since the number of missing events is unknown, two distinct HMM-type methodologies are proposed: one with the observed process having a fixed number of missing events between each pair of consecutively observed events, and the other with the observed process having a variable number of missing events between each pair of consecutively observed events in an incomplete point process record. In the first approach, a general class of inhomogeneous hidden semi-Markov models (IHSMMs) is proposed for modelling incompletely observed point processes when incompleteness does not necessarily behave in a stationary and memoryless manner. The key feature of the proposed model is that the sojourn times of the hidden states in the semi-Markov chain depend on time, making it an inhomogeneous semi-Markov chain. We check a conjecture of consistency of the parameter estimators of the proposed model by simulation study using direct numerical optimization of the log-likelihood function. We apply this class of models to a global volcanic eruption catalogue to investigate the time-dependent incompleteness of the record by proposing a particular IHSMM with time-dependent shifted Poisson distributed state durations and a renewal process as the observed process with a fixed number of missing events between each pair of consecutively observed events in the record. A combination of the Akaike Information Criterion and residual analysis is used to choose the best model. The selected inhomogeneous hidden semi-Markov model provides useful insights into the completeness of a global record of volcanic eruptions during the last 2000 years, demonstrating the effectiveness of this method. 
In the second approach, shifted compound Poisson-gamma (SCPG) and time-dependent SCPG (TSCPG) renewal processes are introduced in order to model the unknown, time-dependent random number of missing events between each pair of consecutively observed events in incompletely observed point processes. The SCPG renewal process models the shifted Poisson distributed number of missing events, and the TSCPG renewal process models the time-dependent shifted Poisson distributed number of missing events between each pair of consecutively observed events in the gamma renewal process. In addition to IHSMMs and SCPG renewal processes, a special case of inhomogeneous hidden Markov models (IHMMs) is developed to examine nonstationary incompleteness of point processes. Multinomial logistic functions are adopted to formulate the time-varying transition probabilities in the proposed IHMM in a way that characterizes the temporal structure of the missingness of events in records. The SCPG and TSCPG renewal processes are used as the observed processes in HMMs, HSMMs, IHMMs and IHSMMs to model the time-dependent incomplete point process records. Simulation experiments are employed to check the performance of the proposed renewal processes with different types of HMMs. We apply these models to a global volcanic eruption record covering the last 10,000 years to demonstrate how the completeness of the record and the future hazard rate can be estimated. All proposed models can be used to model other types of inhomogeneous processes with or without missing data.

    Class discovery via feature selection in unsupervised settings

    Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome. We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (or rare) and the signal of each important feature is typically minimal (or weak). Genetic datasets typically consist of hundreds of samples (n), each with tens of thousands of gene-level measurements (p), resulting in the well-known statistical "large p small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select features which contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking aids both in reducing data dimensionality and in identifying a set of genes that are crucial in understanding cancer subtypes. In this paper, we present an outline of cutting-edge feature selection methods, and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed.
Additionally, we present a case study that applies these new feature selection methods to different levels of genetic data to examine their practical implementation within the field of bioinformatics.

    Interval-censored semi-competing risks data: a novel approach for modelling bladder cancer

    This thesis deals with survival analysis techniques in situations with multiple events and complex censoring patterns. We propose a new methodology for handling semi-competing risks when the data are interval-censored. The work is motivated by our collaboration with the Spanish Bladder Cancer Study (SBC/EPICURO), the largest study of bladder cancer carried out in Spain to date. Our contribution to the project focuses on modelling the course of the disease and identifying its prognostic factors.
The course of complex diseases such as cancer or HIV infection is characterized by the occurrence of multiple events in the same patient, for example disease relapse or death. These events can be terminal, when follow-up of the patient stops after the event, or intermediate, when the individual remains under observation. The presence of terminal events complicates the analysis of intermediate ones because it prevents their complete observation, inducing possibly dependent censoring.
This setting requires appropriate methodologies. The following methods are used: competing risks, multi-state models, and semi-competing risks. Applying competing-risks methods and multi-state models yields two relevant contributions to knowledge of the disease: (1) the characterization of patients at high risk of progression as the first event after diagnosis, and (2) the construction of a dynamic prognostic model for the risk of progression.
The competing-risks setting arises when we want to describe the time to the first of K possible events, together with an indicator of the type of event observed. In the SBC/EPICURO study, it is relevant to study the time to the first of recurrence, progression, or death. Characterizing this first event would allow the most suitable treatment to be selected according to the patient's baseline risk profile.
Multi-state models describe the different courses the disease can follow, establishing relationships between the events of interest: for example, a patient may experience a recurrence of the primary tumour and then die, or may die without any relapse. An attractive feature of these models is that they allow predictions of the risk of future events for a patient given that patient's history up to that moment. For bladder cancer, we can assess how having had a previous recurrence influences the risk of progression.
A special case of a multi-state model is one with an intermediate event E1 and a terminal event E2. Let T1 and T2 be the times to these events, respectively. Neither competing-risks analysis nor multi-state models address the marginal distribution of T1. Indeed, competing-risks analysis deals with the distribution of the minimum of the two times, T=min(T1,T2), whereas multi-state models focus on the conditional distribution of T2 given T1, T2|T1, that is, on how the occurrence of E1 modifies the risk of E2. In both cases, the distribution of T1 is not identifiable from the observed data. The situation described above, in which the occurrence of a terminal event prevents observation of the intermediate event, is known as semi-competing risks (Fine et al., 2001). The strategy of these authors was to assume a model for the joint distribution of (T1, T2) and then recover the marginal distribution of T1 derived from that model.
We propose a new methodology for semi-competing risks when the time to the intermediate event, T1, is interval-censored. In many longitudinal medical studies, the occurrence of the event of interest is assessed at periodic patient visits, so T1 is unknown but is known to lie in the interval between two consecutive visit times. Methods for semi-competing risks in the usual right-censoring context are not valid in this case, and a new approach is needed. In this work we extend the semi-parametric methodology proposed by Fine et al. (2001), which assumes a Clayton (1978) copula model to describe the dependence between T1 and T2. Assuming the same model, we develop an iterative algorithm that jointly estimates the association parameter of the copula model and the survival function of the intermediate time T1.
Fine, J. P.; Jiang, H. & Chappell, R. (2001), 'On Semi-Competing Risks Data', Biometrika 88(4), 907-919.
Clayton, D. G. (1978), 'A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence', Biometrika 65(1), 141-151.
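The Clayton (1978) copula that this abstract uses to link T1 and T2 can be written down in a few lines. The sketch below is illustrative only and is not the thesis's estimation algorithm: it uses the standard parameterization with association parameter `alpha >= 0` (an assumption; the cited papers may parameterize the model differently), and the function names are hypothetical. It evaluates the joint survival probability from the two marginal survival probabilities and reports the implied Kendall's tau.

```python
def clayton_joint_survival(s1, s2, alpha):
    """Joint survival P(T1 > t1, T2 > t2) under a Clayton survival copula.

    s1, s2 : marginal survival probabilities S1(t1) and S2(t2), each in (0, 1].
    alpha  : association parameter, alpha >= 0 (alpha = 0 means independence).
    """
    if not (0.0 < s1 <= 1.0 and 0.0 < s2 <= 1.0):
        raise ValueError("marginal survival probabilities must lie in (0, 1]")
    if alpha < 0.0:
        raise ValueError("alpha must be non-negative in this parameterization")
    if alpha == 0.0:
        # Limiting case of the formula below: independent event times.
        return s1 * s2
    return (s1 ** -alpha + s2 ** -alpha - 1.0) ** (-1.0 / alpha)


def kendall_tau(alpha):
    """Kendall's tau implied by the Clayton copula with parameter alpha."""
    return alpha / (alpha + 2.0)


# Example: both marginal survival probabilities equal 0.5.
independent = clayton_joint_survival(0.5, 0.5, 0.0)
dependent = clayton_joint_survival(0.5, 0.5, 1.0)
```

With positive `alpha` the joint survival probability exceeds the product of the marginals, reflecting positive dependence between the intermediate and terminal event times.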