    Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network

    Bibliographic analysis considers the author's research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a nonparametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task. Comment: Preprint for Journal Machine Learnin

    Identifying Rare and Subtle Behaviors: A Weakly Supervised Joint Topic Model

    Statistical Machine Learning for Breast Cancer Detection with Terahertz Imaging

    Breast conserving surgery (BCS) is a common breast cancer treatment option, in which the cancerous tissue is excised while leaving most of the healthy breast tissue intact. The lack of in-situ margin evaluation unfortunately results in a re-excision rate of 20-30% for this type of procedure. This study aims to design statistical and machine learning segmentation algorithms for the detection of breast cancer in BCS by using terahertz (THz) imaging. Given the material characterization properties of the non-ionizing radiation in the THz range, we intend to employ the responses from the THz system to identify healthy and cancerous breast tissue in BCS samples. In particular, this dissertation describes four segmentation algorithms for the detection of breast cancer in THz imaging. We first explore the performance of one-dimensional (1D) Gaussian mixture and t-mixture models with Markov chain Monte Carlo (MCMC). Second, we propose a novel low-dimension ordered orthogonal projection (LOOP) algorithm for the dimension reduction of the THz information through a modified Gram-Schmidt process. Once the key features within the THz waveform have been detected by LOOP, the segmentation algorithm employs a multivariate Gaussian mixture model with MCMC and expectation maximization (EM). Third, we explore the spatial information of each pixel within the THz image through a Markov random field (MRF) approach. Finally, we introduce a supervised multinomial probit regression algorithm with polynomial and kernel data representations. For evaluation purposes, this study makes use of fresh and formalin-fixed paraffin-embedded (FFPE) heterogeneous human and mouse tissue models for the quantitative assessment of the segmentation performance in terms of receiver operating characteristic (ROC) curves. Overall, the experimental results demonstrate that the proposed approaches represent a promising technique for tissue segmentation within THz images of freshly excised breast cancer samples.
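
    The abstract above mentions dimension reduction through a modified Gram-Schmidt process. As a rough illustration only, the sketch below shows plain modified Gram-Schmidt orthonormalisation followed by projection onto the resulting basis. It is not the LOOP algorithm itself, whose ordering and selection rules are not given in the abstract. The function names, the tolerance, and the synthetic "waveforms" are all assumptions made for the example.

```python
import numpy as np


def modified_gram_schmidt(vectors, tol=1e-10):
    """Orthonormalise the rows of `vectors` (hypothetical helper).

    Rows that are numerically dependent on earlier rows are dropped.
    """
    basis = []
    for v in np.asarray(vectors, dtype=float):
        w = v.copy()
        # Modified variant: subtract each projection from the running
        # residual w, rather than from the original vector v.
        for q in basis:
            w -= np.dot(q, w) * q
        norm = np.linalg.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)


def project(waveforms, basis):
    """Return the coordinates of each waveform in the orthonormal basis."""
    return np.asarray(waveforms, dtype=float) @ basis.T


# Synthetic stand-in data (assumption): 5 waveforms of 64 samples each.
rng = np.random.default_rng(0)
waveforms = rng.normal(size=(5, 64))

# Build a 3-vector basis from the first three waveforms, then reduce
# every 64-sample waveform to 3 features.
basis = modified_gram_schmidt(waveforms[:3])
features = project(waveforms, basis)
```

    In a segmentation pipeline of the kind described, low-dimensional features like these would then be passed to a mixture model; that step is not shown here.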