    Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network

    Bibliographic analysis considers the author's research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a nonparametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows the extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task. Comment: Preprint for the journal Machine Learning
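
    As a rough illustration of the kind of generative process the abstract describes, the sketch below combines author-topic mixing with a Poisson link whose rate grows with the topical similarity of two documents. It is our own minimal, fixed-dimension paraphrase (the paper's model is nonparametric), and all dimensions and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_words, n_authors, n_docs, doc_len = 5, 100, 10, 20, 50

# Topic-word and author-topic distributions. The paper's model is
# nonparametric; fixed-dimension symmetric Dirichlets keep this sketch short.
phi = rng.dirichlet(np.full(n_words, 0.1), size=n_topics)      # topics over words
theta = rng.dirichlet(np.full(n_topics, 0.5), size=n_authors)  # authors over topics

authors = rng.integers(0, n_authors, size=n_docs)
docs, doc_topic = [], np.zeros((n_docs, n_topics))
for d in range(n_docs):
    z = rng.choice(n_topics, size=doc_len, p=theta[authors[d]])  # topic per token
    docs.append([rng.choice(n_words, p=phi[k]) for k in z])      # word per token
    doc_topic[d] = np.bincount(z, minlength=n_topics) / doc_len

# Poisson link: citation counts between documents, with a rate that grows
# with their topical similarity, in the spirit of Poisson link models.
lam = 0.5 * doc_topic @ doc_topic.T
citations = rng.poisson(lam)
print(citations.shape)
```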

    Variable Word Rate N-grams

    The rate of occurrence of words is not uniform but varies from document to document. Despite this observation, parameters for conventional n-gram language models are usually derived under the assumption of a constant word rate. In this paper we investigate the use of a variable word rate assumption, modelled by a Poisson distribution or a continuous mixture of Poissons. We present an approach to estimating the relative frequencies of words or n-grams that takes prior information about their occurrences into account. Discounting and smoothing schemes are also considered. On the Broadcast News task, the approach demonstrates a perplexity reduction of up to 10%. Comment: 4 pages, 4 figures, ICASSP-2000
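
    The variable word rate idea can be illustrated with a Gamma prior on a per-token Poisson rate, i.e. a continuous mixture of Poissons whose posterior mean shrinks the raw relative frequency toward the prior mean. This is a minimal sketch under assumed hyperparameters, not the paper's estimator:

```python
def smoothed_rate(count, n_tokens, alpha=0.5, beta=10_000.0):
    """Posterior mean word rate per token under a Gamma(alpha, beta) prior
    on a Poisson rate -- a continuous mixture of Poissons (marginally a
    negative binomial). The estimate shrinks the raw relative frequency
    count/n_tokens toward the prior mean alpha/beta."""
    return (alpha + count) / (beta + n_tokens)

# A word seen 3 times in a 1,000-token document:
mle = 3 / 1_000                       # constant word rate assumption (MLE)
print(mle, smoothed_rate(3, 1_000))   # shrunken, variable-rate estimate
```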

    Should the advanced measurement approach be replaced with the standardized measurement approach for operational risk?

    Recently, the Basel Committee on Banking Supervision proposed to replace all existing approaches to operational risk capital, including the Advanced Measurement Approach (AMA), with a simple formula referred to as the Standardised Measurement Approach (SMA). This paper discusses and studies the weaknesses and pitfalls of the SMA, such as instability, risk insensitivity, super-additivity and the implicit relationship between the SMA capital model and systemic risk in the banking sector. We also discuss issues with the closely related operational risk Capital-at-Risk (OpCar) model proposed by the Basel Committee, which is the precursor to the SMA. In conclusion, we advocate maintaining the AMA internal model framework and suggest, as an alternative, a number of standardisation recommendations that could be considered to unify internal modelling of operational risk. The findings and views presented in this paper have been discussed with and supported by many OpRisk practitioners and academics in Australia, Europe, the UK and the USA, and recently at the OpRisk Europe 2016 conference in London.
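
    To make the super-additivity critique concrete, the sketch below implements an SMA-style capital formula paraphrased from the March 2016 consultative document: a Business Indicator Component (BIC) built from marginal coefficients over BI buckets, and a logarithmic interpolation between the BIC and a loss component derived from historical losses. The bucket boundaries and coefficients are quoted from memory and should be treated as illustrative, not authoritative:

```python
import math

# Illustrative bucket boundaries and marginal coefficients, paraphrased
# from the March 2016 BCBS consultative document (amounts in EUR million).
BUCKETS = [(1_000, 0.11), (3_000, 0.15), (10_000, 0.19),
           (30_000, 0.23), (float("inf"), 0.29)]

def bi_component(bi):
    """Business Indicator Component: marginal rates applied over BI buckets."""
    bic, lower = 0.0, 0.0
    for upper, rate in BUCKETS:
        bic += rate * max(0.0, min(bi, upper) - lower)
        if bi <= upper:
            break
        lower = upper
    return bic

def sma_capital(bi, loss_component):
    """SMA-style capital: pure BIC in bucket 1, else a logarithmic
    interpolation driven by the ratio of losses to the BIC."""
    bic = bi_component(bi)
    if bi <= 1_000:                      # bucket 1: no loss sensitivity
        return bic
    return 110 + (bic - 110) * math.log(math.e - 1 + loss_component / bic)

# Super-additivity check: split one bank into two identical halves.
whole = sma_capital(20_000, 5_000)
halves = 2 * sma_capital(10_000, 2_500)
print(whole, halves)
```

    For these inputs the merged bank attracts more capital than its two standalone halves combined, which is the super-additive behaviour the paper analyses.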

    Information Retrieval Models

    Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we find information on the world wide web if there were no web search engines? How would we manage our email without spam filtering? Much of the development of information retrieval technology, such as web search engines and spam filters, requires a combination of experimentation and theory. Experimentation and rigorous empirical testing are needed to keep up with increasing volumes of web pages and emails. Furthermore, experimentation and constant adaptation of technology are needed in practice to counteract the efforts of people who deliberately try to manipulate the technology, such as email spammers. However, if experimentation is not guided by theory, engineering becomes trial and error. New problems and challenges for information retrieval come up constantly. They cannot possibly be solved by trial and error alone. So, what is the theory of information retrieval? There is not one convincing answer to this question. There are many theories, here called formal models, and each model is helpful for the development of some information retrieval tools, but not so helpful for the development of others. In order to understand information retrieval, it is essential to learn about these retrieval models. In this chapter, some of the most important retrieval models are gathered and explained in a tutorial style.
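
    As one example of the formal models such a chapter surveys, here is a compact, self-contained implementation of the widely used BM25 ranking function. It is our own sketch with standard parameter defaults, not code from the chapter:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25: sum of IDF-weighted, length-normalised term frequencies."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

# Toy collection: rank each document against the query ["spam", "email"].
docs = [["spam", "filter", "email"], ["web", "search", "engine"], ["email", "spam"]]
df = Counter(t for d in docs for t in set(d))          # document frequencies
avg = sum(map(len, docs)) / len(docs)
for d in docs:
    print(d, bm25_score(["spam", "email"], d, df, len(docs), avg))
```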

    Proposed best practice for projects that involve modelling and simulation

    Modelling and simulation has been used in many ways when developing new treatments. To be useful and credible, it is generally agreed that modelling and simulation should be undertaken according to some kind of best practice. A number of authors have suggested elements required for best practice in modelling and simulation, including the pre-specification of goals, assumptions, methods, and outputs. However, a project that involves modelling and simulation could be simple or complex, and the modelling and simulation could be of relatively low or high importance to the project. It has been argued that the level of detail and the strictness of pre-specification should be allowed to vary, depending on the complexity and importance of the project. This best practice document does not prescribe how to develop a statistical model. Rather, it describes the elements required for the specification of a project and requires that the practitioner justify in the specification the omission of any of the elements and, in addition, justify the level of detail provided about each element. This document is an initiative of the Special Interest Group for modelling and simulation, a body open to members of Statisticians in the Pharmaceutical Industry and the European Federation of Statisticians in the Pharmaceutical Industry. Examples of a very detailed specification and a less detailed specification are included as appendices.

    Query generation from multiple media examples

    This paper exploits a unified media document representation called feature terms for query generation from multiple media examples, e.g. images. A feature term refers to a value interval of a media feature. A media document is therefore represented by a frequency vector over feature term appearances. This approach (1) facilitates feature accumulation from multiple examples, and (2) enables the exploration of text-based retrieval models for multimedia retrieval. Three statistical criteria, minimised chi-squared, minimised AC/DC rate and maximised entropy, are proposed to extract feature terms from a given media document collection. Two textual ranking functions, KL divergence and a BM25-like retrieval model, are adapted to estimate media document relevance. Experiments on the Corel photo collection and the TRECVid 2006 collection show the effectiveness of feature-term-based queries in image and video retrieval.
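
    A minimal sketch of the feature-term idea: quantise a continuous media feature into value intervals, represent pooled query examples and collection documents as smoothed frequency vectors over those intervals, and rank by KL divergence. The binning, smoothing and synthetic data below are our own illustrative choices, not the paper's exact extraction criteria:

```python
import numpy as np

def feature_terms(features, edges):
    """Map each continuous feature value to the interval ('feature term')
    it falls in, and return a smoothed term-frequency vector."""
    terms = np.digitize(features, edges)                 # term index per value
    counts = np.bincount(terms, minlength=len(edges) + 1).astype(float)
    counts += 0.1                                        # additive smoothing
    return counts / counts.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
edges = np.linspace(0.0, 1.0, 9)          # 9 cut points -> 10 feature terms
# Pool several example images into one query vector (feature accumulation).
query = feature_terms(rng.uniform(0.0, 0.5, 200), edges)
collection = [feature_terms(rng.uniform(0.0, 1.0, 200), edges) for _ in range(5)]
# Rank documents: smaller divergence from the query means more relevant.
ranking = sorted(range(len(collection)),
                 key=lambda i: kl_divergence(query, collection[i]))
print(ranking)
```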

    Dirichlet belief networks for topic structure learning

    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning the word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on the word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics in the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model. Comment: accepted at NIPS 2018
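
    A minimal sketch of the layered construction described above: each topic in a layer draws its word distribution from a Dirichlet centred on a mixture of the topics in the layer above, so lower layers refine the upper-layer topics. Layer widths, concentration parameters and the mixing prior are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, widths = 50, [3, 6, 12]      # topics per layer, top to bottom

# Top layer: topics drawn directly from a symmetric Dirichlet over words.
layers = [rng.dirichlet(np.full(n_words, 0.1), size=widths[0])]
for width in widths[1:]:
    above = layers[-1]
    # Each child topic mixes the parent topics with Dirichlet weights,
    # then draws its own word distribution around that mixture.
    weights = rng.dirichlet(np.full(len(above), 1.0), size=width)
    base = weights @ above                        # mixture of upper-layer topics
    layers.append(np.vstack([rng.dirichlet(50.0 * b) for b in base]))

for l, topics in enumerate(layers):
    print(f"layer {l}: {topics.shape[0]} topics over {n_words} words")
```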