
    Characterization and collection of information from heterogeneous multimedia sources with users' parameters for decision support

    No single information source can satisfy the divergent and dynamic needs of users all the time. Integrating information from divergent sources can compensate for deficiencies in information content. We present how information from multimedia documents can be collected by associating a generic database with a federated database. Information collected in this way is made relevant for decision making by integrating usage parameters and user parameters. We identified seven different classifications of multimedia documents.

    Text Document Classification: An Approach Based on Indexing

    In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is effectively preserved with the help of a novel data structure called the ‘Status Matrix’, and a corresponding classification technique is proposed for efficient classification of text documents. In addition, to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents which contain the term. To corroborate the efficacy of the proposed representation and the status-matrix-based classification, we have conducted extensive experiments on various datasets. Original source URL: http://aircconline.com/ijdkp/V2N1/2112ijdkp04.pdf For more details: http://airccse.org/journal/ijdkp/vol2.htm
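
    As a rough illustration of the term-indexing idea described in this abstract, the sketch below maps each term to the class labels of the training documents containing it and classifies a new document by a simple vote over the labels retrieved for its terms. A plain Python dict stands in for the paper's B-tree, the ‘Status Matrix’ term-sequence structure is not reproduced, and the toy documents, labels, and function names are purely hypothetical.

```python
from collections import defaultdict

# Hypothetical training data: (document terms, class label) pairs.
training_docs = [
    (["stock", "market", "shares"], "finance"),
    (["match", "goal", "league"], "sports"),
    (["market", "trade", "bank"], "finance"),
]

# Term index: each term maps to the class labels of the documents containing it.
# A plain dict stands in here for the ordered B-tree used in the paper.
term_index = defaultdict(list)
for terms, label in training_docs:
    for term in set(terms):
        term_index[term].append(label)

def classify(doc_terms):
    """Vote over the class labels retrieved for each term of the new document."""
    votes = defaultdict(int)
    for term in doc_terms:
        for label in term_index.get(term, []):
            votes[label] += 1
    return max(votes, key=votes.get) if votes else None

print(classify(["market", "shares", "bank"]))  # expected: 'finance'
```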

    Document clustering using locality preserving indexing


    Incorporating meaning into the representation of textual documents: a state of the art

    This document gives an overview of the state of the art in the representation of meaning in textual documents.

    Discovering latent topical structure by second-order similarity analysis

    This is the post-print of the article. Copyright © 2011 ASIS&T. Document similarity models are typically derived from a term-document vector space representation by comparing all vector pairs using some similarity measure. Computing similarity directly from a ‘bag of words’ model can be problematic because term independence causes the relationships between synonymous and related terms and the contextual influences that determine the ‘sense’ of polysemous terms to be ignored. This paper compares two methods that potentially address these problems by modelling the higher-order relationships that lie latent within the original vector space. The first is latent semantic analysis (LSA), a dimension reduction method which is a well-known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser known, yet conceptually simple approach of second-order similarity (SOS) analysis, where similarity is measured in terms of profiles of first-order similarities as computed directly from the term-document space. Nearest neighbour tests show that SOS analysis produces similarity models that are consistently better than both first-order and LSA-derived models at resolving both coarse and fine level semantic clusters. SOS analysis has been criticised for its cubic complexity. A second contribution is the novel application of vector truncation to reduce the run-time by a constant factor. Speed-ups of four to ten times are found to be easily achievable without losing the structural benefits associated with SOS analysis.
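
    A minimal sketch of second-order similarity analysis as described in this abstract, assuming cosine similarity over a toy term-document matrix as the first-order measure and top-k truncation of each first-order profile as the vector-truncation step; the data and the choice of k are illustrative only.

```python
import numpy as np

# Toy term-document matrix (documents are rows); values stand in for term weights.
X = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 1.0, 0.0],
], dtype=float)

def cosine_matrix(A):
    """Pairwise cosine similarities between the rows of A."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    A_unit = A / np.clip(norms, 1e-12, None)
    return A_unit @ A_unit.T

# First-order similarities: compare documents directly in the term space.
S1 = cosine_matrix(X)

def truncate(S, k):
    """Vector truncation: keep only the k largest entries of each first-order
    profile and zero the rest, to cut the cost of the second pass."""
    out = np.zeros_like(S)
    idx = np.argsort(S, axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(S, idx, axis=1), axis=1)
    return out

# Second-order similarities: compare documents by their profiles of
# first-order similarities rather than by their raw term vectors.
S2 = cosine_matrix(truncate(S1, k=2))
print(np.round(S2, 3))
```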

    Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring

    Text representation is a process of transforming text into formats that computer systems can use for subsequent information-related tasks such as text classification. Representing text faces two main challenges: meaningfulness of representation and unknown terms. Research has shown evidence that these challenges can be resolved by using the rich semantics in ontologies. This study aims to address these challenges by using ontology-based representation and unknown-term reasoning approaches in the context of content scoring of speech, which is a less explored area compared to common ones such as categorizing text corpora (e.g. 20 Newsgroups and Reuters). From the perspective of language assessment, the increasing number of language learners taking second language tests makes automatic scoring an attractive alternative to human scoring for delivering rapid and objective scores of written and spoken test responses. This study focuses on the speaking section of second language tests and investigates ontology-based approaches to speech scoring. Most previous automated speech scoring systems for spontaneous responses of test takers assess speech primarily using acoustic features such as fluency and pronunciation, while text features are less involved and exploited. As content is an integral part of speech, the study is motivated by the lack of rich text features in speech scoring and is designed to examine the effects of different text features on scoring performance. A central question to the study is how speech transcript content can be represented in an appropriate way for speech scoring. Previously used approaches from essay and speech scoring systems include bag-of-words and latent semantic analysis representations, which are adopted as baselines in this study; the experimental approaches are ontology-based, which can help improve the meaningfulness of representation units and estimate the importance of unknown terms. Two general domain ontologies, WordNet and Wikipedia, are used for the ontology-based representations. In addition to comparing representation approaches, the author analyzes which parameter option leads to the best performance within a particular representation. The experimental results show that on average, ontology-based representations slightly enhance speech scoring performance on all measurements when combined with the bag-of-words representation; reasoning over unknown terms can increase performance on one measurement (cos.w4) but decreases it on others. Due to the small data size, the significance test (t-test) shows that the enhancement from ontology-based representations is inconclusive. The contributions of the study include: 1) it examines the effects of different representation approaches on speech scoring tasks; 2) it enhances the understanding of the mechanisms of representation approaches and their parameter options via in-depth analysis; 3) the representation methodology and framework can be applied to other tasks such as automatic essay scoring.
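
    As a rough, hypothetical sketch of an ontology-augmented bag-of-words representation in the spirit of this study, the snippet below combines word counts with WordNet synset counts via NLTK, simply taking each token's first listed sense; the study's actual sense selection, Wikipedia-based representation, and unknown-term reasoning are not reproduced here.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def bow_features(tokens):
    """Plain bag-of-words counts."""
    return Counter(tokens)

def wordnet_features(tokens):
    """Map each token to its first WordNet synset, when one exists."""
    synsets = []
    for tok in tokens:
        senses = wn.synsets(tok)
        if senses:
            synsets.append(senses[0].name())  # e.g. 'student.n.01'
    return Counter(synsets)

def combined_features(tokens):
    """Ontology-augmented representation: word counts plus synset counts."""
    feats = bow_features(tokens)
    feats.update(wordnet_features(tokens))
    return feats

tokens = ["student", "speaks", "about", "climate", "change"]
print(combined_features(tokens))
```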

    Fisher Kernels and Probabilistic Latent Semantic Models

    Tasks that rely on the semantic content of documents, notably Information Retrieval and Document Classification, can benefit from a good account of document context, i.e. the semantic association between documents. To this effect, the scheme of latent semantics blends individual words appearing throughout a document collection into latent topics, thus providing a way to handle documents that is less constrained than the conventional approach by the mere appearance of this or that word. Probabilistic latent semantic models take the matter further by providing assumptions on how the documents observed in the collection would have been generated. This allows the derivation of inference algorithms that can fit the model parameters to the observed document collection; with their values set, these parameters can then be used to compute the similarities between documents. Fisher kernels, similarity functions rooted in information geometry, constitute good candidates to measure the similarity between documents in the framework of probabilistic latent semantic models. In this context, we study the use of Fisher kernels for the Probabilistic Latent Semantic Indexing (PLSI) model. By thoroughly analysing the generative process of PLSI, we derive the proper Fisher kernel for PLSI and expose the hypotheses that relate former work to this kernel. In particular, we confirm that the Fisher information matrix (FIM) should not be approximated by the identity in the case of PLSI. We also study how the contribution of the latent topics and that of the distribution of words among the topics affect the performance of the Fisher kernel; we then provide empirical evidence and theoretical arguments showing that the Fisher kernel originally published by Hofmann, corrected to account for the FIM, is the best of the PLSI Fisher kernels. It can compete with the strong BM25 baseline, and even significantly outperforms it when documents sharing few words must be matched. We further study PLSI document similarities by applying the language model approach. This approach shuns the usual IR paradigm that considers documents and queries to be of a similar nature. Instead, it considers documents as representative of language models, and uses probabilistic tools to determine which of these models would have generated the query with the highest probability. Using this scheme in the framework of PLSI provides a way to bypass the issue of query representation, which constitutes one of the specific challenges of PLSI. We find the language model approach to perform as well as the best of the Fisher kernels when enough latent categories are provided. Finally, we propose a new probabilistic latent semantic model consisting of a mixture of Smoothed Dirichlet distributions which, by better modelling word burstiness, provides a more realistic account of empirical observations on real document collections than the commonly used multinomials.
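
    For reference, Hofmann's PLSI Fisher kernel discussed in this abstract is usually written in the following form, where the first term compares the topic mixtures of two documents and the second compares their empirical word distributions weighted by the topic posteriors; this is only a sketch of the standard formulation, not the exact FIM-corrected variant analysed in the thesis.

```latex
K(d_i, d_j) \;=\; \sum_{k} \frac{P(z_k \mid d_i)\, P(z_k \mid d_j)}{P(z_k)}
\;+\; \sum_{w} \hat{P}(w \mid d_i)\, \hat{P}(w \mid d_j)
      \sum_{k} \frac{P(z_k \mid d_i, w)\, P(z_k \mid d_j, w)}{P(w \mid z_k)}
```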