23,619 research outputs found

    Role of semantic indexing for text classification.

    Get PDF
    The Vector Space Model (VSM) of text representation suffers a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, failure to take into account the semantic relatedness between terms means that document similarity is not properly captured in the VSM. To address this problem, semantic indexing approaches have been proposed for modelling the semantic relatedness between terms in document representations. Accordingly, in this thesis, we empirically review the impact of semantic indexing on text classification. This empirical review allows us to answer one important question: how beneficial is semantic indexing to text classification performance. We also carry out a detailed analysis of the semantic indexing process which allows us to identify reasons why semantic indexing may lead to poor text classification performance. Based on our findings, we propose a semantic indexing framework called Relevance Weighted Semantic Indexing (RWSI) that addresses the limitations identified in our analysis. RWSI uses relevance weights of terms to improve the semantic indexing of documents. A second problem with the VSM is the lack of supervision in the process of creating document representations. This arises from the fact that the VSM was originally designed for unsupervised document retrieval. An important feature of effective document representations is the ability to discriminate between relevant and non-relevant documents. For text classification, relevance information is explicitly available in the form of document class labels. Thus, more effective document vectors can be derived in a supervised manner by taking advantage of available class knowledge. Accordingly, we investigate approaches for utilising class knowledge for supervised indexing of documents. Firstly, we demonstrate how the RWSI framework can be utilised for assigning supervised weights to terms for supervised document indexing. Secondly, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. A further limitation of the standard VSM is that an indexing vocabulary that consists only of terms from the document collection is used for document representation. This is based on the assumption that terms alone are sufficient to model the meaning of text documents. However for certain classification tasks, terms are insufficient to adequately model the semantics needed for accurate document classification. A solution is to index documents using semantically rich concepts. Accordingly, we present an event extraction framework called Rule-Based Event Extractor (RUBEE) for identifying and utilising event information for concept-based indexing of incident reports. We also demonstrate how certain attributes of these events e.g. negation, can be taken into consideration to distinguish between documents that describe the occurrence of an event, and those that mention the non-occurrence of that event

    Advanced content-based semantic scene analysis and information retrieval: the SCHEMA project

    Get PDF
    The aim of the SCHEMA Network of Excellence is to bring together a critical mass of universities, research centers, industrial partners and end users, in order to design a reference system for content-based semantic scene analysis, interpretation and understanding. Relevant research areas include: content-based multimedia analysis and automatic annotation of semantic multimedia content, combined textual and multimedia information retrieval, semantic -web, MPEG-7 and MPEG-21 standards, user interfaces and human factors. In this paper, recent advances in content-based analysis, indexing and retrieval of digital media within the SCHEMA Network are presented. These advances will be integrated in the SCHEMA module-based, expandable reference system

    Multimedia search without visual analysis: the value of linguistic and contextual information

    Get PDF
    This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content, and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective. The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research

    Semantic Retrieval and Automatic Annotation: Linear Transformations, Correlation and Semantic Spaces

    No full text
    This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each of the techniques are discussed and contrasted. The paper also discusses how these techniques might actually scale to a real-world retrieval problem, and demonstrates this though a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library
    corecore