87 research outputs found

    A System for Natural Language Unmarked Clausal Transformations in Text-to-Text Applications

    Get PDF
    A system is proposed that separates clauses from complex sentences into simpler stand-alone sentences. This is useful as an initial step on raw text: the resulting processed text may be fed into text-to-text applications such as Automatic Summarization, Question Answering, and Machine Translation, where complex sentences are difficult to process. Grammatical natural language transformations provide a possible method for simplifying complex sentences and thereby enhancing the results of text-to-text applications. Using shallow parsing, this system improves on existing systems in identifying and separating marked and unmarked embedded clauses in complex sentence structures, producing a syntactically simplified source for further processing.
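    As a rough illustration of the clause-separation idea, the sketch below (not the proposed system) uses NLTK tokenization and part-of-speech tagging as a stand-in for shallow parsing and starts a new segment at common clause markers such as relative pronouns and subordinating conjunctions. The marker list, the tag checks, and the NLTK resource names are assumptions and may differ across NLTK versions.

```python
# Toy clause-splitting sketch (not the proposed system): POS-tag the sentence,
# then start a new segment whenever a wh-word or a listed conjunction is seen.
import nltk

nltk.download("punkt", quiet=True)                       # resource names may vary
nltk.download("averaged_perceptron_tagger", quiet=True)  # across NLTK versions

SUBORDINATORS = {"because", "although", "while", "since", "whereas", "and", "but"}
WH_TAGS = {"WDT", "WP", "WP$", "WRB"}  # which, who, whose, when, ...

def split_clauses(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    segments, current = [], []
    for word, tag in tagged:
        if current and (tag in WH_TAGS or word.lower() in SUBORDINATORS):
            segments.append(" ".join(current))
            current = []
            continue  # drop the clause marker itself
        if word not in {",", "."}:
            current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments

print(split_clauses(
    "The system, which uses shallow parsing, separates embedded clauses "
    "because complex sentences are difficult to process."))
```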

    Enhanced lexicon based models for extracting question-answer pairs from web forum

    Get PDF
    A Web forum is an online community that brings together people in different geographical locations. Members of the forum exchange ideas and expertise, and as a result a huge amount of content on different topics is generated on a daily basis. This human-generated content can be mined for question-answer pairs (Q&A). One of the major challenges in mining Q&A from a web forum is establishing a good relationship between a question and its candidate answers, a problem compounded by the noisy nature of a forum's human-generated content. Unfortunately, existing methods for mining knowledge from web forums ignore the effect of noise on the mining tools, making the lexical content less effective. This study proposes lexicon-based models that automatically mine question-answer pairs from web forums with higher accuracy scores. The first phase of the research produces a question mining model, implemented using features generated from unigrams, bigrams, forum metadata, and simple rules. These features were screened using both chi-square and wrapper techniques, and the wrapper-generated features were used by a Multinomial Naïve Bayes classifier to build the final model. The second phase produces a normalized lexical model for answer mining, implemented using 13 lexical features that cut across four quality dimensions. The performance of these features was enhanced by noise normalization, a process that fixes orthographic, phonetic, and acronym noise. The third phase produces a hybridized model of lexical and non-lexical features. The average performances of the question mining model, the normalized lexical model, and the hybridized model for answer mining were 90.3%, 97.5%, and 99.5% respectively on the three data sets used, outperforming all previous works in the domain. The first major contribution of the study is an improved question mining model characterized by higher accuracy, better specificity, lower complexity, and consistent accuracy across different forum genres. The second contribution is a normalized lexical model capable of establishing a good relationship between a question and its corresponding answer. The third contribution is a hybridized model that integrates lexical features, which guarantee relevance, with non-lexical features, which guarantee quality, to mine web forum answers. The fourth contribution is a novel integration of the question and answer mining models to automatically generate question-answer pairs from web forums.
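    As a concrete, hedged illustration of the first-phase pipeline (unigram and bigram features, chi-square screening, and Multinomial Naïve Bayes), the scikit-learn sketch below strings these steps together. The tiny inline posts, labels, and the k value are placeholders rather than the thesis's data or tuning, and the forum-metadata features, simple rules, and wrapper selection step are omitted for brevity.

```python
# Sketch of a question-mining pipeline: n-gram counts -> chi-square
# feature screening -> Multinomial Naive Bayes classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

posts = [
    "How do I reset my router password?",      # question
    "Just follow the steps in the manual.",    # non-question
    "Does anyone know why the build fails?",   # question
    "Thanks, that fixed it for me.",           # non-question
]
labels = [1, 0, 1, 0]  # 1 = question post, 0 = other

pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),    # unigram + bigram features
    ("chi2", SelectKBest(chi2, k=5)),                    # chi-square screening
    ("clf", MultinomialNB()),
])
pipeline.fit(posts, labels)
print(pipeline.predict(["Can someone explain this error message?"]))
```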

    Geographic information extraction from texts

    Get PDF
    A large volume of unstructured text containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop will therefore provide a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.

    A generic framework for context-dependent fusion with application to landmine detection.

    Get PDF
    For complex detection and classification problems involving data with large intra-class variations and noisy inputs, no single source of information can provide a satisfactory solution. As a result, combining multiple classifiers is playing an increasing role in solving these complex pattern recognition problems and has proven to be a viable alternative to using a single classifier. Over the past few years, a variety of schemes have been proposed for combining multiple classifiers. Most of these are global, assigning each classifier a degree of worthiness that is averaged over the entire training data. This may not be the optimal way to combine the different experts, since the behavior of each one may not be uniform over the different regions of the feature space. To overcome this issue, a few local methods have been proposed in recent years. Local fusion methods aim to adapt the classifiers' worthiness to different regions of the feature space. First, they partition the input samples. Then, they identify the best classifier for each partition and designate it as the expert for that partition. Unfortunately, current local methods are computationally expensive and/or perform these two tasks independently of each other. However, feature space partitioning and algorithm selection are not independent, and their optimization should be simultaneous. In this dissertation, we introduce a new local fusion approach called Context Extraction for Local Fusion (CELF). CELF was designed to adapt the fusion to different regions of the feature space, taking advantage of the strengths of the different experts while overcoming their limitations. First, we describe the baseline CELF algorithm. We formulate a novel objective function that combines context identification and multi-algorithm fusion criteria: the context identification component strives to partition the input feature space into different clusters (called contexts), while the fusion component strives to learn the optimal fusion parameters within each cluster. Second, we propose several variations of CELF to deal with different application scenarios. In particular, we propose an extension that includes a feature discrimination component (CELF-FD). This version is advantageous when dealing with high-dimensional feature spaces and/or when the number of features extracted by the individual algorithms varies significantly. CELF-CA is another extension of CELF that adds a regularization term to the objective function to introduce competition among the clusters and to find the optimal number of clusters in an unsupervised way. CELF-CA starts by partitioning the data into a large number of small clusters; as the algorithm progresses, adjacent clusters compete for data points, and clusters that lose the competition gradually become depleted and vanish. Third, we propose CELF-M, which generalizes CELF to multi-class data sets. The baseline CELF and its extensions were formulated to use linear aggregation to combine the outputs of the different algorithms within each context. For some applications this can be too restrictive, and non-linear fusion may be needed. To address this potential drawback, we propose two other variations of CELF that use non-linear aggregation: the first is based on Neural Networks (CELF-NN) and the second on Fuzzy Integrals (CELF-FI). The latter has the desirable property of assigning weights to subsets of classifiers to take into account the interactions among them. To test a new signature using CELF (or its variants), each algorithm extracts its set of features and assigns a confidence value. The features are then used to identify the best context, and the fusion parameters of that context are used to fuse the individual confidence values. For each variation of CELF, we formulate an objective function, derive the necessary conditions to optimize it, and construct an iterative algorithm. We then use examples to illustrate the behavior of the algorithm, compare it to global fusion, and highlight its advantages. We apply our proposed fusion methods to the problem of landmine detection, using data collected with Ground Penetrating Radar (GPR) and Wideband Electro-Magnetic Induction (WEMI) sensors. We show that CELF (and its variants) can identify meaningful and coherent contexts (e.g., mines of the same type, mines buried at the same site) and that different expert algorithms can be identified for the different contexts. In addition to the landmine detection application, we apply our approaches to semantic video indexing, image database categorization, and phoneme recognition. In all applications, we compare the performance of CELF with standard fusion methods and show that our approach outperforms all of them.
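    The toy sketch below conveys the flavor of context-dependent fusion, though it is not CELF's joint objective: it alternates between assigning samples to contexts by nearest centroid and fitting per-context linear fusion weights over the individual classifiers' confidence values by least squares. The synthetic data, array shapes, and two-context setup are assumptions for illustration only.

```python
# Toy context-dependent fusion: alternate context assignment (nearest centroid)
# with per-context least-squares fusion weights over classifier confidences.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # features used to identify contexts
conf = rng.uniform(size=(200, 3))    # confidence values from 3 classifiers
y = conf @ np.array([0.7, 0.2, 0.1]) + 0.1 * rng.normal(size=200)  # targets

K = 2
centroids = X[rng.choice(len(X), K, replace=False)]
weights = np.zeros((K, conf.shape[1]))

for _ in range(20):
    # Context identification: assign each sample to its nearest centroid.
    assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for k in range(K):
        members = assign == k
        if members.any():
            centroids[k] = X[members].mean(axis=0)
            # Per-context fusion: least-squares weights over confidences.
            weights[k], *_ = np.linalg.lstsq(conf[members], y[members], rcond=None)

fused = np.einsum("ij,ij->i", conf, weights[assign])  # fused confidence per sample
print("MSE:", np.mean((fused - y) ** 2))
```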

    Leveraging Semantic Annotations for Event-focused Search & Summarization

    Get PDF
    Today, in the Big Data era, overwhelming amounts of textual information spread across different sources with a high degree of redundancy have made it hard for a consumer to retrospect on past events. A plausible solution is to link semantically similar information contained across the different sources to enforce a structure, thereby providing multiple access paths to relevant information. Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:
    • We address a linking problem to connect Wikipedia excerpts to news articles by casting it as an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt.
    • We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of the information contained within a large set of documents. Our novel approach proposes an ILP for global inference across text, time, geolocations, and entities associated with the event (a simplified sketch follows this list).
    • To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.
    Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
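    To make the second bullet's ILP concrete, here is a deliberately reduced, knapsack-style sketch in PuLP: select sentences that maximize a relevance score subject to a length budget. The actual event-digest ILP described above additionally reasons over time, geolocations, and entities and handles redundancy; the sentence IDs, scores, lengths, and budget below are invented placeholders.

```python
# Reduced extractive-summarization ILP: pick sentences maximizing relevance
# under a total length budget (binary selection variables).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

sentences = ["s1", "s2", "s3", "s4"]
relevance = {"s1": 0.9, "s2": 0.4, "s3": 0.7, "s4": 0.2}   # per-sentence utility
length =    {"s1": 20,  "s2": 12,  "s3": 25,  "s4": 8}     # token counts
budget = 40                                                 # digest length limit

x = {s: LpVariable(f"pick_{s}", cat=LpBinary) for s in sentences}
prob = LpProblem("event_digest", LpMaximize)
prob += lpSum(relevance[s] * x[s] for s in sentences)         # objective
prob += lpSum(length[s] * x[s] for s in sentences) <= budget  # length constraint
prob.solve()
print([s for s in sentences if value(x[s]) > 0.5])
```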

    Eight Biennial Report: April 2005 – March 2007

    No full text