12 research outputs found

    NLP-based Metadata Extraction for Legal Text Consolidation

    Get PDF
    The paper describes a system for the automatic consolidation of Italian legislative texts to be used as a support of an editorial consolidating activity and dealing with the following typology of textual amendments: repeal, substitution and integration. The focus of the paper is on the semantic analysis of the textual amendment provisions and the formalized representation of the amendments in terms of metadata. The proposed approach to consolidation is metadata- oriented and based on Natural Language Processing (NLP) techniques: we use XML-based standards for metadata annotation of legislative acts and a flexible NLP architecture for extracting metadata from parsed texts. An evaluation of achieved results is also provided

    Topical Opinion Retrieval

    Get PDF
    With a growing amount of subjective content distributed across the Web, there is a need for a domain-independent information retrieval system that would support ad hoc retrieval of documents expressing opinions on a specific topic of the user’s query. While the research area of opinion detection and sentiment analysis has received much attention in the recent years, little research has been done on identifying subjective content targeted at a specific topic, i.e. expressing topical opinion. This thesis presents a novel method for ad hoc retrieval of documents which contain subjective content on the topic of the query. Documents are ranked by the likelihood each document expresses an opinion on a query term, approximated as the likelihood any occurrence of the query term is modified by a subjective adjective. Domain-independent user-based evaluation of the proposed methods was conducted, and shows statistically significant gains over Google ranking as the baseline

    Toward Concept-Based Text Understanding and Mining

    Get PDF
    There is a huge amount of text information in the world, written in natural languages. Most of the text information is hard to access compared with other well-structured information sources such as relational databases. This is because reading and understanding text requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One possible solution to these problems is to implement a framework of concept-based text understanding and mining, that is, a mechanism of analyzing and integrating segregated information, and a framework of organizing, indexing, accessing textual information centered around real-world concepts. A fundamental difficulty toward this goal is caused by the concept ambiguity of natural language. In text, the real-world entities are referred using their names. The variability in writing a given concept, along with the fact that different concepts/enities may have very similar writings, poses a significant challenge to progress in text understanding and mining. Supporting concept-based natural language understanding requires resolving conceptual ambiguity, and in particular, identifying whether different mentions of real world entities, within and across documents, actually represent the same concept. This thesis systematically studies this fundamental problem. We study and propose different machine learning techniques to address different aspects of this problem and show that as more information can be exploited, the learning techniques developed accordingly, can continuously improve the identification accuracy. In addition, we extend our global probabilistic model to address a significant application -- semantic integration between text and databases

    Larger-first partial parsing

    Get PDF
    Larger-first partial parsing is a primarily top-down approach to partial parsing that is opposite to current easy-first, or primarily bottom-up, strategies. A rich partial tree structure is captured by an algorithm that assigns a hierarchy of structural tags to each of the input tokens in a sentence. Part-of-speech tags are first assigned to the words in a sentence by a part-of-speech tagger. A cascade of Deterministic Finite State Automata then uses this part-of-speech information to identify syntactic relations primarily in a descending order of their size. The cascade is divided into four specialized sections: (1) a Comma Network, which identifies syntactic relations associated with commas; (2) a Conjunction Network, which partially disambiguates phrasal conjunctions and llly disambiguates clausal conjunctions; (3) a Clause Network, which identifies non-comma-delimited clauses; and (4) a Phrase Network, which identifies the remaining base phrases in the sentence. Each automaton is capable of adding one or more levels of structural tags to the tokens in a sentence. The larger-first approach is compared against a well-known easy-first approach. The results indicate that this larger-first approach is capable of (1) producing a more detailed partial parse than an easy first approach; (2) providing better containment of attachment ambiguity; (3) handling overlapping syntactic relations; and (4) achieving a higher accuracy than the easy-first approach. The automata of each network were developed by an empirical analysis of several sources and are presented here in detail

    The Use Of Kullback-Leibler Divergence In Opinion Retrieval

    Get PDF
    With the huge amount of subjective contents in on-line documents, there is a clear need for an information retrieval system that supports retrieval of documents containing opinions about the topic expressed in a user’s query. In recent years, blogs, a new publishing medium, have attracted a large number of people to express personal opinions covering all kinds of topics in response to the real-world events. The opinionated nature of blogs makes them a new interesting research area for opinion retrieval. Identification and extraction of subjective contents from blogs has become the subject of several research projects. In this thesis, four novel methods are proposed to retrieve blog posts that express opinions about the given topics. The first method utilizes the Kullback-Leibler divergence (KLD) to weight the lexicon of subjective adjectives around query terms. Considering the distances between the query terms and subjective adjectives, the second method uses KLD scores of subjective adjectives based on distances from the query terms for document re-ranking. The third method calculates KLD scores of subjective adjectives for predefined query categories. In the fourth method, collocates, words co-occurring with query terms in the corpus, are used to construct the subjective lexicon automatically. The KLD scores of collocates are then calculated and used for document ranking. Four groups of experiments are conducted to evaluate the proposed methods on the TREC test collections. The results of the experiments are compared with the baseline systems to determine the effectiveness of using KLD in opinion retrieval. Further studies are recommended to explore more sophisticated approaches to identify subjectivity and promising techniques to extract opinions

    Memory-Based Grammatical Relation Finding

    Get PDF

    Question answering using document tagging and question classification

    Get PDF
    viii, 139 leaves ; 29 cm.Question answering (QA) is a relatively new area of research. QA is retriecing answers to questions rather than information retrival systems (search engines), which retrieve documents. This means that question answering systems will possibly be the next generation of search engines. What is left to be done to allow QA to be the next generation of search engines? The answer is higher accuracy, which can be achieved by investigating methods of questions answering. I took the approach of designing a question answering system that is based on document tagging and question classification. Question classification extracts useful information from the question about how to answer the question. Document tagging extracts useful information from the documents, which will be used in finding the answer to the question. We used different available systems to tage the documents. Our system classifies the questions using manually developed rules. I also investigated different ways which can use both these methods to answer questions and found that our methods had a comparable accuracy to some systems that use deeper processing techniques. This thesis includes investigations into modules of a question answering system and gives insights into how to go about developing a question answering system based on document tagging and question classification. I also evaluated our current system with the questions from the TREC 2004 question answering track

    Gesture in Automatic Discourse Processing

    Get PDF
    Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning.My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that was prone to a lack of generality across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract.These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features -- extracted automatically from video -- yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing

    Answer extraction for simple and complex questions

    Get PDF
    xi, 214 leaves : ill. (some col.) ; 29 cm. --When a user is served with a ranked list of relevant documents by the standard document search engines, his search task is usually not over. He has to go through the entire document contents to find the precise piece of information he was looking for. Question answering, which is the retrieving of answers to natural language questions from a document collection, tries to remove the onus on the end-user by providing direct access to relevant information. This thesis is concerned with open-domain question answering. We have considered both simple and complex questions. Simple questions (i.e. factoid and list) are easier to answer than questions that have complex information needs and require inferencing and synthesizing information from multiple documents. Our question answering system for simple questions is based on question classification and document tagging. Question classification extracts useful information (i.e. answer type) about how to answer the question and document tagging extracts useful information from the documents, which is used in finding the answer to the question. For complex questions, we experimented with both empirical and machine learning approaches. We extracted several features of different types (i.e. lexical, lexical semantic, syntactic and semantic) for each of the sentences in the document collection in order to measure its relevancy to the user query. One hill climbing local search strategy is used to fine-tune the feature-weights. We also experimented with two unsupervised machine learning techniques: k-means and Expectation Maximization (EM) algorithms and evaluated their performance. For all these methods, we have shown the effects of different kinds of features
    corecore