55 research outputs found

    Normalized Alignment of Dependency Trees for Detecting Textual Entailment

    In this paper, we investigate the usefulness of normalized alignment of dependency trees for entailment prediction. Overall, our approach yields an accuracy of 60% on the RTE2 test set, which is a significant improvement over the baseline. Results vary substantially across the different subsets, with peak performance on the summarization data. We conclude that normalized alignment is useful for detecting textual entailment, but a robust approach will probably need to include additional sources of information.
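
    As an illustration of the idea (not the paper's exact scoring), the following Python sketch treats each dependency tree as a set of (head, relation, dependent) edges, normalizes the number of aligned edges by the size of the hypothesis tree, and predicts entailment when the score exceeds a tuned threshold; the edge representation and threshold value are assumptions.

```python
# Hedged sketch: dependency trees are approximated as sets of
# (head, relation, dependent) edges; the real system's alignment and
# normalization are more involved.

def normalized_alignment_score(text_edges, hypothesis_edges):
    """Fraction of hypothesis dependency edges that also occur in the text."""
    if not hypothesis_edges:
        return 0.0
    matched = sum(1 for edge in hypothesis_edges if edge in text_edges)
    return matched / len(hypothesis_edges)

def predict_entailment(text_edges, hypothesis_edges, threshold=0.5):
    """Predict entailment when the normalized alignment exceeds a tuned threshold."""
    return normalized_alignment_score(text_edges, hypothesis_edges) >= threshold

# Toy example: text "John bought a car", hypothesis "John bought a vehicle".
text = {("bought", "nsubj", "John"), ("bought", "obj", "car"), ("car", "det", "a")}
hypothesis = {("bought", "nsubj", "John"), ("bought", "obj", "vehicle")}
print(normalized_alignment_score(text, hypothesis))  # 0.5: one of two hypothesis edges aligns
print(predict_entailment(text, hypothesis))          # True at the default threshold
```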

    An Ontology-based Text Mining Method to Develop the D-Matrix

    In this paper, we demonstrate an ontology-based text mining method for developing and updating a D-matrix by automatically extracting information from a large number of repair verbatims (written in unstructured text) collected during diagnosis episodes. A fault dependency (D) matrix is a systematic diagnostic model used to capture symptomatic data at the hierarchical system level, including the dependencies between observable symptoms and the failure modes associated with a system. Constructing a D-matrix is a lengthy process: developing a D-matrix from first principles and updating it with domain knowledge is labour-intensive work, and augmenting a D-matrix over time with newly observed symptoms and failure modes encountered in the field is a difficult task. In this methodology, we first develop a fault diagnosis ontology that includes the concepts and relationships commonly seen in the fault diagnosis field. We then apply a text mining algorithm that makes use of this ontology to identify basic items, such as parts, symptoms, failure modes, and conditions, from the unstructured repair verbatim text. The proposed technique is implemented as a prototype tool and validated using real-life data collected from the automobile domain.
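
    The following sketch is only a simplified illustration of the ontology-driven extraction step: a toy fault-diagnosis ontology is kept as term lists per concept (parts, symptoms, failure modes), and a repair verbatim is tagged by looking up those terms; the term lists and the matching strategy are assumptions, not the algorithm from the paper.

```python
# Hedged sketch: a tiny "ontology" as term lists per concept, used to tag the
# basic items mentioned in a repair verbatim by simple substring lookup.
FAULT_ONTOLOGY = {
    "part":         ["battery", "starter motor", "alternator", "fuel pump"],
    "symptom":      ["no start", "slow crank", "dim lights", "clicking noise"],
    "failure_mode": ["corroded", "worn", "leaking", "open circuit"],
}

def tag_verbatim(verbatim, ontology=FAULT_ONTOLOGY):
    """Return the ontology concepts whose terms occur in the verbatim text."""
    text = verbatim.lower()
    hits = {}
    for concept, terms in ontology.items():
        found = [term for term in terms if term in text]
        if found:
            hits[concept] = found
    return hits

verbatim = "Customer states no start; found battery corroded, slow crank observed."
print(tag_verbatim(verbatim))
# {'part': ['battery'], 'symptom': ['no start', 'slow crank'], 'failure_mode': ['corroded']}
```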

    Period disambiguation with MaxEnt model

    This paper presents our recent work on period disambiguation, the kernel problem in sentence boundary identification, with the maximum entropy (MaxEnt) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how the context window, the feature space, and lexical information such as abbreviated and sentence-initial words affect learning performance. Such lexical information can be automatically acquired from a training corpus by a learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information eliminates 93.52% of the remaining errors of the baseline MaxEnt model, achieving an F-score of 99.8227%.
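
    A minimal sketch of the setup, using scikit-learn's LogisticRegression as a stand-in for a MaxEnt classifier; the context-window features and the abbreviation and sentence-initial word lists below are illustrative assumptions rather than the paper's exact feature space (in the paper those lists are acquired automatically from the training corpus).

```python
# Hedged sketch: logistic regression stands in for the MaxEnt model; each period
# is classified as sentence boundary (1) or not (0) from a small context window.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy word lists; in the paper these are learned from the training corpus.
ABBREVIATIONS = {"dr.", "inc.", "mr.", "u.s."}
SENTENCE_INITIAL = {"the", "he", "she", "it", "in"}

def period_features(tokens, i):
    """Features for the period ending tokens[i], using a one-token window on each side."""
    word = tokens[i].lower()                                   # token that carries the period
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "word": word,
        "next": next_tok.lower(),
        "word_is_abbrev": int(word in ABBREVIATIONS),
        "next_is_capitalized": int(next_tok[:1].isupper()),
        "next_is_sentence_initial": int(next_tok.lower() in SENTENCE_INITIAL),
    }

# Toy training data: (tokens, index of the period-bearing token, 1 = boundary / 0 = not).
examples = [
    (["Prices", "rose", "sharply."], 2, 1),           # period ends the sentence
    (["He", "met", "Dr.", "Smith", "today."], 2, 0),  # abbreviation, not a boundary
]
vec = DictVectorizer()
X = vec.fit_transform([period_features(toks, i) for toks, i, _ in examples])
y = [label for _, _, label in examples]
clf = LogisticRegression(max_iter=1000).fit(X, y)

test = (["She", "works", "at", "Acme", "Inc.", "in", "Boston"], 4)
print(clf.predict(vec.transform([period_features(*test)])))   # toy model, trained on two examples
```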

    Development of Graph from D-matrix based on Ontological Text Mining Method

    A fault dependency matrix (D-matrix) is a diagnostic model that captures a system's fault data and its causal relationships at the hierarchical system level. It consists of the dependencies and relationships between observable symptoms and the failure modes associated with a system. Constructing such a D-matrix fault detection model is a time-consuming task. In this paper, a system is proposed that describes an ontology-based text mining method for automatically constructing a D-matrix by mining hundreds of repair verbatim texts (typically written in unstructured text) collected during diagnosis episodes. First we construct a fault diagnosis ontology, and then text mining techniques are applied to identify dependencies among failure modes and observable symptoms. The D-matrix is represented as a graph so that analysis becomes easier and faulty parts become easily detectable. The proposed method will be implemented as a prototype tool and validated using real-life data collected from the automobile domain. DOI: 10.17762/ijritcc2321-8169.15055
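
    A minimal sketch of how a mined D-matrix might be held and queried as a graph (the data structures and part names are assumptions, not the proposed tool): failure modes map to the symptoms they can produce, the mapping is flattened into graph edges, and candidate faulty parts are ranked by how many observed symptoms they explain.

```python
# Hedged sketch: the D-matrix as a mapping from failure modes to the symptoms
# they can produce, viewed as a bipartite graph for simple diagnosis queries.
d_matrix = {
    "battery_degraded":    {"slow_crank", "dim_lights"},
    "starter_motor_fault": {"slow_crank", "clicking_noise"},
    "alternator_fault":    {"dim_lights", "battery_warning_lamp"},
}

def as_edge_list(matrix):
    """Flatten the D-matrix into (failure_mode, symptom) graph edges."""
    return [(fm, s) for fm, symptoms in matrix.items() for s in symptoms]

def candidate_failure_modes(matrix, observed_symptoms):
    """Rank failure modes by how many of the observed symptoms they explain."""
    observed = set(observed_symptoms)
    scores = {fm: len(symptoms & observed) for fm, symptoms in matrix.items()}
    return sorted((fm for fm, n in scores.items() if n > 0),
                  key=lambda fm: scores[fm], reverse=True)

print(as_edge_list(d_matrix))
print(candidate_failure_modes(d_matrix, ["slow_crank", "dim_lights"]))
# "battery_degraded" is ranked first because it explains both observed symptoms.
```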

    Character-based Deep Learning Models for Token and Sentence Segmentation

    In this work we address the problems of sentence segmentation and tokenization. Informally, the task of sentence segmentation involves splitting a given text into units that satisfy a certain definition (or a number of definitions) of a sentence. Similarly, tokenization has as its goal splitting a text into chunks that constitute the basic units of operation for a given task, e.g. words, digits, punctuation marks, and other symbols for part-of-speech tagging. As seen from the definition, tokenization is an absolute prerequisite for virtually every natural language processing (NLP) task. Many so-called downstream NLP applications with a higher level of sophistication, e.g. machine translation, additionally require sentence segmentation. Thus both of the problems that we address are very basic steps in NLP and, as such, are widely regarded as solved problems. Indeed, there is a large body of work devoted to these problems, and there are a number of popular, highly accurate off-the-shelf solutions for them. Nevertheless, the problems of sentence segmentation and tokenization persist, and in practice one often faces certain difficulties whenever confronted with raw text that needs to be tokenized and/or split into sentences. This happens because existing approaches, if they are unsupervised, rely heavily on hand-crafted rules and lexicons, or, if they are supervised, rely on the extraction of hand-engineered features. Such systems are not easy to maintain and adapt to new domains and languages, because doing so may require revising the rules and feature definitions. In order to address the aforementioned challenges, we develop character-based deep learning models which require neither rule nor feature engineering. The only resource required is a training set in which each character is labeled with an IOB (Inside Outside Beginning) style tag. Such training sets are easily obtainable from existing tokenized and sentence-segmented corpora, or, in the absence of those, have to be created (but the same is true for rules, lexicons, and hand-crafted features). The IOB-like annotation allows us to solve both the tokenization and sentence segmentation problems simultaneously, casting them as a single sequence-labeling task where each character has to be tagged with one of four tags: beginning of a sentence (S), beginning of a token (T), inside of a token (I), and outside of a token (O). To this end we design three models based on artificial neural networks: (i) a fully connected feed-forward network; (ii) a long short-term memory (LSTM) network; (iii) a bi-directional version of the LSTM. The proposed models utilize character embeddings, i.e. they represent characters as vectors in a multidimensional continuous space. We evaluate our approach on three typologically distant languages, namely English, Italian, and Kazakh. In terms of evaluation metrics we use the standard precision, recall, and F-measure scores, as well as a combined error rate for sentence and token boundary detection. We use two state-of-the-art supervised systems as baselines, and show that our models consistently outperform both of them in terms of error rate.
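
    For illustration, a character-level bi-directional LSTM tagger along the lines described above can be sketched in a few lines of PyTorch; the hyperparameters, the toy vocabulary, and the untrained example below are assumptions, not the authors' implementation or evaluation setup.

```python
# Hedged sketch: a character-level BiLSTM that emits one of the four tags
# (S, T, I, O) per character; untrained, for shape/flow illustration only.
import torch
import torch.nn as nn

TAGS = ["S", "T", "I", "O"]

class CharBiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64, num_tags=len(TAGS)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # character embeddings
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                    # bi-directional LSTM
        self.out = nn.Linear(2 * hidden_dim, num_tags)             # per-character tag scores

    def forward(self, char_ids):
        emb = self.embed(char_ids)          # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)          # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)             # (batch, seq_len, num_tags)

# Toy usage: tag the characters of one string with an (untrained) model.
text = "Hello world."
char2id = {c: i + 1 for i, c in enumerate(sorted(set(text)))}      # 0 reserved for padding
ids = torch.tensor([[char2id[c] for c in text]])
model = CharBiLSTMTagger(vocab_size=len(char2id) + 1)
logits = model(ids)                                                # (1, len(text), 4)
predicted = [TAGS[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(text, predicted)))
```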

    Break Down Resumes into Sections to Extract Data and Perform Text Analysis using Python

    The objective of AI-based resume screening is to automate the screening process, and text extraction, keyword extraction, and named entity recognition are critical to it. This paper discusses segmenting resumes in order to extract data and perform text analysis. The raw CV file is imported, and the resume data is cleaned to remove extra spaces, punctuation, and stop words. Regular expressions are used to extract names from resumes. We also use the spaCy library, which is regarded as one of the most accurate natural language processing libraries; it includes pre-trained models for entity recognition, parsing, and tagging. The experimental method uses resume data sourced from Kaggle and an external source (MTIS).
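
    A minimal sketch of such a pipeline, assuming spaCy's small English model (en_core_web_sm) is installed; the cleaning rules, the name regular expression, and the sample resume text are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: clean raw resume text, pull a candidate name with a regex,
# and run spaCy NER over the cleaned text.
import re
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed to be installed

def clean_resume_text(text):
    """Drop unusual characters and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s.,:;@+-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def extract_candidate_name(text):
    """Very rough heuristic: two capitalized words at the start of the resume."""
    match = re.match(r"([A-Z][a-z]+)\s+([A-Z][a-z]+)", text)
    return match.group(0) if match else None

def extract_entities(text):
    """Run spaCy NER over the cleaned resume and group entities by label."""
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    return entities

raw = "John Doe\nSoftware Engineer at Acme Corp, Bangalore.\nSkills: Python, SQL"
cleaned = clean_resume_text(raw)
print(extract_candidate_name(cleaned))
print(extract_entities(cleaned))
```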

    An environment for relation mining over richly annotated corpora: the case of GENIA

    BACKGROUND: The biomedical domain is witnessing a rapid growth in the amount of published scientific results, which makes it increasingly difficult to filter out the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. RESULTS: We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns expressing a large set of syntactic alternations, plus semantic ontology information. CONCLUSION: The experiments show that the approach described is capable of delivering high-precision results while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.
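
    As a rough, hedged illustration of pattern-based relation extraction over dependency parses, the sketch below uses spaCy's DependencyMatcher as a stand-in for the richer, manually created patterns described above; the verb list, the model name (en_core_web_sm), and the example sentence are assumptions, and the actual system is considerably more expressive.

```python
# Hedged sketch: one dependency pattern (subject -> interaction verb -> object)
# matched over a parsed sentence; the paper's patterns and parser differ.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Pattern: an interaction verb with a nominal subject and a direct object.
pattern = [
    {"RIGHT_ID": "verb",
     "RIGHT_ATTRS": {"LEMMA": {"IN": ["activate", "bind", "inhibit", "phosphorylate"]}}},
    {"LEFT_ID": "verb", "REL_OP": ">",
     "RIGHT_ID": "agent", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">",
     "RIGHT_ID": "target", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("PROTEIN_INTERACTION", [pattern])

doc = nlp("ProteinA activates ProteinB in human cells.")
for _, (verb_i, agent_i, target_i) in matcher(doc):
    print(doc[agent_i].text, doc[verb_i].lemma_, doc[target_i].text)
# Expected to print something like: ProteinA activate ProteinB (exact result depends on the parse).
```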