Combining Strategies for Extracting Relations from Text Collections
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. Our Snowball system extracts these relations from document collections starting with only a handful of user-provided example tuples. Based on these tuples, Snowball generates patterns that are used, in turn, to find more tuples. In this paper we introduce a new pattern and tuple generation scheme for Snowball, whose strengths and weaknesses differ from those of our original system. We also show preliminary results on how we can combine the two versions of Snowball to extract tuples more accurately.
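The bootstrapping loop the abstract describes can be sketched as follows. This is a minimal illustration, not the actual Snowball implementation: the toy corpus, the raw-substring patterns, and the single-word entity extraction are all simplifying assumptions (the real system uses named-entity tags, weighted term-vector patterns, and confidence scoring).

```python
# Toy corpus of sentences containing (organization, headquarters) facts.
CORPUS = [
    "Microsoft is headquartered in Redmond .",
    "Apple is headquartered in Cupertino .",
    "Boeing , based in Chicago , announced plans .",
    "IBM , based in Armonk , said .",
]

def find_patterns(corpus, seeds):
    """Extract the text between the two entities of each seed tuple."""
    patterns = set()
    for org, loc in seeds:
        for sentence in corpus:
            if org in sentence and loc in sentence:
                start = sentence.index(org) + len(org)
                end = sentence.index(loc)
                if start < end:
                    patterns.add(sentence[start:end])
    return patterns

def find_tuples(corpus, patterns):
    """Apply each pattern to extract new (organization, location) pairs."""
    tuples = set()
    for sentence in corpus:
        for pat in patterns:
            if pat in sentence:
                left, _, right = sentence.partition(pat)
                org = left.strip().split()[-1] if left.strip() else None
                loc = right.strip().split()[0] if right.strip() else None
                if org and loc:
                    tuples.add((org, loc))
    return tuples

# One bootstrapping round: seeds -> patterns -> new tuples.
seeds = {("Microsoft", "Redmond")}
patterns = find_patterns(CORPUS, seeds)
tuples = find_tuples(CORPUS, patterns)
```

A full system would iterate this loop, feeding the newly extracted tuples back in as seeds, and score patterns and tuples to keep errors from compounding across rounds.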
On-Line Error Detection of Annotated Corpus Using Modular Neural Networks
This paper proposes an on-line error detecting method for a manually annotated corpus using min-max modular (M3) neural networks. The basic idea of the method is to use guaranteed convergence of the M3 network to detect errors in learning data. To confirm the effectiveness of the method, a preliminary computer experiment was performed on a small Japanese corpus containing 217 sentences. The results show that the method can not only detect errors within a corpus, but may also discover some kinds of knowledge or rules useful for natural language processing.
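The core idea, that a module which cannot converge on its small learning subset signals conflicting (likely erroneous) annotations, can be illustrated with a deliberately simplified sketch. The "learner" below is a stand-in assumption: it declares a subset unlearnable exactly when identical inputs carry different labels, which is the simplest case where convergence is impossible.

```python
def is_learnable(examples):
    """A subset is learnable iff no two identical inputs have different labels."""
    seen = {}
    for features, label in examples:
        if features in seen and seen[features] != label:
            return False
        seen[features] = label
    return True

def detect_suspects(dataset, subset_size=2):
    """Split the data into small modules; flag every example in a module
    that fails to converge (i.e., is not learnable)."""
    suspects = []
    for i in range(0, len(dataset), subset_size):
        subset = dataset[i:i + subset_size]
        if not is_learnable(subset):
            suspects.extend(subset)
    return suspects
```

The actual M3 method trains a neural module per subset and watches its training behavior; this sketch only captures the decomposition-and-flagging structure of the approach.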
Measuring the Interestingness of News Articles
An explosive growth of online news has taken place. Users are inundated with thousands of news articles, only some of which are interesting. A system to filter out uninteresting articles would aid users who need to read and analyze many articles daily, such as financial analysts and government officials. The most obvious approach for reducing the amount of information overload is to learn keywords of interest for a user (Carreira et al., 2004). Although filtering articles based on keywords removes many irrelevant articles, there are still many uninteresting articles that are highly relevant to keyword searches. A relevant article may not be interesting for various reasons, such as the article's age or if it discusses an event that the user has already read about in other articles. Although it has been shown that collaborative filtering can aid in personalized recommendation systems (Wang et al., 2006), a large number of users is needed. In a limited user environment, such as a small group of analysts monitoring news events, collaborative filtering would be ineffective. The definition of what makes an article interesting--or its 'interestingness'--varies from user to user and is continually evolving, calling for adaptable user personalization. Furthermore, due to the nature of news, most articles are uninteresting since many are similar or report events outside the scope of an individual's concerns. There has been much work in news recommendation systems, but none has yet addressed the question of what makes an article interesting.
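The "most obvious approach" the abstract mentions, keyword-of-interest filtering, can be sketched in a few lines. The scoring rule and threshold below are illustrative assumptions, not the method of the cited work:

```python
def keyword_score(article_text, keywords):
    """Fraction of the user's keywords that appear in the article."""
    words = set(article_text.lower().split())
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords) if keywords else 0.0

def filter_articles(articles, keywords, threshold=0.5):
    """Keep only articles whose keyword score reaches the threshold."""
    return [a for a in articles if keyword_score(a, keywords) >= threshold]
```

As the abstract points out, this only measures relevance: an article can score perfectly on the user's keywords and still be uninteresting because it is stale or duplicates an event already read about, which is precisely the gap interestingness modeling aims to close.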
Adapting noise filters for ranking
Noise filtering can be considered an important preprocessing step in the data mining process, making data more reliable for pattern extraction. An interesting aspect for increasing data understanding would be to rank the potentially noisy cases, in order to highlight the most unreliable instances for further examination. Since the majority of the filters in the literature were designed only for hard classification, distinguishing whether an example is noisy or not, in this paper we adapt the output of some state-of-the-art noise filters to rank the cases identified as suspicious. We also present new evaluation measures for the designed noise rankers, which take into account the ordering of the detected noisy cases.
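One common way to turn hard (binary) noise filters into a ranker, broadly in the spirit of the ensemble filters the abstract refers to, is to let each filter vote and use the fraction of filters flagging an instance as its noisiness score. This is a hedged sketch of that idea; the predicate-style filter interface is an assumption for illustration:

```python
def rank_noisy(instances, filters):
    """Rank instances from most to least suspicious.

    `filters` is a list of predicates; each returns True when it judges
    the instance noisy (a hard classification). The score for an instance
    is the fraction of filters that flag it."""
    scored = []
    for inst in instances:
        votes = sum(1 for f in filters if f(inst))
        scored.append((votes / len(filters), inst))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

An analyst can then inspect only the top of the ranking, instead of every instance any single filter happened to flag.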
Annotation Protocol for Textbook Enrichment with Prerequisite Knowledge Graph
Extracting and formally representing the knowledge embedded in textbooks, such as the concepts explained and the relations between them, can support the provision of advanced knowledge-based services for learning environments and digital libraries. In this paper, we consider a specific type of relation in textbooks referred to as prerequisite relations (PR). PRs represent precedence relations between concepts, intended to provide the reader with the knowledge needed to understand further concepts. Their annotation in educational texts produces datasets that can be represented as a graph of concepts connected by PRs. However, building good-quality and reliable datasets of PRs from a textbook is still an open issue, not just for automated annotation methods but even for manual annotation. In turn, the lack of good-quality datasets and well-defined criteria to identify PRs affects the development and validation of automated methods for prerequisite identification. As a contribution to this issue, in this paper, we propose PREAP, a protocol for the annotation of prerequisite relations in textbooks aimed at obtaining reliable annotated data that can be shared, compared, and reused in the research community. PREAP defines a novel textbook-driven annotation method aimed to capture the structure of prerequisites underlying the text. The protocol has been evaluated against baseline methods for manual and automatic annotation. The findings show that PREAP enables the creation of prerequisite knowledge graphs that have higher inter-annotator agreement, accuracy, and alignment with text than the baseline methods. This suggests that the protocol is able to accurately capture the PRs expressed in the text. Furthermore, the findings show that the time required to complete the annotation using PREAP is significantly shorter than with the other manual baseline methods. The paper also includes guidelines, experimentally tested, for using PREAP in three annotation scenarios.
We also provide example datasets and a user interface that we developed to support prerequisite annotation.
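The prerequisite knowledge graph the abstract describes is naturally a directed graph with concepts as nodes and PRs as edges (prerequisite pointing to the dependent concept). A minimal sketch, with illustrative concept names not taken from the paper, shows one practical use of such a graph: deriving a reading order in which every prerequisite precedes the concepts that depend on it.

```python
from collections import defaultdict, deque

def reading_order(prereqs):
    """Topologically sort a prerequisite graph (Kahn's algorithm):
    `prereqs` is a list of (prerequisite, dependent_concept) edges."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for pre, post in prereqs:
        succ[pre].append(post)
        indeg[post] += 1
        nodes.update((pre, post))
    # Start from concepts with no prerequisites.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        concept = queue.popleft()
        order.append(concept)
        for nxt in succ[concept]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("prerequisite graph contains a cycle")
    return order
```

The cycle check doubles as a sanity test on an annotated dataset, since a cycle of prerequisite relations is contradictory by definition.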