139 research outputs found

    A Corpus-Based Tool for Exploring Domain-Specific Collocations in English

    Keyphrase extraction by synonym analysis of n-grams for e-journals categorisation

    Automatic keyword or keyphrase extraction is concerned with assigning keyphrases to documents based on words from within the document. Previous studies have shown that in a significant number of cases author-supplied keywords are not appropriate for the document to which they are attached. This can be either because they represent what the author believes the paper is about, not what it actually is, or because they include keyphrases which are more classificatory than explanatory, e.g., “University of Poppleton” instead of “Knowledge Discovery in Databases”. Thus, there is a need for a system that can generate an appropriate and diverse range of keyphrases that reflect the document. This paper proposes a solution that examines the synonyms of words and phrases in the document to find the underlying themes, and presents these as appropriate keyphrases. The primary method explores taking n-grams of the source document phrases, and examining the synonyms of these, while the secondary considers grouping outputs by their synonyms. The experiments undertaken show the primary method produces good results and that the secondary method produces both good results and potential for future work.

    Automated categorisation of e-journals by synonym analysis of n-grams

    Automatic keyword or keyphrase extraction is concerned with assigning keyphrases to documents based on words from within the document. Previous studies have shown that in a significant number of cases author-supplied keywords are not appropriate for the document to which they are attached. This can be either because they represent what the author believes a paper is about, not what it actually is, or because they include keyphrases which are more classificatory than explanatory, e.g., “University of Poppleton” instead of “Knowledge Discovery in Databases”. Thus, there is a need for a system that can generate an appropriate and diverse range of keyphrases that reflect the document. This paper proposes two possible solutions that examine the synonyms of words and phrases in the document to find the underlying themes, and presents these as appropriate keyphrases. Using three different freely available thesauri, the work undertaken examines two different methods of producing keywords and compares the outcomes across multiple strands in the timeline. The primary method explores taking n-grams of the source document phrases, and examining the synonyms of these, while the secondary considers grouping outputs by their synonyms. The experiments undertaken show the primary method produces good results and that the secondary method produces both good results and potential for future work. In addition, the different qualities of the thesauri are examined and it is concluded that the more entries a thesaurus has, the better it is likely to perform. Neither the age of the thesaurus nor the size of each entry correlates with performance.
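
The idea of taking n-grams and looking up their synonyms can be illustrated with a minimal sketch. This is not the authors' implementation: the toy thesaurus, the theme labels, and the function names `ngrams` and `theme_keyphrases` are all hypothetical, and a real system would draw on one of the freely available thesauri the abstract mentions.

```python
from collections import Counter

# Hypothetical toy thesaurus (an assumption for illustration only):
# each word maps to a shared "theme" label standing in for its synonym set.
TOY_THESAURUS = {
    "mining": "discovery",
    "extraction": "discovery",
    "discovery": "discovery",
    "databases": "data",
    "data": "data",
    "records": "data",
}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def theme_keyphrases(text, n=2, top_k=3):
    """Map each n-gram to its synonym themes and count the themed n-grams.

    N-grams containing a word with no thesaurus entry are skipped.
    """
    tokens = text.lower().split()
    counts = Counter()
    for gram in ngrams(tokens, n):
        themes = tuple(TOY_THESAURUS.get(word) for word in gram)
        if all(themes):
            counts[themes] += 1
    return counts.most_common(top_k)


if __name__ == "__main__":
    sample = "data mining records extraction databases discovery"
    print(theme_keyphrases(sample, n=2, top_k=2))
```

Counting themed n-grams rather than surface n-grams means that phrases worded differently in the text ("data mining", "records extraction") reinforce the same candidate keyphrase.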

    Automatic Acquisition of Knowledge About Multiword Predicates

    PACLIC 19 / Taipei, Taiwan / December 1-3, 200

    Evaluation of cutoff policies for term extraction

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high-quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
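
Expanding a query with the structure and terminology of an ontology can be sketched as follows. This is only an illustrative assumption about how such expansion might work, not Go3R's actual implementation: the toy ontology, its terms, and the function name `expand_query` are hypothetical.

```python
# Hypothetical toy ontology (an assumption for illustration only):
# each term lists its synonyms ("terminology") and child terms ("structure").
TOY_ONTOLOGY = {
    "in vitro method": {
        "synonyms": ["in vitro assay"],
        "children": ["cell culture", "organ-on-chip"],
    },
    "cell culture": {"synonyms": ["cell cultivation"], "children": []},
    "organ-on-chip": {"synonyms": [], "children": []},
}


def expand_query(term, ontology):
    """Return the term, its synonyms, and all descendants with their synonyms.

    Terms are visited depth-first in the order the children are listed;
    a term missing from the ontology expands only to itself.
    """
    expanded = []
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        entry = ontology.get(current, {"synonyms": [], "children": []})
        for synonym in entry["synonyms"]:
            if synonym not in expanded:
                expanded.append(synonym)
        # Reverse so the first listed child is popped first.
        stack.extend(reversed(entry["children"]))
    return expanded


if __name__ == "__main__":
    print(expand_query("in vitro method", TOY_ONTOLOGY))
```

A search for the broad term then also matches documents that mention only a narrower term or a synonym, which is the kind of comprehensive retrieval the abstract describes as the goal.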

    El tratamiento y la representación de las colocaciones verbales en el lenguaje especializado del turismo de aventura

    A collocation is considered a frequent co-occurrence of two words which hold a syntactic relationship and whose elements enjoy a different status. Given their perception as a unit in language, access to the prominent word (base) involves immediate access to the other item (collocate). In terms of meaning, some combinations tend to be more transparent than others. The pervasiveness of these word associations in language has sparked a strong research interest in recent decades. A compelling reason for this interest may be the fact that they are naturally produced by native speakers but must be actively learned by non-native speakers. Not only has this reality led to their treatment in the general language, but it has also become a legitimate field of study in a wide range of specialized languages, such as the environment, computing, law or tourism, which is our object of study. As a consequence, specialized knowledge resources covering this type of word combination have appeared with the primary purpose of offering some extra help to people who deal with this type of language, for example, translators, linguists or other professionals. Nevertheless, there is still much to do in this respect. Taking this into account, it is hypothesized that verb collocations in the specialized language of adventure tourism convey specialized meaning that is worth collecting in terminological products. Therefore, this work endeavors, as its main purpose, to perform a deep analysis of verb collocations in this specialized domain and their implementation in the entries for motion verbs in DicoAdventure, a specialized dictionary of adventure tourism, whose inspirational idea was to highlight the significant role of verbs in the linguistic expression of concepts. 
Accordingly, the following theoretical objectives were set: first, to cover the linguistic branches which influence specialized lexicography; second, to define the concept of specialized collocation; and third, to examine a vast number of lexicographical and terminological resources so as to discover the items of information that would make an adequate representation of collocations in a specialized dictionary and, then, design a model for such a task. Furthermore, the following practical objectives were formulated: first, to extract the motion verbs which would be the bases of the collocations implemented; second, to retrieve the lexical collocations of these verbs; and third, to classify the resulting list of collocations according to the meaning expressed, that is, actual motion or fictive (or metaphorical) motion. The practical steps taken in this research were based on the English monolingual specialized corpus ADVENCOR, which contains promotional texts about adventure tourism, and the use of corpus management software. The results of the theoretical work can be summarized as follows: (1) the specialized language of adventure tourism must be considered as specialized as any other; (2) collocations are not usually encoded in verb entries in dictionaries; and (3) a specialized collocation carries specialized knowledge which must be covered in terminological products. On the other hand, regarding the practical work, 12% of the verbs extracted were selected, as they were the ones expressing motion. However, only 46.61% of them produced collocations according to the extraction criteria established. Last, after applying stricter criteria for the collocation classification, only 25.42% of the verbs along with their collocations were collected in the dictionary. In addition to these results, the theory of Frame Semantics proved useful to understand the meaning of the verbs and their collocates. 
As for their implementation, which was the primary objective of this doctoral dissertation, the inclusion of verb collocations was of paramount importance for the identification of distinct meanings expressed by one verb in different contexts, as collocates conveyed subtle nuances of meaning. Finally, it was concluded that the incorporation of explanations about the combinations in lay terms facilitates the comprehension of the entries for any type of user, from experts to laypersons, which makes DicoAdventure a terminological product that can render valuable assistance to individuals with distinct specialized expertise.

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

    The assessment of the special educational needs for children with autism in Singapore.

    In Singapore, there is a high reliance on IQ scores as the basis for deciding children's access to special educational provisions. Children with disabilities remain in the mainstream if they are perceived to be able to cope with the demands of the mainstream schools. On the other hand, if children are seen to require intensive support, referral to special schools is initiated. This thesis aims to evaluate the validity of measures of intelligence and other selected indicators of special educational needs (SEN) for children with autism in Singapore. The first phase of the thesis involved identifying an independent measure of SEN. Results of Study 1, which involved interviews with the parents of 40 children with autism, provided support for the International Classification of Functioning, Disability and Health (ICF: WHO, 2001) as an adequate independent measure of SEN. The second phase involved the evaluation of selected indicators of SEN that can be used alongside the ICF, namely measures of intelligence, theory of mind, executive function, central coherence and cognitive modifiability. These were evaluated based on their psychometric and treatment validity, as defined in educational contexts. For evaluations of psychometric validity, two criteria were used: firstly, the extent to which the indicators were able to predict children's SEN level and secondly, the extent to which the indicators were able to distinguish children with autism who can cope with mainstream schools from those who require special schools. This involved individual assessments with 52 children with autism and interviews with their parents (Study 2). For evaluations of treatment validity, a qualitative approach was adopted to obtain practitioners' views on the extent to which the indicators of SEN were able to provide information that can be used to plan interventions (Study 3). 
The findings indicated that it was the combination of indicators that accounted for the greatest variance in the SEN levels of children with autism. However, depending on the purpose of testing and the sub-group of children with autism, different indicators proved to have different strengths of validity. When the treatment validity of these measures was evaluated, measures of theory of mind showed the strongest treatment validity. The findings are discussed in terms of their implications for SEN assessments in Singapore, and the assessment of children with autism in general.