
    Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring

    Text representation is the process of transforming text into formats that computer systems can use for subsequent information-related tasks such as text classification. Representing text faces two main challenges: the meaningfulness of representation and unknown terms. Research has shown that these challenges can be addressed by using the rich semantics in ontologies. This study addresses them by using ontology-based representation and unknown-term reasoning in the context of content scoring of speech, a less explored area compared to common ones such as categorizing text corpora (e.g. 20 Newsgroups and Reuters). From the perspective of language assessment, the increasing number of language learners taking second-language tests makes automatic scoring an attractive alternative to human scoring for delivering rapid and objective scores of written and spoken test responses. This study focuses on the speaking section of second-language tests and investigates ontology-based approaches to speech scoring. Most previous automated speech scoring systems for spontaneous responses assess speech primarily using acoustic features such as fluency and pronunciation, while text features are less often exploited. As content is an integral part of speech, the study is motivated by the lack of rich text features in speech scoring and is designed to examine the effects of different text features on scoring performance. A central question is how speech transcript content can be represented in a form appropriate for speech scoring. Previously used approaches from essay and speech scoring systems include bag-of-words and latent semantic analysis representations, which are adopted as baselines in this study; the experimental approaches are ontology-based, which can help improve the meaningfulness of representation units and estimate the importance of unknown terms. Two general-domain ontologies, WordNet and Wikipedia, are used for the ontology-based representations. In addition to comparing representation approaches, the author analyzes which parameter option leads to the best performance within each representation. The experimental results show that, on average, ontology-based representations slightly enhance speech scoring performance on all measurements when combined with the bag-of-words representation; reasoning over unknown terms increases performance on one measurement (cos.w4) but decreases the others. Due to the small data size, significance tests (t-tests) show that the enhancement from ontology-based representations is inconclusive. The contributions of the study are: 1) it examines the effects of different representation approaches on speech scoring tasks; 2) it enhances the understanding of the mechanisms of representation approaches and their parameter options via in-depth analysis; 3) the representation methodology and framework can be applied to other tasks such as automatic essay scoring.
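As an illustration of the ontology-based representation idea described above, the following minimal sketch maps transcript tokens onto WordNet synset identifiers so that different surface forms of the same concept share a representation unit, while terms unknown to the ontology fall back to the raw token. The use of NLTK's WordNet interface is an assumption of this sketch, not the thesis's actual implementation.

```python
# Hypothetical sketch: map transcript tokens to WordNet synset IDs so that
# synonymous surface forms collapse onto shared representation units.
# Requires: pip install nltk, then nltk.download('wordnet') before first use.
from collections import Counter
from nltk.corpus import wordnet as wn

def bow_features(tokens):
    """Plain bag-of-words counts (the baseline representation)."""
    return Counter(tokens)

def synset_features(tokens):
    """Ontology-based counts: each token contributes its first WordNet synset,
    falling back to the raw token when the term is unknown to the ontology."""
    feats = Counter()
    for tok in tokens:
        synsets = wn.synsets(tok)
        feats[synsets[0].name() if synsets else tok] += 1
    return feats

transcript = "the lecture covered automobiles and cars".split()
print(bow_features(transcript))
print(synset_features(transcript))  # 'automobiles' and 'cars' may share car.n.01
```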

    Induction, Semantic Validation and Evaluation of a Derivational Morphology Lexicon for German

    This thesis is about computational morphology for German derivation. Derivation is a word formation process that creates new words from existing ones, where the base and the derived word share the same stem. Mostly, derivation is conducted by means of relatively regular affixation rules, as in to bake - bakery. In German, derivation is highly productive, thus leading to a high language variability which can be employed to express similar facts in different ways, as derivationally related words are often also semantically related (or transparent). However, linguistic variance is a challenge for computational applications, particularly in semantic processing: It makes it more difficult to automatically grasp the meaning of texts and to match similar information onto each other. Thus, computational systems require linguistic knowledge. We develop methods to induce and represent derivational knowledge, and to apply it in language processing. The main outcome of our study is DErivBase, a German derivational lexicon. It groups derivationally related words (words that are derived from the same stem) into derivational families. To achieve high quality and high coverage, we induce DErivBase by combining rule-based and data-driven methods: We implement linguistic derivation rules to define derivational processes, and feed lemmas extracted from a German corpus into the rules to derive new lemmas. All words that are connected - directly or indirectly - by such rules are considered a derivational family. As mentioned above, a derivational relationship often implies semantic relationship, but this is not always the case. Semantic drifts can cause semantically unrelated (opaque) derivational relations, such as to depart - department. Capturing the difference between transparent and opaque relations is important from a linguistic as well as a practical point of view. Thus, we conduct a semantic refinement of DErivBase, i.e., we determine which lemma pairs are derivationally and semantically related, and which are not. We establish a second, semantically validated version of our lexicon, where families are sub-clustered according to semantic coherence, using supervised machine learning methods: We learn a binary classifier based on features that arise from structural information about the derivation rules, and from distributional information about the semantic relatedness of lemmas. Accordingly, the derivational families are subdivided into semantically coherent clusters. To demonstrate the utility of the two lexicon versions, we evaluate them on three extrinsic - and in the broadest sense, semantic - tasks. The underlying assumption for applying DErivBase to semantic tasks is that derivational relatedness is a reasonable approximation to semantic relatedness, since derivation is often semantically transparent. Our three experiments are the following: 1., we incorporate DErivBase into distributional semantic models to overcome sparsity problems and to improve the prediction quality of the underlying model. We test this method, which we call derivational smoothing, for semantic similarity prediction, and for synonym choice. 2., we employ DErivBase to model a psycholinguistic experiment that examines priming effects of transparent and opaque derivations to draw conclusions about the mental lexical representation in German. Derivational information is again incorporated into a distributional model, but this time, it introduces a kind of morphological generalisation. 
3., in order to solve the task of Recognising Textual Entailment, we integrate DErivBase into a matching-based entailment system by means of query expansion. Assuming that derivational relationships between two texts suggest entailment rather than non-entailment, this expansion increases the chance of a lexical overlap, which should improve the system's entailment predictions. The incorporation of DErivBase indeed improves the performance of the underlying systems in each task; however, its suitability differs across settings. In experiment 1, the semantically validated lexicon yields improvements over the purely morphological lexicon, and the more coarse-grained similarity prediction profits more from DErivBase than the synonym choice. In experiment 2, purely morphological information clearly outperforms the other lexicon version, as the latter cannot model opaque derivations. On the entailment task in experiment 3, DErivBase has only minor impact, because textual entailment is hard to solve by addressing only one linguistic phenomenon. In sum, our findings show that the induction of a high-quality, high-coverage derivational lexicon is beneficial for very different applications in computational linguistics. It might be worthwhile to further investigate the semantic aspects of derivation to better understand its impact on language and thus on language processing.
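To make the notion of a derivational family concrete, here is a toy sketch: lemmas connected directly or indirectly by derivation rules are grouped into one family using union-find. The rules and lemma set below are invented for illustration and are not DErivBase's actual German rule inventory.

```python
# Toy illustration of building derivational families (not DErivBase's real
# German rules): lemmas connected directly or indirectly by a derivation rule
# end up in the same family, computed here with union-find.
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Simplified affixation rules mapping a verb lemma to possible derived lemmas.
rules = [
    lambda w: w[:-2] + "ung" if w.endswith("en") else None,  # laden -> ladung
    lambda w: w[:-2] + "er" if w.endswith("en") else None,   # lehren -> lehrer
    lambda w: w[:-2] + "e" if w.endswith("en") else None,    # lehren -> lehre
    lambda w: "be" + w,                                      # laden -> beladen
]
lemmas = {"lehren", "lehrer", "lehre", "backen", "laden", "ladung", "beladen"}

uf = UnionFind()
for lemma in lemmas:
    for rule in rules:
        derived = rule(lemma)
        if derived and derived in lemmas:
            uf.union(lemma, derived)  # connect base and derived lemma

families = {}
for lemma in lemmas:
    families.setdefault(uf.find(lemma), set()).add(lemma)
# e.g. [{'lehren', 'lehrer', 'lehre'}, {'laden', 'ladung', 'beladen'}, {'backen'}]
print(list(families.values()))
```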

    Utilizing graph-based representation of text in a hybrid approach to multiple documents summarization

    The aim of automatic text summarization is to process text in order to identify and present the most important information it contains. In this research, we investigate automatic multi-document summarization using a hybrid approach of extractive and "shallow" abstractive methods. We utilize the graph-based representation of text proposed in [1] and [2] as part of our method, aiming to produce concise, informative and coherent summaries. We start by scoring sentences based on significance to extract the top-scoring ones from each document of the set being summarized. In this step, we examine different criteria for scoring sentences, which include the presence of highly frequent words of the document, the presence of highly frequent words of the document set, the presence of words found in the first and last sentences of the document, and combinations of these features. Our experiments showed that the best combination is the presence of highly frequent words of the document together with the presence of words found in the first and last sentences of the document; its f-score was on average 7.9% higher than those of the other features. Secondly, we address redundancy of information by clustering sentences carrying the same or similar information into one cluster, which is then compressed into a single sentence. We investigated clustering the extracted sentences using two similarity criteria, the first based on word-frequency vectors and the second on word semantic similarity. Our experiments showed that the word-frequency vectors yield much better clusters in terms of sentence similarity: they produced 20% more clusters labeled as containing similar sentences than the semantic-similarity feature did. We then adopt the graph-based representation of text proposed in [1] and [2] to represent each sentence in a cluster and, using k-shortest paths, find the shortest path to serve as the final compressed sentence in the summary. Human evaluators scored sentences for grammatical correctness, and almost 74% of the 51 evaluated sentences received the maximum score of 2, indicating a perfect or near-perfect sentence. Finally, we propose a method for ordering the compressed sentences in the final summary. We used the Document Understanding Conference dataset for 2014 to evaluate our final system, with ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatic summaries to "ideal" human references. We also compared our summaries' ROUGE scores to those of summaries generated with the MEAD summarization tool. Our system achieved better precision and f-score and comparable recall: on average, precision was 2% higher and f-score 1.6% higher than MEAD's, while MEAD's recall was 0.8% higher. In addition, our system produced a more compressed summary than MEAD.
We also ran an experiment to evaluate the sentence order and comprehensibility of the final summaries, showing that our ordering method produces comprehensible summaries: summaries with a perfect comprehensibility score constituted 72% of those evaluated. Evaluators were also asked to count ungrammatical and incomprehensible sentences, which on average made up only 10.9% of the summaries' sentences. We believe our system provides a "shallow" abstractive summary of multiple documents without requiring intensive natural language processing.
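As a concrete illustration of the sentence-scoring step, the sketch below scores each sentence by its overlap with the document's most frequent words and with the words of the first and last sentences, which is the best-performing feature combination reported above. The tokenization and the unweighted sum are assumptions of this sketch, not the thesis's exact procedure.

```python
# Minimal sketch of sentence scoring: reward (a) overlap with the document's
# most frequent words and (b) overlap with the first and last sentences.
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def score_sentences(sentences, top_k=10):
    doc_tokens = [t for s in sentences for t in tokenize(s)]
    frequent = {w for w, _ in Counter(doc_tokens).most_common(top_k)}
    boundary = set(tokenize(sentences[0])) | set(tokenize(sentences[-1]))
    scores = []
    for sent in sentences:
        tokens = set(tokenize(sent))
        scores.append(len(tokens & frequent) + len(tokens & boundary))
    return scores

doc = ["Text summarization selects important sentences.",
       "Many scoring features exist.",
       "Frequent words and boundary sentences guide extraction of a summary."]
print(sorted(zip(score_sentences(doc), doc), reverse=True)[:2])  # top-2 sentences
```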

    Learning of a multilingual bitaxonomy of Wikipedia and its application to semantic predicates

    The ability to extract hypernymy information on a large scale is becoming increasingly important in natural language processing, the area of artificial intelligence that deals with the processing and understanding of natural language. While initial studies extracted this type of information from textual corpora by means of lexico-syntactic patterns, over time researchers moved to alternative, more structured sources of knowledge, such as Wikipedia. After the first attempts to extract is-a information from Wikipedia categories, a full line of research gave birth to numerous knowledge bases whose information, however, is either incomplete or irremediably bound to English. To address this, we put forward MultiWiBi, the first approach to the construction of a multilingual bitaxonomy, which exploits the inner connection between Wikipedia pages and Wikipedia categories to induce a wide-coverage and fine-grained integrated taxonomy. A series of experiments show state-of-the-art results against all the taxonomic resources available in the literature, also with respect to two novel measures of comparison. Another dimension where existing resources usually fall short is their degree of multilingualism. While knowledge is typically language-agnostic, current resources can extract relevant information only for languages with high-quality tools. In contrast, MultiWiBi does not leave any language behind: we show how to taxonomize Wikipedia in an arbitrary language, in a way that is fully independent of additional resources. At the core of our approach lies the idea that the English version of Wikipedia can be exploited as a pivot to project the taxonomic information extracted from English onto any other Wikipedia language, yielding a bitaxonomy in a second, arbitrary language; as a result, not only concepts which have an English equivalent are covered, but also concepts which are not lexicalized in the source language. We also present the impact of embedding the taxonomized encyclopedic knowledge offered by MultiWiBi into a semantic model of predicates (SPred), which crucially leverages Wikipedia to generalize collections of related noun phrases and infer a probability distribution over expected semantic classes. We applied SPred to a word sense disambiguation task and show that, when MultiWiBi is plugged in to replace an internal component, SPred's generalization power increases, as do its precision and recall. Finally, we published MultiWiBi as linked data, a paradigm which fosters interoperability and interconnection among resources and tools through the publication of data on the Web, and developed a public interface which lets users navigate MultiWiBi's taxonomic structure in a graphical, captivating manner.
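The pivot-based projection can be pictured with a toy sketch. The hypernymy edges and the dictionary standing in for Wikipedia interlanguage links below are purely illustrative, not MultiWiBi's data or code.

```python
# Toy sketch of the pivot idea: project English hypernymy edges to another
# language through interlanguage links (here, a hand-made English->Italian map).
en_hypernym_edges = [("Dog", "Domesticated animal"), ("Domesticated animal", "Animal")]
langlinks_en_to_it = {"Dog": "Canis lupus familiaris", "Animal": "Animalia"}

def project_edges(edges, langlinks):
    projected = []
    for child, parent in edges:
        # Fall back to the English title when no target-language page exists,
        # so concepts without a target-language lexicalization are still covered.
        projected.append((langlinks.get(child, child), langlinks.get(parent, parent)))
    return projected

print(project_edges(en_hypernym_edges, langlinks_en_to_it))
```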

    LiDom builder: Automatising the construction of multilingual domain modules

    This work presents the analysis, design and evaluation of the LiDOM Builder tool. LiDOM Builder enables the automatic extraction of Multilingual Domain Modules for technology-supported learning tools from electronic textbooks. For knowledge acquisition it uses Natural Language Processing and Machine Learning techniques together with several multilingual resources, among others Wikipedia and WordNet. On the path from monolingual to multilingual Domain Modules, LiDOM Builder can be seen as an evolution of the DOM-Sortze framework (Larrañaga, 2012; Larrañaga et al., 2014). To this end, LiDOM Builder provides a mechanism that allows the domain to be represented from a multilingual perspective. A Multilingual Domain Module gathers knowledge at two different levels: the Learning Domain Ontology (LDO), which contains the topics labelled in the different languages and the pedagogical relationships among them, and the Learning Objects (LOs), i.e., a collection of didactic resources annotated with metadata in those languages. LiDOM Builder allows the domain topics to be expressed in all the supported languages, each topic being linked to its equivalent label in the corresponding language. Moreover, it uses enriched metadata to describe the LOs and to link didactic resources that are equivalent across languages. In LiDOM Builder, the domain module is initially extracted from a document written in one language, and multilingual resources are then used to obtain both the topics and the LOs in the other languages. In this work, textbooks written in English constitute the main source of information for both the tuning and the evaluation processes; specifically, the following textbooks have been used: Principles of Object Oriented Programming (Wong and Nguyen, 2010), Introduction to Astronomy (Morison, 2008) and Introduction to Molecular Biology (Raineri, 2010). As multilingual resources, Wikipedia, WordNet and several other knowledge bases derived from Wikipedia have been used. To build Multilingual Domain Modules from textbooks, LiDOM Builder relies on three main modules: LiTeWi and LiReWi build the multilingual LDO, while LiLoWi builds the multilingual LOs. Each module is described in more detail below.
• LiTeWi (Conde et al., 2015) identifies, starting from a textbook on any learning domain, the multilingual terms of an Educational Ontology, using unsupervised extraction techniques such as TF-IDF, KP-Miner, CValue and a shallow parsing grammar, together with Wikipedia. Extracting the ontology topics in LiTeWi involves three steps: first, the extraction of candidate terms; second, the combination and refinement of the obtained terms into the final term list; and finally, the mapping of the listed terms to the other languages using Wikipedia.
• LiReWi (Conde et al., pending acceptance) enriches the Educational Ontology with pedagogical relationships, again using the textbook as the starting point. It extracts four types of pedagogical relationships (isA, partOf, prerequisite and pedagogicallyClose) by combining several techniques and knowledge bases, including Wikipedia, WordNet, WikiTaxonomy, WibiTaxonomy and WikiRelations. LiReWi also proceeds in three steps: first, it maps the ontology topics to the different knowledge bases used for relationship extraction; then it concurrently runs several relationship extractors, each based on a different technique, to obtain candidate relationships; finally, it combines and filters all the results to produce the final set of pedagogical relationships. In addition, in the transition from DOM-Sortze to LiDOM Builder, this thesis improves the isA and partOf relationships extracted from document indexes by using Wikipedia as an additional resource (Conde et al., 2014).
• LiLoWi extracts the LOs, some of them multilingual, not only from the source textbook but also from knowledge bases such as Wikipedia and WordNet. After mapping each topic of the LDO to Wikipedia and WordNet, LiLoWi extracts didactic resources using several LO extractors. In the LO extraction process, on the way from DOM-Sortze to LiDOM Builder and before incorporating Wikipedia and WordNet, English was also added as a supported language and evaluated (Conde et al., 2012).
Regarding the evaluation of LiDOM Builder, each module has been tested and evaluated separately, using both gold-standard techniques and expert evaluation; the improvement that the integration of the Wikipedia and WordNet knowledge bases brings to LO extraction has also been evaluated, and the results obtained are very good in all cases. To conclude and summarise, LiDOM Builder makes four main contributions to the field of Multilingual Domain Modules:
• A suitable mechanism for representing Multilingual Domain Modules.
• The development of LiTeWi, which enables the extraction of multilingual terminology for Educational Ontologies from textbooks; the term extractor for English and Spanish is available at https://github.com/Neuw84/LiTe.
• The development of LiReWi, which enables the extraction of pedagogical relationships for Educational Ontologies from textbooks; the Wikipedia/WordNet mapper it uses is available at https://github.com/Neuw84/Wikipedia2WordNet.
• The development of LiLoWi, which enables the extraction of multilingual LOs using the textbook and the Wikipedia and WordNet knowledge bases.
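To give a flavour of the candidate-term extraction in LiTeWi, the following minimal sketch ranks unigrams by TF-IDF across a textbook's chapters. It shows only the TF-IDF component; the thesis combines several extractors (TF-IDF, KP-Miner, CValue, shallow parsing) and maps terms across languages with Wikipedia, none of which is reproduced here.

```python
# Hedged sketch of one LiTeWi-style step: rank unigram candidates by TF-IDF,
# treating each chapter of the textbook as a document.
import math
import re
from collections import Counter

def tfidf_candidates(chapters, top_n=5):
    tokenized = [re.findall(r"[a-z]+", ch.lower()) for ch in chapters]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    n_docs = len(chapters)
    scores = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        for w, f in tf.items():
            scores[w] = max(scores[w], f * math.log(n_docs / df[w]))
    return [w for w, _ in scores.most_common(top_n)]

chapters = ["Classes and objects encapsulate state and behaviour.",
            "Inheritance lets a subclass reuse behaviour of its superclass.",
            "Polymorphism dispatches a method call to the object's class."]
print(tfidf_candidates(chapters))
```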

    Three Essays on Trust Mining in Online Social Networks

    This dissertation consists of three essays on trust in online social networks. Trust plays a critical role in online social relationships because of the high levels of risk and uncertainty involved. Guided by relevant social science and computational graph theories, I develop conceptual and predictive models to gain insights into trusting behaviors in online social relationships. In the first essay, I propose a conceptual model of trust formation in online social networks. This is the first study that integrates the existing graph-based view of trust formation in social networks with socio-psychological theories of trust to provide a richer understanding of trusting behaviors in online social networks. I introduce new behavioral antecedents of trusting behaviors and redefine and integrate existing graph-based concepts to develop the proposed conceptual model. The empirical findings indicate that both socio-psychological and graph-based trust-related factors should be considered in studying trust formation in online social networks. In the second essay, I propose a theory-based predictive model to predict trust and distrust links in online social networks. Previous trust prediction models used limited network structural data to predict future trust/distrust relationships, ignoring the underlying behavioral trust-inducing factors. I identify a comprehensive set of behavioral and structural predictors of trust/distrust links based on related theories, and then build multiple supervised classification models to predict trust/distrust links in online social networks. The empirical results confirm the superior fit and predictive performance of the proposed model over the baselines. In the third essay, I propose a lexicon-based text mining model to mine trust-related user-generated content (UGC). This is the first theory-based text mining model to examine important factors in online trusting decisions from UGC. I build domain-specific trustworthiness lexicons for online social networks based on related behavioral foundations and text mining techniques. Next, I propose a lexicon-based text mining model that automatically extracts and classifies trustworthiness characteristics from trust reviews. The empirical evaluations show the superior performance of the proposed text mining system over the baselines.
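A minimal sketch of the trust/distrust link-prediction setup from the second essay is given below, assuming illustrative placeholder features and scikit-learn's logistic regression as one possible supervised classifier; the dissertation's actual behavioral and structural predictors and models are not reproduced here.

```python
# Illustrative sketch: combine behavioral and graph-structural features per user
# pair and train a supervised classifier to predict trust (1) vs. distrust (0).
from sklearn.linear_model import LogisticRegression

# Each row: [reciprocity, interaction_count, common_neighbors, status_gap]
# These feature names are placeholders, not the dissertation's exact set.
X_train = [[1, 12, 5, 0.1], [0, 1, 0, 0.9], [1, 7, 3, 0.2], [0, 0, 1, 0.8]]
y_train = [1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[1, 9, 4, 0.15]]))        # predicted link sign for a new pair
print(model.predict_proba([[1, 9, 4, 0.15]]))  # class probabilities
```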

    Fuzzy natural language similarity measures through computing with words

    A vibrant area of research is the understanding of human language by machines so that they can engage in conversation with humans to achieve set goals. Human language is inherently fuzzy, with words meaning different things to different people depending on the context. Fuzzy words are words with a subjective meaning, typically used in everyday human natural language dialogue, and are often ambiguous, vague in meaning and dependent on an individual's perception. Fuzzy Sentence Similarity Measures (FSSMs) are algorithms that compare two or more short texts containing fuzzy words and return a numeric measure of similarity of meaning between them. The motivation for this research is to create a new FSSM called FUSE (FUzzy Similarity mEasure). FUSE is an ontology-based similarity measure that uses Interval Type-2 Fuzzy Sets to model relationships between categories of human perception-based words. Four versions of FUSE (FUSE_1.0 – FUSE_4.0) have been developed, investigating the presence of linguistic hedges, the expansion of fuzzy categories and their use in natural language, the incorporation of logical operators such as 'not', and the introduction of the fuzzy influence factor. FUSE has been compared to several state-of-the-art traditional semantic similarity measures (SSMs), which do not consider the presence of fuzzy words. FUSE has also been compared to the only published FSSM, FAST (Fuzzy Algorithm for Similarity Testing), which has a limited dictionary of fuzzy words and uses Type-1 Fuzzy Sets to model relationships between categories of human perception-based words. Results show that FUSE improves on the limitations of traditional SSMs and the FAST algorithm by achieving a higher correlation with the average human rating (AHR) on several published and gold-standard datasets. To validate FUSE in the context of a real-world application, versions of the algorithm were incorporated into a simple Question & Answer (Q&A) dialogue system (DS), referred to as FUSION, to evaluate the improvement in natural language understanding. FUSION was tested on two different scenarios with human participants, and the results were compared to a traditional SSM known as STASIS. The DS experiments showed a True rating of 88.65% for FUSION, compared to an average True rating of 61.36% for STASIS. These results show that the FUSE algorithm can be used within real-world applications, and the evaluation of the DS showed an improvement in natural language understanding, allowing semantic similarity to be calculated more accurately from natural user responses. The key contributions of this work can be summarised as follows: the development of a new methodology to model fuzzy words using Interval Type-2 fuzzy sets, leading to the creation of a fuzzy dictionary for nine fuzzy categories, a useful resource for other researchers in natural language processing and Computing with Words and for other fuzzy applications such as semantic clustering; the development of the FSSM FUSE, expanded over four versions, investigating the incorporation of linguistic hedges, the expansion of fuzzy categories and their use in natural language, the inclusion of logical operators such as 'not', and the introduction of the fuzzy influence factor;
and the integration of the FUSE algorithm into a simple Q&A DS referred to as FUSION, demonstrating that an FSSM can be used in a real-world practical implementation and making FUSE and its fuzzy dictionary generalisable to other applications.
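To illustrate how a perception-based word can be modelled with an Interval Type-2 fuzzy set and compared to another, here is a toy sketch using an interval Jaccard similarity over a discretized 0-10 scale. The membership parameters and the Jaccard choice are assumptions made for illustration; this is not the FUSE implementation.

```python
# Toy sketch: two fuzzy words on a 0-10 scale, each modelled by a lower and an
# upper triangular membership function (an interval type-2 fuzzy set), compared
# with the interval Jaccard similarity.
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def it2_membership(x, lower_params, upper_params):
    return tri(x, *lower_params), tri(x, *upper_params)

def it2_jaccard(word_a, word_b, domain):
    num = den = 0.0
    for x in domain:
        la, ua = it2_membership(x, *word_a)
        lb, ub = it2_membership(x, *word_b)
        num += min(ua, ub) + min(la, lb)
        den += max(ua, ub) + max(la, lb)
    return num / den if den else 0.0

# (lower MF params, upper MF params) chosen purely for illustration.
small  = ((1.5, 3.0, 4.5), (1.0, 3.0, 5.0))
little = ((2.0, 3.5, 5.0), (1.5, 3.5, 5.5))
domain = [i / 10 for i in range(0, 101)]
print(round(it2_jaccard(small, little, domain), 3))
```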

    Development of a text mining approach to disease network discovery

    Scientific literature is one of the major sources of knowledge for systems biology, in the form of papers, patents and other types of written reports. Text mining methods aim at automatically extracting relevant information from the literature. The hypothesis of this thesis was that biological systems could be elucidated by the development of text mining solutions that can automatically extract relevant information from documents. The first objective consisted in developing software components to recognize biomedical entities in text, which is the first step in generating a network about a biological system. To this end, a machine learning solution was developed, which can be trained for specific biological entities using an annotated dataset, obtaining high-quality results. Additionally, a rule-based solution was developed, which can be easily adapted to various types of entities. The second objective consisted in developing an automatic approach to link the recognized entities to a reference knowledge base. A solution based on the PageRank algorithm was developed in order to match the entities to the concepts that most contribute to the overall coherence. The third objective consisted in automatically extracting relations between entities, to generate knowledge graphs about biological systems. Due to the lack of annotated datasets available for this task, distant supervision was employed to train a relation classifier on a corpus of documents and a knowledge base. The applicability of this approach was demonstrated in two case studies: microRNA-gene relations for cystic fibrosis, obtaining a network of 27 relations from the abstracts of 51 recently published papers; and cell-cytokine relations for tolerogenic cell therapies, obtaining a network of 647 relations from 3264 abstracts. Through a manual evaluation, the information contained in these networks was determined to be relevant. Additionally, a solution combining deep learning techniques with ontology information was developed, to take advantage of the domain knowledge provided by ontologies. This thesis contributed several solutions that demonstrate the usefulness of text mining methods to systems biology by extracting domain-specific information from the literature. These solutions make it easier to integrate various areas of research, leading to a better understanding of biological systems.
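The distant-supervision step described above can be sketched as follows: sentence-level entity pairs that also appear in a knowledge base become positive training examples for the relation classifier, while co-occurring pairs absent from the knowledge base become negatives. The knowledge-base pair and sentences below are illustrative placeholders, not data from the thesis.

```python
# Hedged sketch of distant supervision for relation extraction.
knowledge_base = {("miR-509", "CFTR")}  # illustrative microRNA-gene pair

sentences = [
    ("miR-509 represses CFTR expression in airway cells.", "miR-509", "CFTR"),
    ("miR-126 was profiled alongside TNF in the same assay.", "miR-126", "TNF"),
]

training_examples = []
for text, entity_a, entity_b in sentences:
    # Label 1 if the pair is a known relation in the KB, otherwise 0.
    label = 1 if (entity_a, entity_b) in knowledge_base else 0
    training_examples.append((text, entity_a, entity_b, label))

for example in training_examples:
    print(example)
```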