3,520 research outputs found

    Qualities, objects, sorts, and other treasures : gold digging in English and Arabic

    In the present monograph, we will deal with questions of lexical typology in the nominal domain. By the term "lexical typology in the nominal domain", we refer to crosslinguistic regularities in the interaction between (a) those areas of the lexicon whose elements are capable of being used in the construction of "referring phrases" or "terms" and (b) the grammatical patterns in which these elements are involved. In the traditional analyses of a language such as English, such phrases are called "nominal phrases". In the study of the lexical aspects of the relevant domain, however, we will not confine ourselves to the investigation of "nouns" and "pronouns" but intend to take into consideration all those parts of speech which systematically alternate with nouns, either as heads or as modifiers of nominal phrases. In particular, this holds true for adjectives both in English and in other Standard European Languages. It is well known that adjectives are often difficult to distinguish from nouns, or that elements with an overt adjectival marker are used interchangeably with nouns, especially in particular semantic fields such as those denoting MATERIALS or NATIONALITIES. That is, throughout this work the expression "lexical typology in the nominal domain" should not be interpreted as "a typology of nouns", but, rather, as the cross-linguistic investigation of lexical areas constitutive for "referring phrases" irrespective of how the parts-of-speech system in a specific language is defined.

    Mapping Acoustic and Semantic Dimensions of Auditory Perception

    Auditory categorisation is a function of sensory perception which allows humans to generalise across many different sounds present in the environment and classify them into behaviourally relevant categories. These categories cover not only the variance of acoustic properties of the signal but also a wide variety of sound sources. However, it is unclear to what extent the acoustic structure of sound is associated with, and conveys, different facets of semantic category information. It also remains unknown whether people use such information, and what drives their decisions when both acoustic and semantic information about the sound is available. To answer these questions, we used existing methods broadly practised in linguistics, acoustics and cognitive science, and bridged these domains by delineating their shared space. Firstly, we took a model-free exploratory approach to examine the underlying structure and inherent patterns in our dataset. To this end, we ran principal component, clustering and multidimensional scaling analyses. At the same time, we mapped the topography of the sound labels’ semantic space using corpus-based word embedding vectors. We then built an LDA model predicting class membership and compared the model-free approach and model predictions with the actual taxonomy. Finally, by conducting a series of web-based behavioural experiments, we investigated whether acoustic and semantic topographies relate to perceptual judgements. This analysis pipeline showed that natural sound categories could be successfully predicted based on the acoustic information alone and that perception of natural sound categories has some acoustic grounding. Results from our studies help to recognise the role of physical sound characteristics and their meaning in the process of sound perception, and offer insight into the mechanisms governing machine-based and human classification.

    A corpus-driven study of features of Chinese students' undergraduate writing in UK universities

    Chinese people now comprise the ‘largest single overseas student group in the UK’ with more than 85,000 Chinese students registered at UK institutions in 2009 (British Council, 2010a). While there have been many studies carried out on short argumentative essays from this group (e.g. Chen, 2009), and on postgraduate theses (e.g. Hyland, 2008b), there has been comparatively little research conducted on the high-stakes genre of undergraduate assignments. This study examines assessed writing from Chinese and British undergraduates studying in UK universities between 2000 and 2008; these are investigated using corpus linguistic procedures, supported by qualitative reading. A particular focus is the use of lexical chunks, or recurring strings of words. Findings from the literature on Chinese students’ written English indicate high use of informal chunks, connecting chunks, and those containing first person pronouns (e.g. Milton, 1999). This study found that while the Chinese students make greater use of particular connectors and the first person plural, both student groups make (limited) use of informal language. These areas of difference are more apparent in year 1/2 assignments than in those from year 3, suggesting that students gradually conform to the academy’s expectations. Unexpected findings which have not been previously identified in the literature include Chinese students’ significantly higher use of tables, figures (or ‘visuals’) and lists, compared to the British students’ writing. Detailed exploration of writing within Biology, Economics and Engineering suggests that using visuals and lists is a different, yet equally acceptable, way of writing assignments. Since the writing of both student groups has been judged by discipline specialists to be of a high standard, it is argued that the difference in use of visuals and lists illustrates the range of acceptability at undergraduate level. The thesis proposes that scholars therefore need to consider expanding the notion of what constitutes ‘good’ student writing.

    Semantic enrichment of knowledge sources supported by domain ontologies

    This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the process of formalization and representation of document contents, where most existing approaches are limited and only take into account the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched through incorporation of implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof-of-concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which makes it possible to validate the ideas established under the scope of this thesis; (vi) publication of several scientific articles and support for four master's dissertations carried out in the Department of Electrical and Computer Engineering at FCT/UNL. It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to build on approaches which are considered scientifically sound and coherent and to avoid “reinventing the wheel”. European research projects: CoSpaces (IST-5-034245), CRESCENDO (FP7-234344) and MobiS (FP7-318452).

    Sparse distributed representations as word embeddings for language understanding

    Word embeddings are vector representations of words that capture semantic and syntactic similarities between them. Similar words tend to have closer vector representations in an N-dimensional space considering, for instance, the Euclidean distance between the points associated with the word vector representations in a continuous vector space. This property makes word embeddings valuable in several Natural Language Processing tasks, from word analogy and similarity evaluation to the more complex text categorization, summarization or translation tasks. Typically, state-of-the-art word embeddings are dense vector representations with low dimensionality, varying from tens to hundreds of floating-point dimensions, usually obtained from unsupervised learning on considerable amounts of text data by training and optimizing an objective function of a neural network. This work presents a methodology to derive word embeddings as binary sparse vectors, that is, word vector representations with high dimensionality, sparse representation and binary features (i.e. composed only of ones and zeros). The proposed methodology tries to overcome some disadvantages associated with state-of-the-art approaches, namely the size of the corpus needed for training the model, while presenting comparable evaluations in several Natural Language Processing tasks. Results show that high-dimensionality sparse binary vector representations, obtained from a very limited amount of training data, achieve comparable performance in intrinsic similarity and categorization tasks, whereas in analogy tasks good results are obtained only for noun categories. Our embeddings outperformed eight state-of-the-art word embeddings in word similarity tasks, and two word embeddings in categorization tasks.
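The idea of comparing words through sparse binary vectors can be illustrated with a minimal sketch. It assumes each vector is stored as the set of indices whose value is one, and it uses Jaccard overlap and Euclidean distance as example measures; the abstract does not say which measure the authors use, and the words and indices below are invented for illustration.

```python
import math


def jaccard(a, b):
    """Jaccard overlap between two sparse binary vectors given as index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean(a, b):
    """Euclidean distance between two 0/1 vectors given as index sets.

    For binary vectors the squared distance is the number of positions
    where the vectors differ, i.e. the size of the symmetric difference.
    """
    return math.sqrt(len(a ^ b))


# Hypothetical vectors: only the active (non-zero) indices are stored.
cat = frozenset({3, 17, 256, 1024})
dog = frozenset({3, 17, 300, 1024})
car = frozenset({5, 900, 2047, 4000})

similarity_cat_dog = jaccard(cat, dog)   # shares 3 of 5 distinct indices
similarity_cat_car = jaccard(cat, car)   # no shared indices
distance_cat_dog = euclidean(cat, dog)   # differs in 2 positions
```

Storing only the active indices keeps memory proportional to the number of ones rather than to the dimensionality, which is one plausible reason to prefer this layout for high-dimensional sparse vectors.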

    Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI

    Web archives are a window to view past versions of webpages. When a user requests a webpage on the live Web, such as http://tripadvisor.com/where_to_travel/, the webpage may not be found, which results in a HyperText Transfer Protocol (HTTP) 404 response. The user then may search for the webpage in a Web archive, such as the Internet Archive. Unfortunately, if this page had never been archived, the user will not be able to view the page, nor will the user gain any information on other webpages that have similar content in the archive, such as the archived webpage http://classy-travel.net. Similarly, if the user requests the webpage http://hokiesports.com/football/ from the Internet Archive, the user will only find the requested webpage, and the user will not gain any information on other webpages that have similar content in the archive, such as the archived webpage http://techsideline.com. In this research, we will build a model for selecting and ranking possible recommended webpages at a Web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. First, we detect semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. We measure the performance of each step using different techniques, including calculating the F1 measure of different tokenization methods and of the classification. We tested the model using human evaluation to determine if we could classify and find recommendations for a sample of requests from the Internet Archive’s Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than the shares of “do not agree” and “I do not know” responses. This indicates the reviewer is more likely to agree on the recommendations when selecting the full categorization. But when selecting the first level only, reviewers only agreed with 25.5% of the recommendations. This indicates that having deep level categorization improves the performance of finding relevant recommendations.
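The first step described above, detecting semantics in the URI, and its F1 evaluation can be sketched roughly as follows. This is only an illustration under stated assumptions: `tokenize_uri` is a hypothetical delimiter-based tokenizer, not one of the tokenization methods evaluated in the work, and the reference tokens in the example are invented.

```python
import re
from urllib.parse import urlparse


def tokenize_uri(uri):
    """Split a URI's host and path into lowercase word tokens.

    Assumption: tokens are separated by any non-alphanumeric character,
    a leading "www." is ignored, and the top-level domain is dropped.
    """
    parsed = urlparse(uri)
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    labels = host.split(".")[:-1]  # drop the last label (e.g. "com")
    text = " ".join(labels) + " " + parsed.path.lower()
    return [token for token in re.split(r"[^a-z0-9]+", text) if token]


def f1_score(predicted, reference):
    """F1 of a predicted token set against a reference token set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


tokens = tokenize_uri("http://hokiesports.com/football/")
# A delimiter-based tokenizer cannot split the compound "hokiesports",
# so it scores poorly against a (hypothetical) word-level reference.
score = f1_score(tokens, ["hokie", "sports", "football"])
```

The example shows why the choice of tokenizer matters: splitting only on delimiters leaves concatenated words intact, which lowers both precision and recall against word-level reference tokens.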

    Code Smells and Refactoring: A Tertiary Systematic Review of Challenges and Observations

    In this paper, we present a tertiary systematic literature review of previous surveys, secondary systematic literature reviews, and systematic mappings. We identify the main observations (what we know) and challenges (what we do not know) on code smells and refactoring. We show that code smells and refactoring have a strong relationship with quality attributes, i.e., with understandability, maintainability, testability, complexity, functionality, and reusability. We argue that code smells and refactoring could be considered as the two sides of the same coin. Besides, we identify how refactoring affects quality attributes, more than code smells. We also discuss the implications of this work for practitioners, researchers, and instructors. We identify 13 open issues that could guide future research work. Thus, we want to highlight the gap between code smells and refactoring in the current state of software-engineering research. We hope that this work can help the software-engineering research community collaborate on future work on code smells and refactoring.

    A COGNITIVE APPROACH TO PHONOLOGY: EVIDENCE FROM SIGNED LANGUAGES

    This dissertation uses corpus data from ASL and Libras (Brazilian Sign Language) to investigate the distribution of a series of static and dynamic handshapes across the two languages. While traditional phonological frameworks argue handshape distribution to be a facet of well-formedness constraints and articulatory ease (Brentari, 1998), the data analyzed here suggest that the majority of handshapes cluster around schematic form-meaning mappings. Furthermore, these schematic mappings are shown to be motivated by both language-internal and language-external construals of formal articulatory properties and embodied experiential gestalts. Usage-based approaches to phonology (Bybee, 2001) and cognitively oriented constructional approaches (Langacker, 1987) have recognized that phonology is not modular. Instead, phonology is expected to interact with all levels of grammar, including semantic association. In this dissertation I begin to develop a cognitive model of phonology which views phonological content as similar in kind to other constructional units of language. I argue that, because formal units of linguistic structure emerge from the extraction of commonalities across usage events, phonological form is not immune from an accumulation of semantic associations. Finally, I demonstrate that appealing to such approaches allows one to account for both the idiosyncratic, unconventionalized mappings seen in creative language use and the motivation in highly conventionalized form-meaning associations.

    Multi-Word Terminology Extraction and Its Role in Document Embedding

    Automated terminology extraction is a crucial task in natural language processing and ontology construction. Termhood can be inferred using linguistic and statistical techniques. This thesis focuses on the statistical methods. Inspired by feature selection techniques in document classification, we experiment with a variety of metrics including PMI (point-wise mutual information), MI (mutual information), and Chi-squared. We find that PMI is better at identifying the top keywords in a domain, but Chi-squared can recognize more keywords overall. Based on this observation, we propose a hybrid approach, called HMI, that combines the best of PMI and Chi-squared. HMI outperforms both PMI and Chi-squared. The result is verified by comparing the overlap between the extracted keywords and the author-identified keywords in arXiv data. When the corpora are computer science and physics papers, the top-100 hit rate can reach 0.96 for HMI. We also demonstrate that terminologies can improve document embeddings. In this experiment, we treat machine-identified multi-word terminologies as single words. Then we use the transformed text as input for the document embedding. Compared with the representations learnt from unigrams only, we observe a performance improvement of over 9.41% in F1 score on document classification tasks in arXiv data.
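The two baseline metrics named in this abstract, and the top-k hit rate used for verification, can be sketched from a 2x2 term/domain contingency table. The sketch uses the textbook definitions as commonly applied in feature selection, which is an assumption about the setup; the HMI combination is not specified in the abstract and is deliberately not reproduced. All function and parameter names are illustrative.

```python
import math


def pmi(n11, n10, n01, n00):
    """Pointwise mutual information (base 2) between a term and a domain.

    n11: in-domain documents containing the term
    n10: out-of-domain documents containing the term
    n01: in-domain documents without the term
    n00: out-of-domain documents without the term
    """
    total = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")  # the term never occurs in the domain
    return math.log2(n11 * total / ((n11 + n10) * (n11 + n01)))


def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for the same 2x2 table (all margins non-zero)."""
    total = n11 + n10 + n01 + n00
    numerator = total * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator


def top_k_hit_rate(ranked_terms, reference_keywords, k):
    """Fraction of the k best-ranked terms found among reference keywords."""
    top = ranked_terms[:k]
    return sum(term in reference_keywords for term in top) / len(top)


# A term that appears in every in-domain document and nowhere else.
perfect_pmi = pmi(10, 0, 0, 10)
perfect_chi = chi_squared(10, 0, 0, 10)
# A term distributed independently of the domain.
independent_pmi = pmi(5, 5, 5, 5)
independent_chi = chi_squared(5, 5, 5, 5)
```

PMI depends only on the ratio of observed to expected co-occurrence, so it can rank rare but exclusive terms highly, whereas Chi-squared grows with the amount of evidence; this difference is consistent with the contrast the abstract reports between the two metrics.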