46 research outputs found

    A Hybrid Framework for Text Analysis

    2015 - 2016. In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The former, with a strong knowledge of language structures, lack engineering skills. The latter, expert in computing and mathematics, do not give weight to the basic mechanisms and structures of language. This discrepancy has grown in recent decades with the growth of computational resources and the gradual computerization of the world: Machine Learning technologies for Artificial Intelligence problem solving, which allow machines to learn from manually generated examples, have been used more and more often in Computational Linguistics to overcome the obstacle posed by language structures and their formal representation. The dichotomy has given rise to two main approaches to Computational Linguistics. Rule-based methods try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies or ontologies. Statistics-based methods, conversely, treat language as a group of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to give the machine the ability to learn these structures. One of the main problems is the lack of communication between these two approaches, due to the substantial differences between them: on the one hand, there is a strong focus on how language works and on its characteristics, with a tendency towards analytical and manual work; on the other hand, the engineering perspective sees language as an obstacle and regards algorithms as the fastest way to overcome it. 
However, the lack of communication is not only an incompatibility: following Harris, the best way to approach natural language could come from taking the best of both. At the moment, there is a large number of open-source tools that perform text analysis and Natural Language Processing. Most of these tools are based on statistical models and consist of separate modules that can be combined to create a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface), which makes them impossible to use for users without programming skills. Furthermore, the vast majority of these open-source tools support only English and, when Italian is included, their performance decreases significantly. Open-source tools for Italian, on the other hand, are very few. In this work we aim to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; it was built to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that initially includes the basic algorithms for different kinds of analysis. The modules perform the following tasks: Preprocessing Module: a module with which it is possible to load a text, normalize it or delete stop-words. As output, the module presents the processed text and the list of tokens and letters that compose the text, with their respective occurrence counts. Mr. Ling Module: a module that performs POS tagging and lemmatization. The module also returns the table of lemmas with their occurrence counts and the table quantifying the grammatical tags. 
Statistic Module: a module with which it is possible to calculate the Term Frequency and TF-IDF of tokens or lemmas, extract bigrams and trigrams, and export the results as tables. Semantic Module: a module that uses the Hyperspace Analogue to Language algorithm to calculate the semantic similarity between words. The module returns word-by-word similarity matrices, which can be exported and analyzed. Syntactic Module: a module that analyzes the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels. The objective of the Framework is to build an all-in-one platform for NLP that allows any kind of user to perform basic and advanced text analysis. To make the Framework accessible to users without specific computer science and programming skills, the modules have been provided with an intuitive GUI. The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms or techniques and, at the same time, on Lexicon-Grammar syntactic theory. In addition, it has been written in both Java and Python. The LG-Starship Framework has a simple Graphical User Interface but will also be released as separate modules that can be included independently in any NLP pipeline. There are many resources of this kind, but the large majority work only for English. There are very few free resources for Italian, and this work tries to meet this need by proposing a tool that can be used both by linguists or other scientists interested in language and text analysis who have no programming experience, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms. The Framework starts from a text or corpus written directly by the user or loaded from an external resource. 
The LG-Starship Framework workflow is described in the flowchart shown in fig. 1. The pipeline shows that the Pre-Processing Module is applied to the original imported or generated text in order to produce a clean and normalized preprocessed text. This module includes a function for text splitting, a stop-word list and a tokenization method. Either the Statistic Module or the Mr. Ling Module can be applied to the preprocessed text. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF and n-gram extraction, produces as output databases of lexical and numerical data that can be used to produce charts or perform further external analysis. The second is divided into two main tasks. A POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-Of-Speech tagging and produces an annotated text. A lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about syntactic and semantic properties. This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis carried out by the Syntactic Module and the Semantic Module. The first relies on Lexicon-Grammar Theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science. Its objective is to produce a Dependency Graph of the sentences that compose the text. The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text. This workflow has been included in two different experiments involving two User Generated Corpora. 
The first experiment is a statistical study of the language of Rap Music in Italy through the analysis of a large corpus of Rap song lyrics downloaded from online databases of user-generated lyrics. The second experiment is a Feature-Based Sentiment Analysis project performed on user product reviews. For this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed in past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of Verbs, Adjectives, Adverbs and Nouns. These two experiments show how the linguistic framework can be applied to different levels of analysis and can produce both qualitative and quantitative data. As for the results obtained, the Framework, which is only at a Beta Version, achieves fairly good results in terms of both processing time and precision. Nevertheless, the work is far from complete. More algorithms will be added to the Statistic Module and the Syntactic Module will be completed. The GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published. [edited by author] XV n.s
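
The Semantic Module described above relies on the Hyperspace Analogue to Language (HAL) algorithm. The following is a minimal Python sketch of the general HAL idea only, not the LG-Starship code: the function names `hal_vectors` and `cosine`, the window size and the toy sentence are all assumptions made for illustration. Each word receives a vector built from weighted counts of the words that precede it (its matrix row) and follow it (its matrix column), with closer words weighted more heavily.

```python
import math


def hal_vectors(tokens, window=5):
    """Build HAL-style vectors (hypothetical helper, not the LG-Starship API).

    matrix[t][c] accumulates a weight each time word c appears within
    `window` positions BEFORE word t; adjacent words get the largest weight.
    """
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    for i, target in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            # Linear ramp: distance 1 -> window, distance `window` -> 1.
            matrix[index[target]][index[tokens[j]]] += window - distance + 1
    vectors = {}
    for word in vocab:
        k = index[word]
        row = matrix[k]                                # preceding contexts
        column = [matrix[r][k] for r in range(size)]   # following contexts
        vectors[word] = row + column
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Toy lemmatized input, invented for this example.
    text = "il gatto mangiare il topo il cane mangiare il osso".split()
    vectors = hal_vectors(text, window=3)
    print(cosine(vectors["gatto"], vectors["cane"]))
```

Computing `cosine` for every pair of words yields a word-by-word similarity matrix of the kind the abstract mentions; a real run would train on a large corpus rather than a single sentence.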

    La dimensione Testuale del Videogioco. Classificazione dei transcript dei videogiochi basata sul lessico

    In this work, we explore the textual dimension of video games. Despite their pronounced visual and interactive characteristics, video games can be regarded as documents due to their narrative and communicative elements. Our research delves into this textual dimension to automatically generate rating tags associated with offensive language, violence, and the presence of drugs. We utilized a dictionary of English slang, compiled from various online sources and manually annotated with four categories: Slang, Violence, Drugs, and Discrimination. The resulting electronic dictionary facilitated the automatic assignment of the three rating tags with high precision. It has also been employed to classify video games based on their lexical content. The two classification tasks – by rating tags and by lexical dimension – could pave the way for an automatic warning system capable of analyzing the full textual dimension of a video game

    Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery

    The automatic processing of medical language poses a challenge for computational linguists because of an intrinsic feature of this sub-code: its lexicon comprises a vast number of terms that appear infrequently in texts. In addition, the presence of many sub-domains that can co-occur in a single text complicates the collection of a training set for a supervised classification task. This paper tackles the problem of unsupervised classification of medical scientific papers based on hybrid Multiword Expression (MWE) discovery. We apply a morpho-semantic approach to extract medical domain terms and their semantic tags, in addition to the classic MWE discovery strategies. The collected MWEs are used to vectorize texts and generate a network of similarities among corpus documents. With this approach, we try to solve both problems caused by the features of the medical domain. The vast lexicon of low-frequency terms is dealt with by extracting many semantic tags with a small dictionary; the issue of co-occurring sub-domains is addressed by generating clusters of similarity values instead of a rigid classification.
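
The step of vectorizing texts by their MWEs and linking similar documents can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the method of the paper: the MWE list is taken as already discovered, matching is plain substring counting, and the function names, the cosine measure and the 0.5 threshold are hypothetical choices.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(text, mwes):
    """Count occurrences of each known MWE in a document (naive matching)."""
    lowered = text.lower()
    return Counter({m: lowered.count(m) for m in mwes if m in lowered})


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_network(docs, mwes, threshold=0.5):
    """Return weighted edges between documents whose similarity >= threshold."""
    vectors = {name: vectorize(text, mwes) for name, text in docs.items()}
    edges = {}
    for x, y in combinations(sorted(vectors), 2):
        score = cosine(vectors[x], vectors[y])
        if score >= threshold:
            edges[(x, y)] = score
    return edges


if __name__ == "__main__":
    # Invented toy documents and MWEs.
    docs = {
        "d1": "Heart failure and atrial fibrillation in elderly patients",
        "d2": "Atrial fibrillation therapy",
        "d3": "Bone marrow transplant outcomes",
    }
    mwes = ["heart failure", "atrial fibrillation", "bone marrow"]
    print(similarity_network(docs, mwes))
```

Because the output is a weighted graph rather than one label per document, a paper that mixes sub-domains can sit close to several groups at once, which is the motivation the abstract gives for avoiding a rigid classification.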

    Classificazione automatica di testi medici basata sulla terminologia di dominio

    Terminology is now a fruitful field of study and research, involving a variety of professionals who specialize in specialist communication and have advanced linguistic skills. Historically, terminology has a long tradition as an applied discipline, which during the twentieth century underwent a theoretical formalization that allowed it to be fully recognized as an autonomous discipline in academia as well. The 31st annual conference of the Associazione Italiana per la Terminologia, “Ieri e oggi: la terminologia e le sfide delle Digital Humanities”, organized in collaboration with the Dipartimento di Lingue e Letterature Straniere of the Università degli Studi di Verona and funded by the Progetto di Eccellenza (2018-2022) Le Digital Humanities applicate alle lingue e letterature straniere, aimed to offer a broad framework for reflection on the different studies possible in terminology.

    Domain embeddings for generating complex descriptions of concepts in Italian language

    In this work, we propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries. This resource is designed to bridge the gap between the continuous semantic values represented by distributional vectors and the discrete descriptions provided by general semantics theory. Recently, many researchers have focused on the connection between embeddings and a comprehensive theory of semantics and meaning. This often involves translating the representation of word meanings in Distributional Models into a set of discrete, manually constructed properties, such as semantic primitives or features, using neural decoding techniques. Our approach introduces an alternative strategy based on linguistic data. We have developed a collection of domain-specific co-occurrence matrices derived from two sources: a list of Italian nouns classified into four semantic traits and 20 concrete noun sub-categories, and Italian verbs classified by their semantic classes. In these matrices, the co-occurrence values for each word are calculated exclusively with a defined set of words relevant to a particular lexical domain. The resource includes 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface. Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge, such as a matrix based on location nouns and the concept of animal habitats. We assessed the utility of the resource through two experiments, achieving promising outcomes in both the automatic classification of animal nouns and the extraction of animal features.
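
A domain-specific co-occurrence matrix of the kind described above can be illustrated with a short Python sketch. It only shows the restriction of context words to a fixed domain list; the function name, the symmetric window of three tokens and the toy location nouns are assumptions for the example and are not taken from the resource itself.

```python
from collections import defaultdict


def domain_cooccurrence(sentences, domain_words, window=3):
    """Count, for every target word, co-occurrences with domain words only.

    `sentences` is a list of token lists; `domain_words` is the closed set
    of context words for one lexical domain (here, hypothetical locations).
    """
    domain = set(domain_words)
    matrix = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                # Context words outside the domain are ignored entirely.
                if j != i and tokens[j] in domain:
                    matrix[target][tokens[j]] += 1
    return matrix


if __name__ == "__main__":
    # Invented toy corpus and a made-up "location" domain.
    corpus = [
        ["il", "lupo", "vive", "nel", "bosco"],
        ["il", "pesce", "vive", "nel", "mare"],
    ]
    locations = ["bosco", "mare", "montagna"]
    matrix = domain_cooccurrence(corpus, locations)
    print(dict(matrix["lupo"]), dict(matrix["pesce"]))
```

Reading one row of such a matrix gives a discrete, interpretable description of a concept in that domain, for instance which location nouns an animal noun tends to occur with.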

    Domain Ontology, Probability and Virtual Reality

    Ontologies are powerful instruments for the high-level description of concepts, especially for Semantic Web applications or feature classification. In some cases, ontologies have been used to create high-level descriptions of Virtual World objects in order to simplify the definition of a Virtual Reality. In this paper, we describe a hybrid methodology that, starting from a domain ontology, provides the basis for creating a Virtual Reality that could train workers in the correct use of protection tools. In addition, starting from real data, we generate a Bayesian Network that calculates the probability of death or injury in case of misconduct.

    Extract Similarities from Syntactic Contexts: a Distributional Semantic Model Based on Syntactic Distance

    Distributional Semantics (DS) models are based on the idea that two words which appear in similar contexts, i.e. similar neighborhoods, have similar meanings. This concept was originally presented by Harris in his Distributional Hypothesis (DH) (Harris 1954). Even though DH forms the basis of the majority of DS models, Harris states in later works that only syntactic analysis can allow for a more precise formulation of the neighborhoods involved: the arguments and the operators. In this work, we present a DS model based on the concept of Syntactic Distance, inspired by a study of Harris’s theories concerning the syntactic-semantic interface. In our model, the context of each word is derived from its dependency network generated by a parser. With this strategy, the co-occurring terms of a target word are calculated on the basis of their syntactic relations, which are also preserved in the event of syntactic transformations. The model, named Syntactic Distance as Word Window (SD-W2), has been tested on three state-of-the-art tasks (Semantic Distance, Synonymy and Single Word Priming) and compared with other classical DS models. In addition, the model has been subjected to a new test based on Operator-Argument selection. Although the results obtained by SD-W2 do not always reach those of modern contextualized models, they are often above average and, in many cases, comparable with the results of GloVe or BERT.
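
The idea of using syntactic distance instead of a linear word window can be sketched in Python as below. This is a hedged illustration, not the SD-W2 implementation: the abstract does not specify how distances are computed or weighted, so the sketch assumes an undirected dependency graph and defines the context of a word as every word reachable within `max_distance` dependency arcs. The dependency edges are written by hand and the function name is hypothetical.

```python
from collections import defaultdict, deque


def syntactic_contexts(edges, max_distance=1):
    """Map each word to its context words and their dependency-graph distance.

    `edges` is a list of (head, dependent) pairs from a parsed sentence.
    Distances are computed with a breadth-first search over undirected arcs.
    """
    graph = defaultdict(set)
    for head, dependent in edges:
        graph[head].add(dependent)
        graph[dependent].add(head)
    contexts = {}
    for word in list(graph):
        distances = {word: 0}
        queue = deque([word])
        while queue:
            current = queue.popleft()
            if distances[current] == max_distance:
                continue  # do not expand beyond the syntactic window
            for neighbour in graph[current]:
                if neighbour not in distances:
                    distances[neighbour] = distances[current] + 1
                    queue.append(neighbour)
        contexts[word] = {w: d for w, d in distances.items() if w != word}
    return contexts


if __name__ == "__main__":
    # Hand-written parse of "the dog chased a cat" (illustrative only).
    parse = [("chased", "dog"), ("chased", "cat"), ("dog", "the"), ("cat", "a")]
    print(syntactic_contexts(parse, max_distance=2)["dog"])
```

In this toy parse, "dog" and "cat" are three positions apart in the sentence but only two arcs apart in the graph, and that graph distance is unchanged if the sentence is passivized, provided the parser attaches both nouns to the verb. This mirrors the abstract's point that syntactic relations are preserved under transformations.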
