46 research outputs found
A Hybrid Framework for Text Analysis
2015 - 2016
In Computational Linguistics there is an essential dichotomy between Linguists and Computer Scientists. The former, with a strong knowledge of language structures, lack engineering skills. The latter, conversely, expert in computing and mathematics, do not assign value to the basic mechanisms and structures of language. Moreover, this discrepancy has increased in recent decades due to the growth of computational resources and to the gradual computerization of the world; Machine Learning technologies for solving Artificial Intelligence problems, which allow machines to learn, for example, from manually generated examples, have been used more and more often in Computational Linguistics to overcome the obstacle represented by language structures and their formal representation.
This dichotomy has given rise to two main approaches to Computational Linguistics, which respectively favour:
rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies or ontologies;
statistics-based methods which, conversely, treat language as a set of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to give the machine the ability to learn these structures.
One of the main problems is the lack of communication between these two different approaches, due to the substantial differences that characterize them: on the one hand there is a strong focus on how language works and on its characteristics, with a tendency towards analytical and manual work; on the other hand, the engineering perspective sees language as an obstacle and regards algorithms as the fastest way to overcome it.
However, this lack of communication is not simply an incompatibility: following Harris, the best way to approach natural language may be to take the best of both.
At the moment, there is a large number of open-source tools that perform text analysis and Natural Language Processing. Most of these tools are based on statistical models and consist of separate modules that can be combined to create a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface) and are therefore impossible to use for users without programming skills. Furthermore, the vast majority of these open-source tools support only English and, when Italian is included, the performance of the tools decreases significantly. On the other hand, open-source tools for the Italian language are very few.
In this work we want to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; it was built to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship.
The idea is to build a modular piece of software that initially includes the basic algorithms needed to perform different kinds of analysis. The modules perform the following tasks:
Preprocessing Module: a module that loads a text, normalizes it and removes stop-words. As output, the module returns the list of tokens and letters that compose the text, with their respective occurrence counts, together with the processed text.
Mr. Ling Module: a module that performs POS tagging and lemmatization. It also returns the table of lemmas with their occurrence counts and the table quantifying the grammatical tags.
Statistic Module: calculates the Term Frequency and TF-IDF of tokens or lemmas, extracts bi-gram and tri-gram units and exports the results as tables (an illustrative sketch of these statistics follows this list).
Semantic Module: uses the Hyperspace Analogue to Language algorithm to calculate the semantic similarity between words. The module returns word-by-word similarity matrices which can be exported and analyzed.
Syntactic Module: analyzes the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels.
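To give a concrete idea of what the Statistic Module computes, the following short Python sketch reproduces term frequency, TF-IDF and n-gram extraction over a toy tokenized corpus. It is purely illustrative: the function names, the toy documents and the plain TF-IDF weighting are assumptions made for the example and do not reproduce LG-Starship's actual code.

# Minimal sketch of the statistics described above (term frequency, TF-IDF,
# n-grams). Names and data are illustrative, not the framework's actual API.
import math
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each token in a single document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def tf_idf(corpus_tokens):
    """TF-IDF for every token of every document in a tokenized corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(t for doc in corpus_tokens for t in set(doc))
    return [{t: tf * math.log(n_docs / doc_freq[t])
             for t, tf in term_frequency(doc).items()}
            for doc in corpus_tokens]

def ngrams(tokens, n=2):
    """Contiguous n-gram units (bi-grams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [["il", "gatto", "dorme"], ["il", "cane", "dorme", "molto"]]
print(tf_idf(docs)[0])      # e.g. {'il': 0.0, 'gatto': 0.231, 'dorme': 0.0}
print(ngrams(docs[1], 3))   # [('il', 'cane', 'dorme'), ('cane', 'dorme', 'molto')]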
The objective of the Framework is to build an all-in-one platform for NLP that allows any kind of user to perform basic and advanced text analysis. To make the Framework accessible to users without specific computer science or programming skills, the modules have been provided with an intuitive GUI.
The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory. In addition, it has been written in both the Java and Python programming languages. The LG-Starship Framework has a simple Graphical User Interface but will also be released as separate modules which can be included independently in any NLP pipeline.
There are many resources of this kind, but the large majority work only for English. Free resources for the Italian language are very few, and this work tries to meet this need by proposing a tool that can be used both by linguists and other scholars interested in language and text analysis who have no knowledge of programming languages, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms.
The Framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship Framework workflow is described in the flowchart shown in fig. 1.
The pipeline shows that the Pre-Processing Module is applied to the original imported or generated text in order to produce a clean, normalized preprocessed text. This module includes a function for text splitting, a stop-word list and a tokenization method. The Statistic Module or the Mr. Ling Module can then be applied to the preprocessed text. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF and n-gram extraction, produces as output databases of lexical and numerical data which can be used to produce charts or to perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs the Part-Of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about syntactic and semantic properties.
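The tagger and its training corpus are not reproduced in this abstract; purely as an illustration of the POS-tagging step, the sketch below trains NLTK's averaged perceptron tagger on a tiny invented Italian sample (standing in for the Paisà Corpus) and applies it to a sentence. The sample sentences and tag labels are assumptions made for the example.

# Illustrative only: a toy averaged-perceptron POS tagger using NLTK's
# implementation. The tiny training sample stands in for the Paisà Corpus.
from nltk.tag.perceptron import PerceptronTagger

train_sents = [
    [("il", "DET"), ("gatto", "NOUN"), ("dorme", "VERB")],
    [("la", "DET"), ("ragazza", "NOUN"), ("legge", "VERB"),
     ("un", "DET"), ("libro", "NOUN")],
]

tagger = PerceptronTagger(load=False)   # start from an empty model
tagger.train(train_sents, nr_iter=10)   # averaged perceptron training

print(tagger.tag(["il", "libro", "dorme"]))
# e.g. [('il', 'DET'), ('libro', 'NOUN'), ('dorme', 'VERB')]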
This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis carried out by the Syntactic Module and the Semantic Module.
The first relies on Lexicon-Grammar Theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science. Its objective is to produce a dependency graph of the sentences that compose the text.
The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text.
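As an informal illustration of the Hyperspace Analogue to Language technique referred to above (not the framework's implementation), the sketch below builds a distance-weighted co-occurrence matrix with a sliding window and compares two words by cosine similarity; the toy corpus and the window size are invented for the example.

# Rough HAL-style sketch: distance-weighted co-occurrence counts within a
# sliding window, then cosine similarity between word vectors.
import numpy as np

corpus = [["il", "gatto", "dorme", "sul", "divano"],
          ["il", "cane", "dorme", "sul", "tappeto"]]
window = 3

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, target in enumerate(sent):
        for d in range(1, window + 1):            # words to the right of the target
            if i + d < len(sent):
                M[index[target], index[sent[i + d]]] += window - d + 1  # closer = heavier

def vector(word):
    i = index[word]
    return np.concatenate([M[i, :], M[:, i]])     # row + column contexts, as in HAL

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(vector("gatto"), vector("cane")))    # similarity of the two nouns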
This workflow has been included in two different experiments involving two user-generated corpora.
The first experiment is a statistical study of the language of rap music in Italy, carried out through the analysis of a large corpus of rap song lyrics downloaded from online databases of user-generated lyrics.
The second experiment is a feature-based Sentiment Analysis project performed on user product reviews. For this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed over the past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of verbs, adjectives, adverbs and nouns.
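The polarized dictionaries themselves are not included in this abstract; the sketch below only illustrates the general idea of dictionary-based, feature-oriented sentiment scoring. The mini-lexicon, polarity values, feature terms and window size are invented examples, not the Salerno resources described above.

# Illustration of dictionary-based, feature-oriented sentiment scoring.
# Lexicon, scores and feature terms are invented, not the Salerno resources.
polarity = {"ottimo": 1.0, "buono": 0.5, "scarso": -0.5, "pessimo": -1.0}
features = {"batteria", "schermo", "prezzo"}

def feature_sentiment(tokens, window=2):
    """Attach the polarity of nearby opinion words to each feature mention."""
    scores = {}
    for i, tok in enumerate(tokens):
        if tok in features:
            context = tokens[max(0, i - window): i + window + 1]
            score = sum(polarity.get(w, 0.0) for w in context)
            scores[tok] = scores.get(tok, 0.0) + score
    return scores

review = ["la", "batteria", "è", "ottimo", "ma", "lo", "schermo", "è", "pessimo"]
print(feature_sentiment(review))   # {'batteria': 1.0, 'schermo': -1.0}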
These two experiments underline how the linguistic framework can be applied to different levels of analysis and used to produce both qualitative and quantitative data.
As for the results obtained, the Framework, which is still at a beta version, achieves fair results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete. More algorithms will be added to the Statistic Module and the Syntactic Module will be completed. The GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published.
La dimensione Testuale del Videogioco. Classificazione dei transcript dei videogiochi basata sul lessico
In this work, we explore the textual dimension of video games. Despite their pronounced visual and interactive characteristics, video games can be regarded as documents due to their narrative and communicative elements. Our research delves into this textual dimension to automatically generate rating tags associated with offensive language, violence, and the presence of drugs. We utilized a dictionary of English slang, compiled from various online sources and manually annotated with four categories: Slang, Violence, Drugs, and Discrimination. The resulting electronic dictionary facilitated the automatic assignment of the three rating tags with high precision. It has also been employed to classify video games based on their lexical content. The two classification tasks – by rating tags and by lexical dimension – could pave the way for an automatic warning system capable of analyzing the full textual dimension of a video game
Unsupervised Classification of Medical Documents Through Hybrid MWEs Discovery
The automatic processing of medical language represents a challenge for computational linguists due to the intrinsic features of this sub-code: its lexicon comprises a vast number of terms that appear infrequently in texts. In addition, the presence of many sub-domains that can coincide in a single text complicates the collection of a training set for a supervised classification task. This paper tackles the problem of unsupervised classification of medical scientific papers based on hybrid Multiword Expression discovery. We apply a morpho-semantic approach to extract medical domain terms and their semantic tags in addition to the classic MWE discovery strategies. The collected MWEs are used to vectorize texts and generate a network of similarities among corpus documents. With this approach, we try to solve both problems caused by the features of the medical domain: the vast lexicon of low-frequency terms is dealt with by extracting many semantic tags with a small dictionary, and the issue of co-occurring sub-domains is solved by generating clusters of similarity values instead of a rigid classification
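The MWE discovery pipeline itself is not reported here; as a rough sketch of the final step described in this abstract, the code below vectorizes documents by counts of already-extracted multiword expressions and links documents whose cosine similarity exceeds a threshold. The MWE list, the documents and the threshold are invented for illustration and are not the paper's data.

# Sketch of the vectorization and similarity-network step: documents are
# represented by counts of already-discovered MWEs and then linked when
# their cosine similarity exceeds an (invented) threshold.
import numpy as np

mwes = ["heart failure", "blood pressure", "magnetic resonance"]
docs = [
    "patients with heart failure and high blood pressure",
    "blood pressure monitoring in heart failure therapy",
    "magnetic resonance imaging of the knee",
]

def mwe_vector(text):
    return np.array([text.count(m) for m in mwes], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

vectors = [mwe_vector(d) for d in docs]
threshold = 0.5
edges = [(i, j, round(cosine(vectors[i], vectors[j]), 2))
         for i in range(len(docs)) for j in range(i + 1, len(docs))
         if cosine(vectors[i], vectors[j]) > threshold]
print(edges)   # document pairs linked in the similarity network, e.g. [(0, 1, 1.0)]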
Classificazione automatica di testi medici basata sulla terminologia di dominio
Terminology is by now a fruitful field of study and research in which a variety of professionals specialized in specialist communication and equipped with advanced linguistic skills operate. From a historical point of view, terminology has a long tradition as an applied discipline, one which during the twentieth century underwent a theoretical formalization that allowed it to be fully recognized as an autonomous discipline in the academic field as well. The XXXI annual conference of the Associazione Italiana per la Terminologia, "Ieri e oggi: la terminologia e le sfide delle Digital Humanities" ("Yesterday and today: terminology and the challenges of the Digital Humanities"), organized in collaboration with the Department of Foreign Languages and Literatures of the University of Verona and funded by the Excellence Project (2018-2022) Le Digital Humanities applicate alle lingue e letterature straniere, aimed to offer a broad framework of reflection on the various possible lines of study in terminology
Domain embeddings for generating complex descriptions of concepts in Italian language
In this work, we propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries. This resource is designed to bridge the gap between the continuous semantic values represented by distributional vectors and the discrete descriptions provided by general semantics theory. Recently, many researchers have focused on the connection between embeddings and a comprehensive theory of semantics and meaning. This often involves translating the representation of word meanings in Distributional Models into a set of discrete, manually constructed properties, such as semantic primitives or features, using neural decoding techniques. Our approach introduces an alternative strategy based on linguistic data. We have developed a collection of domain-specific co-occurrence matrices derived from two sources: a list of Italian nouns classified into four semantic traits and 20 concrete-noun sub-categories, and Italian verbs classified by their semantic classes. In these matrices, the co-occurrence values for each word are calculated exclusively with a defined set of words relevant to a particular lexical domain. The resource includes 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface. Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge, such as a matrix based on location nouns and the concept of animal habitats. We assessed the utility of the resource through two experiments, achieving promising outcomes in both the automatic classification of animal nouns and the extraction of animal features
Domain Ontology, Probability and Virtual Reality
Ontologies are powerful instruments for the high-level description of concepts, especially for Semantic Web applications or feature classification. In some cases, ontologies have been used to create high-level descriptions of Virtual World objects in order to simplify the definition of a Virtual Reality. In this paper, we describe a hybrid methodology that, starting from a domain Ontology, provides the basis for the creation of a Virtual Reality that could train workers in the correct use of protection tools. In addition, starting from real data, we generate a Bayesian Network that calculates the probability of death or injury in case of misconduct
Extract Similarities from Syntactic Contexts: a Distributional Semantic Model Based on Syntactic Distance
Distributional Semantics (DS) models are based on the idea that two words which appear in similar contexts, i.e. similar neighborhoods, have similar meanings. This concept was originally presented by Harris in his Distributional Hypothesis (DH) (Harris 1954). Even though DH forms the basis of the majority of DS models, Harris states in later works that only syntactic analysis can allow for a more precise formulation of the neighborhoods involved: the arguments and the operators.In this work, we present a DS model based on the concept of Syntactic Distance inspired by a study of Harris’s theories concerning the syntactic-semantic interface. In our model, the context of each word is derived from its dependency network generated by a parser. With this strategy, the co-occurring terms of a target word are calculated on the basis of their syntactic relations, which are also preserved in the event of syntactical transformations. The model, named Syntactic Distance as Word Window (SD-W2), has been tested on three state-of-the-art tasks: Semantic Distance, Synonymy and Single Word Priming, and compared with other classical DS models. In addition, the model has been subjected to a new test based on Operator-Argument selection. Although the results obtained by SD-W2 do not always reach those of modern contextualized models, they are often above average and, in many cases, they are comparable with the result of GLOVE or BERT
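The SD-W2 implementation is not included in this listing; purely as a loose illustration of counting co-occurrences by syntactic rather than linear distance, the sketch below parses a sentence with spaCy, treats the dependency tree as a graph and counts word pairs whose graph distance does not exceed a threshold. The spaCy model name, the sentence and the distance threshold are assumptions for the example, and this is not the authors' code.

# Loose illustration of syntactic-distance co-occurrence counting.
# Not the authors' SD-W2 code; model name, sentence and threshold are invented.
import spacy
import networkx as nx
from collections import Counter

nlp = spacy.load("it_core_news_sm")   # assumes the small Italian model is installed
doc = nlp("Il gatto nero dorme sul divano del salotto")

# Turn the dependency tree into an undirected graph over token indices.
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

max_distance = 2
cooc = Counter()
lengths = dict(nx.all_pairs_shortest_path_length(graph, cutoff=max_distance))
for i, neighbours in lengths.items():
    for j, d in neighbours.items():
        if 0 < d <= max_distance:
            cooc[(doc[i].text, doc[j].text)] += 1

print(cooc.most_common(5))   # most frequent syntactically-close word pairs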