Stacking classifiers for anti-spam filtering of e-mail
We evaluate empirically a scheme for combining classifiers, known as stacked
generalization, in the context of anti-spam filtering, a novel cost-sensitive
application of text categorization. Unsolicited commercial e-mail, or "spam",
floods mailboxes, causing frustration, wasting bandwidth, and exposing minors
to unsuitable content. Using a public corpus, we show that stacking can improve
the efficiency of automatically induced anti-spam filters, and that such
filters can be used in real-life applications.
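As a rough illustration of stacked generalization, the sketch below trains two base classifiers, collects their out-of-fold predictions, and fits a meta-level classifier on those predictions. Everything in it is an assumption made for illustration: the toy messages, the choice of base learners (a small naive Bayes model and a 1-nearest-neighbour model), the lookup-table meta-learner, and all function names. It does not reproduce the learners, corpus, or cost-sensitive evaluation of the work summarized above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Base learner 1: multinomial naive Bayes with add-one smoothing."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(doc.split())
    vocab = set(counts[0]) | set(counts[1])

    def predict(doc):
        scores = {}
        for y in (0, 1):
            total = sum(counts[y].values()) + len(vocab)
            scores[y] = math.log((priors[y] + 1) / (len(labels) + 2))
            for word in doc.split():
                if word in vocab:  # words never seen in training are ignored
                    scores[y] += math.log((counts[y][word] + 1) / total)
        return 1 if scores[1] > scores[0] else 0
    return predict

def train_1nn(docs, labels):
    """Base learner 2: 1-nearest neighbour under Jaccard word overlap."""
    word_sets = [set(d.split()) for d in docs]

    def predict(doc):
        words = set(doc.split())
        def similarity(other):
            union = words | other
            return len(words & other) / len(union) if union else 0.0
        best = max(range(len(word_sets)), key=lambda i: similarity(word_sets[i]))
        return labels[best]
    return predict

BASE_TRAINERS = [train_nb, train_1nn]

def train_stack(docs, labels, k=3):
    """Train base learners, then a meta-learner on their out-of-fold outputs."""
    meta_features = [None] * len(docs)
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        models = [trainer([docs[i] for i in train_idx],
                          [labels[i] for i in train_idx])
                  for trainer in BASE_TRAINERS]
        # Each message is scored only by models that never saw it in training.
        for i in range(fold, len(docs), k):
            meta_features[i] = tuple(m(docs[i]) for m in models)

    # Meta-learner: for each pattern of base predictions, remember label counts.
    table = defaultdict(Counter)
    for feats, y in zip(meta_features, labels):
        table[feats][y] += 1

    final_models = [trainer(docs, labels) for trainer in BASE_TRAINERS]

    def predict(doc):
        feats = tuple(m(doc) for m in final_models)
        if feats in table:
            return table[feats].most_common(1)[0][0]
        # Unseen pattern: fall back to a strict majority vote of base learners.
        return 1 if sum(feats) * 2 > len(feats) else 0
    return predict

# Toy data (1 = spam, 0 = legitimate), invented for this sketch.
DOCS = ["win free cash now", "meeting at noon today", "free prize win now",
        "project report for meeting", "cheap cash prize offer",
        "lunch at noon today", "win cheap offer now",
        "report for project lunch", "free cash offer now"]
LABELS = [1, 0, 1, 0, 1, 0, 1, 0, 1]

stacked = train_stack(DOCS, LABELS)
print(stacked("win free prize"), stacked("project meeting at noon"))
```

The meta-learner is trained on out-of-fold predictions so that it sees how the base classifiers behave on messages they were not trained on. A cost-sensitive variant would additionally weight the two kinds of misclassification differently, which this sketch does not attempt.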
United we stand: improving sentiment analysis by joining machine learning and rule based methods
In the past, we have successfully used machine learning approaches for sentiment analysis. In the course of those experiments, we observed that our machine learning method, although able to cope well with figurative language, could not always reach a confident decision about the polarity orientation of sentences, yielding erroneous evaluations. We support the conjecture that these cases, which bear mild figurativeness, could be better handled by a rule-based system. These two systems, acting complementarily, could bridge the gap between machine learning and rule-based approaches. Experimental results using the corpus of the Affective Text Task of SemEval ’07 provide evidence in favor of this direction.
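One way to picture the complementary arrangement described above is a confidence-gated fallback: the machine learning component answers when it is confident, and a rule-based component handles the remaining sentences. The sketch below is hypothetical throughout. The threshold value, the tiny lexicon, the negation rule, and the stand-in `toy_ml_predict` function are assumptions for illustration, not the systems evaluated in the abstract.

```python
# Hypothetical mini-lexicon and negators for the rule-based component.
LEXICON = {"soar": 1, "win": 1, "joy": 1, "crash": -1, "fear": -1, "loss": -1}
NEGATORS = {"no", "not", "never"}

def rule_predict(sentence):
    """Rule-based polarity: sum lexicon values, flipping the one after a negator."""
    score, negate = 0, False
    for token in sentence.lower().split():
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def toy_ml_predict(sentence):
    """Stand-in for a trained classifier; returns (label, confidence)."""
    known = {"stocks soar after strong earnings": ("positive", 0.9)}
    return known.get(sentence, ("negative", 0.5))

def combine(sentence, ml_predict, rules, threshold=0.6):
    """Use the ML label when confident enough, otherwise defer to the rules."""
    label, confidence = ml_predict(sentence)
    if confidence >= threshold:
        return label, "ml"
    return rules(sentence), "rules"

print(combine("stocks soar after strong earnings", toy_ml_predict, rule_predict))
print(combine("markets do not crash", toy_ml_predict, rule_predict))
```

In this sketch the second sentence falls below the threshold, so the rule-based component decides and its negation rule flips the polarity of "crash". How the uncertain cases are actually detected and routed in the described work is not specified in the abstract.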
Source authoring for multilingual generation of personalised object descriptions
We present the source authoring facilities of a natural language generation system that produces personalised descriptions of objects in multiple natural languages starting from language-independent symbolic information in ontologies and databases as well as pieces of canned text. The system has been tested in applications ranging from museum exhibitions to presentations of computer equipment for sale. We discuss the architecture of the overall system, the resources that the authors manipulate, the functionality of the authoring facilities, the system's personalisation mechanisms, and how they relate to source authoring. A usability evaluation of the authoring facilities is also presented, followed by more recent work on reusing information extracted from existing databases and documents, and supporting the OWL ontology specification language.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.