Search CORE

16 research outputs found

Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web

Author: Alegría Loinaz Iñaki
Saralegi Urizar Xabier
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2007

This paper presents a system for the multilingual clustering of documents that deal with similar topics. Documents are represented with the vector space model, using linguistic criteria to select keywords and the tf-idf formula to compute their relevance. The document repository is updated through RSS feeds and HTML wrappers. Multilingual treatment follows a strategy based on bilingual dictionaries with disambiguation. Because the texts are scientific and technical in nature, the vectors are translated with technical dictionaries combined with general-purpose dictionaries. The results were evaluated manually in order to estimate the precision of the system. This work was funded by the Department of Industry of the Basque Government (projects Dokusare SA-2005/00272 and Dokusare SA-2006/00167).
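The pipeline this abstract describes (keyword vectors weighted by tf-idf, translated through a bilingual dictionary, then compared) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy dictionary `ES_EN`, the smoothed idf variant, the use of cosine similarity as the comparison measure, and the function names. The abstract does not specify these details, and the sketch omits keyword selection and disambiguation.

```python
import math
from collections import Counter

# Hypothetical toy Spanish-to-English dictionary (not taken from the paper).
ES_EN = {"modelo": "model", "espacio": "space",
         "vectorial": "vector", "documento": "document"}

def translate(tokens, dictionary):
    """Map each token through the dictionary; unknown tokens stay unchanged."""
    return [dictionary.get(t, t) for t in tokens]

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        size = len(doc) or 1  # avoid division by zero for empty documents
        # Smoothed idf; an assumed variant, not necessarily the paper's formula.
        vectors.append({t: (c / size) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

english_doc = ["vector", "space", "model", "document"]
spanish_doc = translate(["modelo", "espacio", "vectorial", "documento"], ES_EN)
unrelated_doc = ["protein", "folding", "simulation"]

vectors = tfidf_vectors([english_doc, spanish_doc, unrelated_doc])
same_topic = cosine(vectors[0], vectors[1])
different_topic = cosine(vectors[0], vectors[2])
```

In this toy setup the translated Spanish document shares all of its terms with the English one, so `same_topic` is far higher than `different_topic`. A real system would also need to handle words with several candidate translations, which is where disambiguation comes in.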

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Hierarchical multi-label news article classification with distributed semantic model based features

Author: Ivana Clairine Irsan
Masayu Leylia Khodra
Publication venue: Universitas Ahmad Dahlan, Kampus 3
Publication date: 01/03/2019

Automatic news categorization is essential for handling the classification of multi-label news articles in online portals. This research applies several methods that may improve the performance of a hierarchical multi-label classifier for Indonesian news articles. The first method uses a Convolutional Neural Network (CNN) to build the top-level classifier. The second method aims to improve classification performance by averaging the word vectors obtained from a distributed semantic model. The third method combines lexical and semantic information to extract document features by multiplying word term frequency (lexical) by the word vector average (semantic). The model built using Calibrated Label Ranking as the multi-label classification method and trained with the Naïve Bayes algorithm achieved the best F1-measure, 0.7531. This classifier also used the product of word term frequency and the average of word vectors as features. This configuration improved multi-label classification performance by 4.25% compared to the baseline. The distributed semantic model that gave the best performance in this experiment was obtained from a 300-dimension word2vec model of Wikipedia articles. The performance of the multi-label classification model is also influenced by the news release date: a gap between the periods covered by the training and testing data decreases the model's performance.
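The third feature type in this abstract, term frequency combined with averaged word vectors, admits more than one reading. The sketch below shows one plausible interpretation, a term-frequency-weighted average of word vectors. The 3-dimensional embeddings in `EMB`, the example tokens, and the function name are hypothetical; the abstract reports 300-dimension word2vec vectors and does not give the exact formula.

```python
from collections import Counter

# Hypothetical 3-dimensional embeddings for illustration only.
EMB = {
    "bola": [1.0, 0.0, 0.0],
    "gol": [0.0, 1.0, 0.0],
    "pajak": [0.0, 0.0, 1.0],
}

def tf_weighted_average(tokens, embeddings):
    """Average the word vectors of a document, weighting each by its term frequency.

    Tokens without an embedding are ignored. Returns a zero vector when no
    token in the document has an embedding.
    """
    dim = len(next(iter(embeddings.values())))
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * dim
    feature = [0.0] * dim
    for word, count in counts.items():
        for i, value in enumerate(embeddings[word]):
            feature[i] += count * value  # lexical weight times semantic vector
    return [x / total for x in feature]

doc_feature = tf_weighted_average(["bola", "bola", "gol", "unknown"], EMB)
```

Here "bola" occurs twice and "gol" once, so the resulting feature leans toward the "bola" vector. The feature vector could then be passed to any classifier, such as the Naïve Bayes model the abstract mentions.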

International Journal of Advances in Intelligent Informatics

Directory of Open Access Journals

International Journal of Advances in Intelligent Informatics (IJAIN)

Novelty and redundancy detection in adaptive filtering

Author: Jamie Callan
Thomas Minka
Yi Zhang
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2003

Crossref

A model for Anticipatory Event Detection

Author: CHANG Kuiyu
HE Qi
LIM Ee Peng
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2006

Crossref

Institutional Knowledge at Singapore Management University

Analyzing feature trajectories for event detection

Author: CHANG Kuiyu
HE Qi
LIM Ee Peng
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2007

Crossref

Institutional Knowledge at Singapore Management University

Automatic online news topic ranking using media focus and user attention based on aging theory

Author
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2008

Crossref

Anticipatory event detection via classification

Author: C Cortes
Ee-Peng Lim
J Allan
Kuiyu Chang
Qi He
TJ Strader
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2007

Crossref

Institutional Knowledge at Singapore Management University

Modeling Anticipatory Event Transitions

Author: CHANG Kuiyu
LIM Ee Peng
HE Qi
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2010

Institutional Knowledge at Singapore Management University

Keep it simple with time: A reexamination of probabilistic topic detection models

Author: Banerjee Arindam
CHANG Kuiyu
HE Qi
LIM Ee Peng
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2010

Institutional Knowledge at Singapore Management University

Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection

Author: Joe Carthy
Nicola Stokes
Publication venue: ACM Press
Publication date: 01/01/2001

In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate whether a composite representation can improve the online detection of novel events in a stream of broadcast news stories. This classification process, known as first story detection (FSD), or in the Topic Detection and Tracking (TDT) pilot study as online new event detection [1], is one of three main classification tasks defined by the TDT initiative. Our composite document representation consists of a semantic representation (based on the lexical chains derived from a text) and a syntactic representation (using proper nouns). Using the TDT1 evaluation methodology, we evaluate a number of document representation combinations using these document classifiers.
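The idea of fusing evidence from two document representations for first story detection can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: lexical chains are replaced by plain sets of topic terms, Jaccard overlap stands in for whatever similarity measure the paper uses, and the `weight` and `threshold` values, field names, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def combined_similarity(story, other, weight=0.5):
    """Weighted sum of the similarities from the two representations."""
    semantic = jaccard(story["semantic"], other["semantic"])
    syntactic = jaccard(story["proper_nouns"], other["proper_nouns"])
    return weight * semantic + (1 - weight) * syntactic

def detect_first_stories(stream, threshold=0.3, weight=0.5):
    """Flag a story as a first story when no earlier story is similar enough."""
    seen = []
    flags = []
    for story in stream:
        best = max((combined_similarity(story, old, weight) for old in seen),
                   default=0.0)
        flags.append(best < threshold)
        seen.append(story)
    return flags

stream = [
    {"semantic": {"earthquake", "damage", "rescue"}, "proper_nouns": {"Kobe", "Japan"}},
    {"semantic": {"earthquake", "rescue", "aid"}, "proper_nouns": {"Kobe", "Japan"}},
    {"semantic": {"election", "vote"}, "proper_nouns": {"Dublin"}},
]
flags = detect_first_stories(stream)
```

With these toy stories, the first and third are flagged as new events, while the second is treated as a follow-up to the first. Changing `weight` shifts how much the proper-noun evidence counts relative to the semantic evidence, which is the kind of combination the abstract sets out to evaluate.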

CiteSeerX