Search CORE

6,712 research outputs found

Document Clustering with K-tree

Author: De Vries Christopher M.
Geva Shlomo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.Comment: 12 pages, INEX 200

arXiv.org e-Print Archive

Queensland University of Technology ePrints Archive

Extracting common sense knowledge from Wikipedia

Author: Halpin Harry
Klein Ewan
Suh Sangweon
Publication venue
Publication date: 01/01/2006
Field of study

Edinburgh Research Explorer

Use of Wikipedia Categories in Entity Ranking

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2007
Field of study

Wikipedia is a useful source of knowledge that has many applications in language processing and knowledge representation. The Wikipedia category graph can be compared with the class hierarchy in an ontology; it has some characteristics in common as well as some differences. In this paper, we present our approach for answering entity ranking queries from the Wikipedia. In particular, we explore how to make use of Wikipedia categories to improve entity ranking effectiveness. Our experiments show that using categories of example entities works significantly better than using loosely defined target categories

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Establishing a New State-of-the-Art for French Named Entity Recognition

Author: Dupont Yoann
Muller Benjamin
Romary Laurent
Sagot Benoît
Suárez Pedro Javier Ortiz
Publication venue
Publication date: 11/05/2020
Field of study

The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contain referential information, which complement the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Rennes 1

Uncertainty Detection as Approximate Max-Margin Sequence Labelling

Author: Dalianis Hercules
Eriksson Gunnar
Hassel Martin
Karlgren Jussi
Täckström Oscar
Velupillai Sumithra
Publication venue
Publication date: 01/01/2010
Field of study

This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence level weasel detection in the Wikipedia domain, we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO-encoding. In addition to surface level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features. Our official results for Task 1 for the biological domain are 85.2 F1-score, for the Wikipedia set 55.4 F1-score. For Task 2, our official results are 2.1 for the entire task with a score of 62.5 for cue detection. After resolving errors and final bugs, our final results are for Task 1, biological: 86.0, Wikipedia: 58.2; Task 2, scopes: 39.6 and cues: 78.5

CiteSeerX

Publikationer från Stockholms universitet

Publikationer från Uppsala Universitet

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive