Information retrieval on Turkish texts

Author: Balcik E.
Can F.
Kaynak C.
Kocberber S.
Ocalan H.C.
Vursavas O.M.
Publication venue: 'Wiley'
Publication date: 01/01/2008

In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word truncation approach, a word truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We investigate the effects of a range of search conditions on retrieval performance; these include scalability issues, query and document length effects, and the use of a stop-word list in indexing. © 2007 Wiley Periodicals, Inc.

Bilkent University Institutional Repository

First large-scale information retrieval experiments on Turkish texts

Author: Balcik E.
Can F.
Kaynak C.
Kocberber S.
Ocalan H.C.
Vursavas O.M.
Publication date: 01/01/2006

We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, and 72 ad hoc queries, and has a size of about 800 MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
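
The prefix-truncation idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `turkish_lower`, `truncate_stem`, and `stem_tokens` are hypothetical, the Turkish-aware lowercasing is a simplifying assumption, and only the prefix length of 5 comes from the abstract.

```python
# Minimal sketch of prefix-truncation stemming (illustrative only).
# Assumption: tokens are already split on whitespace and stripped of punctuation.

def turkish_lower(word: str) -> str:
    """Lowercase with the Turkish dotted/dotless i distinction (assumed handling)."""
    # In Turkish, "I" lowercases to "ı" and "İ" lowercases to "i".
    return word.replace("I", "ı").replace("İ", "i").lower()

def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Return the first `prefix_len` characters of the lowercased word."""
    return turkish_lower(word)[:prefix_len]

def stem_tokens(tokens: list[str], prefix_len: int = 5) -> list[str]:
    """Apply prefix truncation to every token in a list."""
    return [truncate_stem(t, prefix_len) for t in tokens]

if __name__ == "__main__":
    # Inflected forms of "kitap" (book) collapse to the same index term.
    print(stem_tokens(["kitaplar", "kitapları", "kitaplık"]))
    # → ['kitap', 'kitap', 'kitap']
```

Words shorter than the prefix length are left unchanged, so a short word such as "ev" is indexed as-is.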

Bilkent University Institutional Repository

Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning

Author: Chandar Sarath
Khapra Mitesh M.
Rajendran Janarthanan
Ravindran Balaraman
Publication date: 01/01/2016

Recently there has been a lot of interest in learning common representations for multiple views of data. Typically, such common representations are learned using a parallel corpus between the two views (say, 1M images and their English captions). In this work, we address a real-world scenario where no direct parallel data is available between two views of interest (say, V_1 and V_2) but parallel data is available between each of these views and a pivot view (V_3). We propose a model for learning a common representation for V_1, V_2 and V_3 using only the parallel data available between V_1V_3 and V_2V_3. The proposed model is generic and even works when there are n views of interest and only one pivot view which acts as a bridge between them. There are two specific downstream applications that we focus on: (i) transfer learning between languages L_1, L_2, ..., L_n using a pivot language L and (ii) cross modal access between images and a language L_1 using a pivot language L_2. Our model achieves state-of-the-art performance in multilingual document classification on the publicly available multilingual TED corpus and promising results in multilingual multimodal retrieval on a new dataset created and released as a part of this work. Comment: Published at NAACL-HLT 201

arXiv.org e-Print Archive

Crossref

PolyPublie

Opinion Mining on Non-English Short Text

Author: A Kennedy
AG Vural
B Liu
B Pang
CM Özsert
D Fragoudis
M Thelwall
R Dehkharghani
Publication date: 03/04/2017

As the type and the number of such venues increase, automated analysis of sentiment in textual resources has become an essential data mining task. In this paper, we investigate the problem of mining opinions from a collection of informal short texts. Both the positive and the negative sentiment strength of texts are detected. We focus on a non-English language that has few resources for text mining. This approach would help enhance sentiment analysis in languages where a list of opinionated words does not exist. We propose a new method that projects the text into dense, low-dimensional feature vectors according to the sentiment strength of the words. We detect the mixture of positive and negative sentiments on a multi-variant scale. Empirical evaluation of the proposed framework on Turkish tweets shows that our approach gets good results for opinion mining.

arXiv.org e-Print Archive

Crossref