Search CORE

15,461 research outputs found

KACST Arabic Text Classification Project: Overview and Preliminary Results

Author: Al-Rajeh A.
Alharbi S.
Almuhareb A.
Althubaity A.
Khorsheed M.
Publication venue
Publication date: 01/01/2008
Field of study

Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has different useful applications including, but not restricted to, E-Mail spam detection, web pages content filtering, and automatic message routing. In this paper an overview of King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project will be illustrated along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques

Southampton (e-Prints Soton)

Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning

Author: Chandar Sarath
Khapra Mitesh M.
Rajendran Janarthanan
Ravindran Balaraman
Publication venue
Publication date: 01/01/2016
Field of study

Recently there has been a lot of interest in learning common representations for multiple views of data. Typically, such common representations are learned using a parallel corpus between the two views (say, 1M images and their English captions). In this work, we address a real-world scenario where no direct parallel data is available between two views of interest (say,

V_1

and

V_2

) but parallel data is available between each of these views and a pivot view (

V_3

). We propose a model for learning a common representation for

V_1

V_2

and

V_3

using only the parallel data available between

V_1V_3

and

V_2V_3

. The proposed model is generic and even works when there are

n

views of interest and only one pivot view which acts as a bridge between them. There are two specific downstream applications that we focus on (i) transfer learning between languages

L_1

L_2

,...,

L_n

using a pivot language

L

and (ii) cross modal access between images and a language

L_1

using a pivot language

L_2

. Our model achieves state-of-the-art performance in multilingual document classification on the publicly available multilingual TED corpus and promising results in multilingual multimodal retrieval on a new dataset created and released as a part of this work.Comment: Published at NAACL-HLT 201

arXiv.org e-Print Archive

Crossref

PolyPublie

EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

Author: A Bruns
AM Azmi
BS Wasike
D Bodoff
D Elsweiler
Hind Almerekhi
J Benhardus
JL Fleiss
JR Landis
K Darwish
M Efron
M Rowe
M Sanderson
Maram Hasanain
Mucahid Kutlu
Reem Suwaileh
RL Brennan
Tamer Elsayed
W Magdy
Zhang Y
Publication venue
Publication date: 21/08/2017
Field of study

This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR , the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets

arXiv.org e-Print Archive

Qatar University Institutional Repository

Crossref

The CoNLL 2007 shared task on dependency parsing

Author: Hall Johan
Kübler Sandra
McDonald Ryan
Nilsson Jens
Nivre Joakim
Riedel Sebastian
Yuret Deniz
Publication venue
Publication date: 01/01/2007
Field of study

The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results

UCL Discovery

Hochschulschriftenserver - Universität Frankfurt am Main

TA-COS 2018 : 2nd Workshop on Text Analytics for Cybersecurity and Online Safety : Proceedings

Author: De Pauw Guy
Desmet Bart
Lefever Els
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2018
Field of study

Ghent University Academic Bibliography

FROM MARTO TO MARFELINO, A SHIFT IN NAMING IN GOTPUTUK VILLAGE

Author: Nurhayat Nurhayat
Publication venue
Publication date: 05/07/2012
Field of study

This is a study of names in a village called Gotputuk. Naming is one of language manifestation. Therefore, studying the way naming is maintained or shifted can reflect the language maintenance and shift. Using 1,648 names as data, the study exposes that Javanese names are still maintained but they are influenced by Arabic names and urban names

Diponegoro University Institutional Repository

One script for two languages. Latin & Arabic in an early allographic papyrus

Author: D'Ottone Arianna
Internullo Dario
Publication venue: Fabrizio Serra
Publication date: 01/01/2018
Field of study

This contribution presents a unique papyrus letter in Latin script and Latin language and in Latin script and Arabic language that is possible to date, on palaeographic grounds, from the end of the 7th to the 9th century AD. This precious witness is exam- ined under the historical, graphical, linguistic and cultural point of view and its prove- nance is discussed accordingly. An edition of the whole text is provided and a number of correspondences in Arabic are suggested

Università degli Studi di Napoli Federico Il Open Archive

Archivio della ricerca- Università di Roma La Sapienza