KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free text is now abundant on the World Wide Web, often linked to commercial enterprises and government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the right intelligent tools have been identified and applied. Text mining, for example, can help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has many useful applications, including, but not restricted to, e-mail spam detection, web page content filtering, and automatic message routing. This paper presents an overview of the King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project along with some preliminary results. The project will contribute to a better understanding and elaboration of Arabic text classification techniques.
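The classification process described above can be illustrated with a minimal sketch using scikit-learn (this is an assumed, generic pipeline, not the KACST project's actual system): documents are mapped to TF-IDF features and a Naive Bayes classifier assigns each text to a predefined category, here spam versus non-spam, one of the applications the abstract mentions.

```python
# Minimal text-classification sketch (assumed pipeline, not KACST's):
# TF-IDF features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: (text, category) pairs.
train_texts = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for tomorrow", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Fit a pipeline that vectorizes the text and trains the classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Assign a category to an unseen message.
print(clf.predict(["claim your free offer"])[0])
```

The same pipeline applies to Arabic text once a suitable tokenizer is supplied to the vectorizer; the linguistic features are whatever terms the vectorizer extracts.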
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Previously proposed distributed architectures for asynchronous training either focus on scaling vocabulary size and dimensionality or suffer from expensive synchronization latencies.
In this paper, we propose a scalable approach that trains word embeddings by instead partitioning the input space, in order to scale to massive text corpora without sacrificing the quality of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes, and models trained by our distributed procedure achieve comparable, and sometimes up to 45% better, performance on a variety of NLP benchmarks while requiring a fraction of the time taken by the baseline approach. Finally, we also show that the method is robust to missing words in sub-models and can effectively reconstruct word representations. Comment: This paper contains 9 pages and has been accepted at WSDM201
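The final merge phase the abstract describes can be sketched in simplified form (an assumed illustration, not the paper's exact algorithm: real sub-models trained independently generally need alignment before their vector spaces can be combined). Sub-models trained on disjoint corpus partitions are folded into one table; words shared across sub-models are averaged, and words missing from some sub-models are kept from whichever sub-model contains them.

```python
# Simplified merge-phase sketch (assumed, not the paper's algorithm).
import numpy as np

def merge_submodels(submodels):
    """Merge a list of {word: vector} sub-models into one embedding table.

    Vectors for words seen in several sub-models are averaged; words
    present in only one sub-model are carried over unchanged, so no
    word is lost even when sub-models cover different vocabularies.
    """
    collected = {}
    for model in submodels:
        for word, vec in model.items():
            collected.setdefault(word, []).append(vec)
    return {w: np.mean(vs, axis=0) for w, vs in collected.items()}

# Two toy sub-models from different corpus partitions (3-d vectors).
m1 = {"cat": np.array([1.0, 0.0, 0.0]), "dog": np.array([0.0, 1.0, 0.0])}
m2 = {"cat": np.array([0.0, 0.0, 1.0]), "fish": np.array([0.0, 0.0, 2.0])}

merged = merge_submodels([m1, m2])
print(merged["cat"])  # average of the two "cat" vectors
```

Because only this merge step touches all partitions, the per-partition training itself needs no parameter synchronization, which is the property the abstract highlights.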
Automatic Arabic Text Classification
Automated document classification is an important text mining task, especially given the rapid growth in the number of online documents in the Arabic language. Text classification aims to automatically assign text to a predefined category based on linguistic features. Such a process has many useful applications, including, but not restricted to, e-mail spam detection, web page content filtering, and automatic message routing. This paper presents the results of document-classification experiments on seven different Arabic corpora using a statistical methodology. The performance of two popular classification algorithms in classifying these corpora has been evaluated.
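An evaluation of the kind this abstract describes can be sketched with scikit-learn (a hedged illustration: the abstract does not name its two algorithms or its methodology here, so Naive Bayes and a linear SVM are assumed stand-ins, scored by cross-validation on a toy English corpus in place of the seven Arabic corpora).

```python
# Sketch of comparing two classifiers on one corpus (assumed setup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus with two categories: sport vs. economy.
texts = [
    "the team won the match", "a great goal in the final",
    "the player scored twice", "the league standings changed",
    "stock prices rose sharply", "the market closed higher",
    "inflation figures were released", "the bank cut interest rates",
]
labels = ["sport"] * 4 + ["econ"] * 4

# Score each candidate algorithm with the same features and folds.
for name, model in [("NaiveBayes", MultinomialNB()),
                    ("LinearSVM", LinearSVC())]:
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=2)
    print(name, scores.mean())
```

Running the same pipeline over each corpus and comparing mean scores gives the kind of per-corpus, per-algorithm comparison the paper reports.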