16,827 research outputs found
A Machine Learning Approach For Opinion Holder Extraction In Arabic Language
Opinion mining aims at extracting useful subjective information from reliable
amounts of text. Opinion mining holder recognition is a task that has not been
considered yet in Arabic Language. This task essentially requires deep
understanding of clauses structures. Unfortunately, the lack of a robust,
publicly available, Arabic parser further complicates the research. This paper
presents a leading research for the opinion holder extraction in Arabic news
independent from any lexical parsers. We investigate constructing a
comprehensive feature set to compensate the lack of parsing structural
outcomes. The proposed feature set is tuned from English previous works coupled
with our proposed semantic field and named entities features. Our feature
analysis is based on Conditional Random Fields (CRF) and semi-supervised
pattern recognition techniques. Different research models are evaluated via
cross-validation experiments achieving 54.03 F-measure. We publicly release our
own research outcome corpus and lexicon for opinion mining community to
encourage further research
An automatically built named entity lexicon for Arabic
We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWNâs instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from
95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold
ANNOTATION MODEL FOR LOANWORDS IN INDONESIAN CORPUS: A LOCAL GRAMMAR FRAMEWORK
There is a considerable number for loanwords in Indonesian language as it has been,
or even continuously, in contact with other languages. The contact takes place via different
media; one of them is via machine readable medium. As the information in different languages
can be obtained by a mouse click these days, the contact becomes more and more intense. This
paper aims at proposing an annotation model and lexical resource for loanwords in
Indonesian. The lexical resource is applied to a corpus by a corpus processing software called
UNITEX. This software works under local grammar framewor
- âŚ