431 research outputs found
Exploration of Approaches to Arabic Named Entity Recognition
Abstract. The Named Entity Recognition (NER) task has attracted significant attention in Natural Language Processing (NLP) as it can enhance the performance of many NLP applications. In this paper, we compare English NER with Arabic NER in an experimental way to investigate the impact of using different classifiers and sets of features including language-independent and language-specific features. We explore the features and classifiers on five different datasets. We compare deep neural network architectures for NER with more traditional machine learning approaches to NER. We discover that most of the techniques and features used for English NER perform well on Arabic NER. Our results highlight the improvements achieved by using language-specific features in Arabic NER
How to match bilingual tweets ?
International audienceIn this paper, we propose a method that aligns comparable bilingual tweets which, not only takes into account the specificity of a Tweet, but treats also proper names, dates and numbers in two different languages. This permits to retrieve more relevant target tweets. The process of matching proper names between Arabic and English is a difficult task, because these two languages use different scripts. For that, we used an approach which projects the sounds of an English proper name into Arabic and aligns it with the most appropriate proper name. We evaluated the method with a classical measure and compared it to the one we developed. The experiments have been achieved on two parallel corpora and shows that our measure outperforms the baseline by 5.6% at R@1 recall
Fine-grained Arabic named entity recognition
This thesis addresses the problem of fine-grained NER for Arabic, which poses unique linguistic challenges to NER; such as the absence of capitalisation and short vowels, the complex morphology, and the highly in infection process. Instead of classifying the detected NE phrases into small sets of classes, we target a broader range (i.e. 50 fine-grained classes 'hierarchal-based of two levels') to increase the depth of the semantic knowledge extracted. This has increased the number of classes, complicating the task, when compared with traditional (coarse-grained) NER, because of the increase in the number of semantic classes and the decrease in semantic differences between fine-grained classes.
Our approach to developing fine-grained NER relies on two different supervised Machine Learning (ML) technologies (i.e. Maximum Entropy 'ME' and Conditional Random Fields 'CRF'), which require annotated training data in order to learn by extracting informative features. We develop a methodology which exploit the richness of Arabic Wikipedia (A W) in order to create a scalable fine-grained lexical resource and a corpus automatically. Moreover, two gold-standard created corpora from different genres were also developed to perform comparable evaluation. The thesis also developed a new approach to feature representation by relying on the dependency structure of the sentence to overcome the limitation of traditional window-based (i.e. n-gram) representation. Furthermore, by exploiting the richness of unannotated textual data to extract global informative features using word-level clustering technique was also achieved. Each contribution was evaluated via controlled experiment and reported using three commonly applied metrics, i.e. precision, recall and harmonic F-measure
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyzed five semantic processing tasks, e.g., word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal contribution mark is missed in the published version due
to the publication policies. Please contact Prof. Erik Cambria for detail
B!SON: A Tool for Open Access Journal Recommendation
Finding a suitable open access journal to publish scientific work is a complex task: Researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of Predatory Publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project
Minimally-supervised Methods for Arabic Named Entity Recognition
Named Entity Recognition (NER) has attracted much attention over the past twenty years, as a main task of Information Extraction. The current dominant techniques for addressing NER are supervised methods that can achieve high performance, but require new manually annotated data for every new domain and/or genre change. Our work focuses on approaches that make it possible to tackle new domains with minimal human intervention to identify Named Entities (NEs) in Arabic text. Specifically, we investigate two minimally-supervised methods: semi-supervised learning and distant learning. Our semi-supervised algorithm for identifying NEs does not require annotated training data or gazetteers. It only requires, for each NE type, a seed list of a few instances to initiate the learning process. Novel aspects of our algorithm include (i) a new way to produce and generalise the extraction patterns (ii) a new filtering criterion to remove noisy patterns (iii) a comparison of two ranking measures for determining the most reliable candidate NEs. Next, we present our methodology to exploit Wikipedia structure to automatically develop an Arabic NE annotated corpus. A novel mechanism is introduced, based on the high coverage of Wikipedia, in order to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. Neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised algorithms tend to have high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. Therefore, we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We used a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure, recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best minimally-supervised classifier
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
Can humain association norm evaluate latent semantic analysis?
This paper presents the comparison of word association norm created by a psycholinguistic experiment to association lists generated by algorithms operating on text corpora. We compare lists generated by Church and Hanks algorithm and lists generated by LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations
Multifaceted Geotagging for Streaming News
News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging.
To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing and two newly-created, large corpora of streaming news, show great performance gains over several competing prominent geotagging methods
Recommended from our members
Identifying and Processing Crisis Information from Social Media
Social media platforms play a crucial role in how people communicate, particularly during crisis situations such as natural disasters. People share and disseminate information on social media platforms that relates to updates, alerts, rescue and relief requests among other crisis relevant information. Hurricane Harvey and Hurricane Sandy saw over tens of millions of posts getting generated, on Twitter, in a short span of time. The ambit of such posts spreads across a wide range such as personal and official communications, and citizen sensing, to mention a few. This makes social media platforms a source of vital information to different stakeholders in crisis situations such as impacted communities, relief agencies, and civic authorities. However, the overwhelming volume of data generated during such times, makes it impossible to manually identify information relevant to crisis. Additionally, a large portion of posts in voluminous streams is not relevant or bears minimal relevance to crisis situations.
This has steered much research towards exploring methods that can automatically identify crisis relevant information from voluminous streams of data during such scenarios. However, the problem of identifying crisis relevant information from social media platforms, such as Twitter, is not trivial given the nature of unstructured text such as short text length and syntactic variations among other challenges. A key objective, while creating automatic crisis relevancy classification systems, is to make them adaptable to a wide range of crisis types and languages. Many related approaches rely on statistical features which are quantifiable properties and linguistic properties of the text. A general approach is to train the classification model on labelled data acquired from crisis events and evaluate on other crisis events. A key aspect missing from explored literature is the validity of crisis relevancy classification models when applied to data from unseen types of crisis events and languages. For instance, how would the accuracy of a crisis relevancy classification model, trained on earthquake type of events, change when applied to flood type of events. Or, how would a model perform when trained on crisis data in English but applied to data in Italian.
This thesis investigates these problems from a semantics perspective, where the challenges posed by diverse types of crisis and language variations are seen as the problems that can be tackled by enriching the data semantically. The use of knowledge bases such as DBpedia, BabelNet, and Wikipedia, for semantic enrichment of data in text classification problems has often been studied. Semantic enrichment of data through entity linking and expansion of context via knowledge bases can take advantage of connections between different concepts and thus enhance contextual coherency across crisis types and languages. Several previous works have focused on similar problems and proposed approaches using statistical features and/or non-semantic features. The use of semantics extracted through knowledge graphs has remained unexplored in building crisis relevancy classifiers that are adaptive to varying crisis types and multilingual data. Experiments conducted in this thesis consider data from Twitter, a micro-blogging social media platform, and analyse multiple aspects of crisis data classification. The results obtained through various analyses in this thesis demonstrate the value of semantic enrichment of text through knowledge graphs in improving the adaptability of crisis relevancy classifiers across crisis types and languages, in comparison to statistical features as often used in much of the related work
- …