
    A Google Trends spatial clustering approach for a worldwide Twitter user geolocation

    User location data is valuable for diverse social media analytics. In this paper, we address the non-trivial task of estimating worldwide city-level Twitter user locations using only historical tweets. We propose a purely unsupervised approach based on a synthetic geographic sampling of Google Trends (GT) city-level frequencies of tweet nouns and three clustering algorithms. The approach was validated empirically on a recently collected dataset with 3,268 worldwide city-level locations of Twitter users, obtaining competitive results compared with a state-of-the-art Word Distribution (WD) user location estimation method. The best overall results were achieved by the GT noun DBSCAN (GTN-DB) method, which is computationally fast and correctly predicts the ground-truth locations of 15%, 23%, 39% and 58% of the users for tolerance distances of 250 km, 500 km, 1,000 km and 2,000 km. The work of P. Cortez was supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020. We would also like to thank the anonymous reviewers for their helpful suggestions.
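    The abstract does not detail the GTN-DB pipeline itself, but its final step (density-based clustering of candidate geographic points and reporting the densest cluster) can be sketched. The following is an illustrative sketch only, assuming candidate (lat, lon) points already sampled from GT city-level frequencies, a hand-rolled DBSCAN with a haversine metric, and hypothetical parameter values (`eps_km=250`, `min_pts=2`):

    ```python
    import math

    def haversine_km(p, q):
        # Great-circle distance between two (lat, lon) points, in km.
        lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def dbscan(points, eps_km, min_pts):
        # Minimal DBSCAN: returns one cluster label per point (-1 = noise).
        n = len(points)
        labels = [None] * n
        cluster = -1
        for i in range(n):
            if labels[i] is not None:
                continue
            neighbours = [j for j in range(n)
                          if haversine_km(points[i], points[j]) <= eps_km]
            if len(neighbours) < min_pts:
                labels[i] = -1  # noise (may later be absorbed as a border point)
                continue
            cluster += 1
            labels[i] = cluster
            queue = [j for j in neighbours if j != i]
            while queue:
                j = queue.pop()
                if labels[j] == -1:
                    labels[j] = cluster  # border point: relabel, do not expand
                    continue
                if labels[j] is not None:
                    continue
                labels[j] = cluster
                nb = [k for k in range(n)
                      if haversine_km(points[j], points[k]) <= eps_km]
                if len(nb) >= min_pts:
                    queue.extend(nb)  # core point: expand the cluster
        return labels

    def estimate_location(candidates, eps_km=250, min_pts=2):
        # Predict the user's location as the centroid of the largest
        # DBSCAN cluster among the candidate (lat, lon) points.
        labels = dbscan(candidates, eps_km, min_pts)
        clusters = {}
        for p, l in zip(candidates, labels):
            if l >= 0:
                clusters.setdefault(l, []).append(p)
        if not clusters:
            return None  # all candidates were noise
        best = max(clusters.values(), key=len)
        lat = sum(p[0] for p in best) / len(best)
        lon = sum(p[1] for p in best) / len(best)
        return (lat, lon)
    ```

    For example, three candidate points around Lisbon plus one outlier in New York would yield a centroid near Lisbon, with the outlier discarded as noise.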

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep-learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods on various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains. [German version of the abstract, translated:] Recent advances in the field of natural language processing were achieved with deep-learning models. This led to a wide range of new research questions concerning the stability of such large systems and their applicability beyond well-studied tasks and datasets, e.g. information extraction for non-standard languages, but also for text domains and tasks for which little training data is available even in English.
In this work, we address these challenges and make important contributions in areas such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including missing training resources, unseen domains and language barriers. In particular, we propose solutions to close the domain gap between representation models, e.g. through domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our comprehensive evaluation demonstrates the performance of our methods on various word- and sentence-level classification tasks and underlines their robustness in challenging low-resource settings across languages and domains.
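The core meta-embedding idea, building one joint representation from several source embeddings, can be sketched minimally. The abstract does not specify the actual architecture, so the sketch below assumes a simple project-then-average scheme (a common meta-embedding baseline) with hypothetical projection matrices mapping each source embedding into a shared space:

```python
def project(vec, weights):
    # Linear map into the shared space: `weights` is a list of rows,
    # each the same length as `vec`.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def meta_embed(vectors, projections):
    # Project each source embedding (possibly of different dimension)
    # into a common space and average the results into one joint vector.
    projected = [project(v, P) for v, P in zip(vectors, projections)]
    dim = len(projected[0])
    return [sum(p[i] for p in projected) / len(projected) for i in range(dim)]
```

For instance, a 3-dimensional and a 2-dimensional embedding can both be projected to 2 dimensions and averaged into a single joint representation; in practice the projections would be learned, and richer schemes (e.g. attention-weighted combination) replace the plain average.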