112 research outputs found

    Trending topic extraction from social media

    Social media has become the first source of information for many people. The amount of information posted on social media daily has become so vast that it is difficult to track. One of the most popular social media applications is Twitter. Users follow many news accounts, public figures, and friends so that they can keep up with the latest events around them. Since dialect and writing style differ from one region to another, our objective in this research is to extract trending topics for an Egyptian Twitter user. In this way, the user can easily get a glimpse of the trending topics discussed by the people he follows. To find the best approach for achieving our objective, we investigate the document pivot and the feature pivot approaches. By applying the document pivot approach to the baseline data using the tf-itf (term frequency-inverse tweet frequency) representation, the repeated bisecting k-means clustering technique, and extraction of the most frequent n-grams from each cluster, we achieved a recall of 100% and an F1 measure of 0.8. Applying the feature pivot approach to the baseline data, using the content similarity algorithm to group related unigrams together, achieved a recall of 100% and an F1 measure of 0.923. To validate our results, we collected 12 data sets of different sizes (200, 400, 600, and 1200) from three different domains (sports, entertainment, and news) and applied both approaches to them. The average recall, precision, and F1 measure values obtained with the feature pivot approach are larger than those achieved with the document pivot approach. To make sure this difference is statistically significant, we applied a two-sample one-tailed paired t-test, which showed that the results are significantly better at a confidence level of 90%. The document pivot approach could extract the trending topics for an Egyptian Twitter user with an average recall of 0.714, average precision of 0.521, and average F1 measure of 0.556, versus average recall, precision, and F1 measure values of 0.981, 0.754, and 0.833, respectively, for the feature pivot approach.
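    The baseline figures above report recall and F1 but not precision. As a minimal sketch, assuming F1 is the usual harmonic mean of precision and recall (the abstract does not state its formula), the implied precision can be recovered as follows. The function names are illustrative and are not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1 * recall / denominator


# Baseline figures quoted in the abstract (recall of 100% in both cases).
document_pivot = implied_precision(f1=0.8, recall=1.0)
feature_pivot = implied_precision(f1=0.923, recall=1.0)
print(f"document pivot precision ~ {document_pivot:.3f}")
print(f"feature pivot precision ~ {feature_pivot:.3f}")
```

    Under that assumption, the baseline precision works out to roughly 0.667 for the document pivot approach and 0.857 for the feature pivot approach. The same identity need not hold for the averaged figures, since an average of per-data-set F1 values generally differs from the F1 of the averaged precision and recall.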

    Automatic text summarization in digital libraries

    xiii, 142 leaves ; 28 cm. A digital library is a collection of services and information objects for storing, accessing, and retrieving digital objects. Automatic text summarization presents salient information in a condensed form suitable for user needs. This thesis amalgamates digital libraries and automatic text summarization by extending the Greenstone Digital Library software suite to include the University of Lethbridge Summarizer. The tool generates summaries, nouns, and noun phrases for use as metadata for searching and browsing digital collections. Digital collections of newspapers, PDFs, and eBooks were created with summary metadata. PDF documents were processed the fastest at 1.8 MB/hr, followed by the newspapers at 1.3 MB/hr, with eBooks being the slowest at 0.9 MB/hr. Qualitative analysis of four genres (newspaper, M.Sc. thesis, novel, and poetry) revealed that narrative newspapers were most suitable for automatically generated summarization. The other genres suffered from incoherence and information loss. Overall, summaries for digital collections are suitable when used with newspaper documents and unsuitable for other genres.

    Representing and Redefining Specialised Knowledge: Medical Discourse

    This volume brings together five selected papers on medical discourse which show how specialised medical corpora provide a framework that helps those engaging with medical discourse to determine how the everyday and the specialised combine to shape the discourse of medical professionals and non-medical communities in relation to both long and short-term factors. The papers contribute, in an exemplary way, to illustrating the shifting boundaries in today’s society between the two major poles making up the medical discourse cline: healthcare discourse at the one end, which records the demand for personalised therapies and individual medical services; and clinical discourse at the other, which documents research into society’s collective medical needs.

    Approaches to Automatic Text Structuring

    Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. In the Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data which lacks structure, thus making it more difficult to understand. In this thesis, we will explore techniques for automatic text structuring to help readers to fulfill their information needs. Useful techniques for automatic text structuring are keyphrase identification, table-of-contents generation, and link identification. We improve state of the art results for approaches to text structuring on several benchmark datasets. In addition, we present new representative datasets for users’ everyday tasks. We evaluate the quality of text structuring approaches with regard to these scenarios and discover that the quality of approaches highly depends on the dataset on which they are applied. In the first chapter of this thesis, we establish the theoretical foundations regarding text structuring. We describe our findings from a user survey regarding web usage from which we derive three typical scenarios of Internet users. We then proceed to the three main contributions of this thesis. We evaluate approaches to keyphrase identification both by extracting and assigning keyphrases for English and German datasets. We find that unsupervised keyphrase extraction yields stable results, but for datasets with predefined keyphrases, additional filtering of keyphrases and assignment approaches yields even higher results. We present a decompounding extension, which further improves results for datasets with shorter texts. 
We construct hierarchical table-of-contents of documents for three English datasets and discover that the results for hierarchy identification are sufficient for an automatic system, but for segment title generation, user interaction based on suggestions is required. We investigate approaches to link identification, including the subtasks of identifying the mention (anchor) of the link and linking the mention to an entity (target). Approaches that make use of the Wikipedia link structure perform best, as long as there is sufficient training data available. For identifying links to sense inventories other than Wikipedia, approaches that do not make use of the link structure outperform the approaches using existing links. We further analyze the effect of senses on computing similarities. In contrast to entity linking, where most entities can be discriminated by their name, we consider cases where multiple entities with the same name exist. We discover that similarity depends on the selected sense inventory. To foster future evaluation of natural language processing components for text structuring, we present two prototypes of text structuring systems, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embedding

    Towards a poetics of titles: the prehistory

    This thesis initiates a diachronic reconsideration of the English literary title. Unlike previous critical studies of titling practices, which focus almost exclusively on modern printed works, the thesis turns to the titling practices of manuscripts, addressing the different forms, functions and meanings of premodern titling. The overlapping of theoretical and material concerns necessitates a new multidisciplinary approach which combines critical theories of titology with codicological and bibliographical modes of enquiry. The introductory chapter contrasts different titling practices of contemporary and premodern literary cultures. Chapter two identifies shortcomings in current titological theories. The third chapter opens with a consideration of the meanings and uses of the word title specific to the premodern era and the possible influences ancient and early medieval approaches to identifying and defining texts may have had on later medieval titling. Chapter four considers the growth in external and internal forms of vernacular titling practice evident in selected manuscripts of the eleventh, twelfth and thirteenth centuries. The fifth chapter moves the discussion into the thirteenth and fourteenth centuries as witnessed by three important codices from this time: Oxford, Bodleian Library, Digby 86; Scotland, National Library, Advocates 19. 2. 1 (Auchinleck); and Oxford, Bodleian Library, Eng. poet. a.1 (Vernon). The conclusion affirms that titling practices did have currency in premodernity, though the identification of texts was a practice that exhibits great diversity. In that feature, as well as in many others, what may appear superficially to be recognisable as titling stands a significant distance apart from modern concepts of the title and involves many other contemporary assumptions about (para)texts, authors and readers, which are essential to an understanding of what medieval authors and scribes meant when they gave identity to texts. EThOS - Electronic Theses Online Service, GB, United Kingdom
