
    Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video

    Recent years have seen significant efforts in the area of Cross-Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has to date been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors introduced by the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC content. Our experimental investigation shows that different sources of evidence, e.g. the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata field has a varying robustness to query expansion (QE), which can therefore have a negative impact on CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion and shows how this technique can be effective for improving CLIR effectiveness for UGC content

    Machine translation of user-generated content

    The world of social media has undergone huge evolution during the last few years. With the spread of social media and online forums, individual users actively participate in the generation of online content in different languages from all over the world. Sharing of online content has become much easier than before with the advent of popular websites such as Twitter, Facebook, etc. Such content is referred to as ‘User-Generated Content’ (UGC). Some examples of UGC are user reviews, customer feedback, tweets, etc. In general, UGC is informal and noisy in terms of linguistic norms. Such noise does not create significant problems for humans to understand the content, but it can pose challenges for several natural language processing applications such as parsing, sentiment analysis, machine translation (MT), etc. An additional challenge for MT is the sparseness of bilingual (translated) parallel UGC corpora. In this research, we explore the general issues in MT of UGC and set some research goals from our findings. One of our main goals is to exploit comparable corpora in order to extract parallel or semantically similar sentences. To accomplish this task, we design a document alignment system to extract semantically similar bilingual document pairs from bilingual comparable corpora. We then apply strategies to extract parallel or semantically similar sentences from comparable corpora by transforming the document alignment system into a sentence alignment system. We seek to improve the quality of parallel data extraction for UGC translation and combine the extracted data with existing human-translated resources. Another objective of this research is to demonstrate the usefulness of MT-based sentiment analysis. However, when using openly available systems such as Google Translate, the translation process may alter the sentiment in the target language. To cope with this phenomenon, we instead build fine-grained sentiment translation models that focus on sentiment preservation in the target language during translation

    Towards effective cross-lingual search of user-generated internet speech

    The very rapid growth in user-generated social (UGS) spoken content on online platforms is creating new challenges for Spoken Content Retrieval (SCR) technologies. There are many potential choices for how to design a robust SCR framework for UGS content, but the current lack of detailed investigation means that there is a lack of understanding of the specific challenges, and little or no guidance available to inform these choices. This thesis investigates the challenges of effective SCR for UGS content, and proposes novel SCR methods that are designed to cope with the challenges of UGS content. The work presented in this thesis can be divided into three areas of contribution as follows. The first contribution of this work is critiquing the issues and challenges that influence the effectiveness of searching UGS content in both mono-lingual and cross-lingual settings. The second contribution is to develop an effective Query Expansion (QE) method for UGS. This research reports that the variation in the length, quality, and structure of relevant documents encountered in UGS content can harm the effectiveness of QE techniques across different queries. Seeking to address this issue, this work examines the utilisation of Query Performance Prediction (QPP) techniques for improving QE in UGS, and presents a novel framework specifically designed to predict the effectiveness of QE. Thirdly, this work extends the utilisation of QPP in UGS search to improve cross-lingual search for UGS by predicting translation effectiveness. The thesis proposes novel methods to estimate the quality of translation for cross-lingual UGS search. An empirical evaluation demonstrates the quality of the proposed method on alternative translation outputs extracted from several Machine Translation (MT) systems developed for this task. The research then shows how this framework can be integrated in cross-lingual UGS search to find relevant translations for improved retrieval performance

    Andrew W. Mellon Foundation - 1999 Annual Report

    Contains president's message, program information, summary of foundation initiatives, grants list, financial statements, and list of staff

    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one

    The Information-seeking Strategies of Humanities Scholars Using Resources in Languages Other Than English

    ABSTRACT THE INFORMATION-SEEKING STRATEGIES OF HUMANITIES SCHOLARS USING RESOURCES IN LANGUAGES OTHER THAN ENGLISH by Carol Sabbar The University of Wisconsin-Milwaukee, 2016 Under the Supervision of Dr. Iris Xie This dissertation explores the information-seeking strategies used by scholars in the humanities who rely on resources in languages other than English. It investigates not only the strategies they choose but also the shifts that they make among strategies and the role that language, culture, and geography play in the information-seeking context. The study used purposive sampling to engage 40 human subjects, all of whom are post-doctoral humanities scholars based in the United States who conduct research in a variety of languages. Data were collected through semi-structured interviews and research diaries in order to answer three research questions: What information-seeking strategies are used by scholars conducting research in languages other than English? What shifts do scholars make among strategies in routine, disruptive, and/or problematic situations? And in what ways do language, culture, and geography play a role in the information-seeking context, especially in problematic situations? The data were then analyzed using grounded theory and the constant comparative method. A new conceptual model – the information triangle – was used and is presented in this dissertation to categorize and visually map the strategies and shifts. Based on the data collected, thirty distinct strategies were identified and divided into four categories: formal system, informal resource, interactive human, and hybrid strategies. Three types of shifts were considered: planned, opportunistic, and alternative. Finally, factors related to language, culture, and geography were identified and analyzed according to their roles in the information-seeking context. This study is the first of its kind to combine the study of information-seeking behaviors with the factors of language, culture, and geography, and as such, it presents numerous methodological and practical implications along with many opportunities for future research

    Supporting Research in Area Studies: a guide for academic libraries

    The study of other countries or regions of the world often crosses traditional disciplinary boundaries in the humanities and social sciences. Supporting Research in Area Studies is a comprehensive guide for academic libraries supporting these communities of researchers. This book explores the specialist requirements of these researchers in information resources, resource discovery tools, and information skills, and the challenges of working with materials in multiple languages. It makes the case that by adapting their systems and procedures to meet these needs, academic libraries find themselves better placed to support their institution's international agenda more widely. The first four chapters cover the academic landscape and its history, area studies librarianship and acquisitions. Subsequent chapters discuss collections management, digital products, and the digital humanities, and their role in academic projects. The final chapter explores information skills and the various disciplinary skills that facilitate the needs of researchers during their careers

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal

    Visualising the intellectual and social structures of digital humanities using an invisible college model

    This thesis explores the intellectual and social structures of an emerging field, Digital Humanities (DH). After around 70 years of development, DH claims to differentiate itself from the traditional Humanities through its inclusiveness, diversity, and collaboration. However, the ‘big tent’ concept not only limits our understanding of its research structure, but also results in a lack of empirical review and sustainable support. Under this umbrella, whether there are merely fragmented topics or a consolidated knowledge system is still unknown. This study seeks to answer three research questions: a) Subject: What research topics is the DH subject composed of? b) Scholar: Who has contributed to the development of DH? c) Environment: How diverse are the backgrounds of DH scholars? The Invisible College research model is refined and applied as the methodological framework that produces four visualised networks. As the results show, DH currently contributes more towards general historical literacy and information science, while longitudinally it was heavily involved in computational linguistics. Humanistic topics are more popular and central, while technical topics are relatively peripheral and have stronger connections with non-Anglophone communities. DH social networks are at the early stages of development, and their formation is heavily influenced by non-academic and non-intellectual factors, e.g., language, working country, and informal relationships. Although male scholars have dominated the field, female scholars have encouraged more communication and built more collaborations. Despite the growing appeals for more diversity, the level of international collaboration in DH is more extensive than in many other disciplines. These findings can help us gain new understanding of the central and critical questions about DH. To the best of the candidate’s knowledge, this study is the first to investigate the formal and informal structures in DH with a well-grounded research model

    Disrupting the Digital Humanities

    All too often, defining a discipline becomes more an exercise of exclusion than inclusion. Disrupting the Digital Humanities seeks to rethink how we map disciplinary terrain by directly confronting the gatekeeping impulse of many other so-called field-defining collections. What is most beautiful about the work of the Digital Humanities is exactly the fact that it can’t be tidily anthologized. In fact, the desire to neatly define the Digital Humanities (to filter the DH-y from the DH) is a way of excluding the radically diverse work that actually constitutes the field. This collection, then, works to push and prod at the edges of the Digital Humanities – to open the Digital Humanities rather than close it down. Ultimately, it’s exactly the fringes, the outliers, that make the Digital Humanities both heterogeneous and rigorous. This collection does not constitute yet another reservoir for the new Digital Humanities canon. Rather, its aim is less about assembling content than about creating new conversations. Building a truly communal space for the digital humanities requires that we all approach that space with a commitment to: 1) creating open and non-hierarchical dialogues; 2) championing non-traditional work that might not otherwise be recognized through conventional scholarly channels; 3) amplifying marginalized voices; 4) advocating for students and learners; and 5) sharing generously and openly to support the work of our peers