
    Evaluation of Computational Grammar Formalisms for Indian Languages

    Natural Language Parsing has been a prominent research area since the genesis of Natural Language Processing. Probabilistic parsers are being developed to make parser development easier, more accurate, and faster. In the Indian context, the question of which Computational Grammar Formalism to use remains open. In this paper we focus on this problem and analyze different formalisms for Indian languages.

    Integrating Cultural Knowledge into Artificially Intelligent Systems: Human Experiments and Computational Implementations

    With the advancement of Artificial Intelligence, it seems as if every aspect of our lives is impacted by AI in one way or another. As AI is used for everything from driving vehicles to criminal justice, it becomes crucial that it overcome any biases that might hinder its fair application. We are constantly trying to make AI more like humans, but most AI systems so far fail to address one of the main aspects of humanity: our culture and the differences between cultures. We cannot truly consider AI to have understood human reasoning without understanding culture. It is therefore important for cultural information to be embedded into AI systems in some way, and for AI systems to understand the differences across cultures. The main way I have chosen to do this is through two cultural markers, motifs and rituals, because both are inherently part of any culture. Motifs are elements that are repeated often, are grounded in well-known stories, and tend to be very specific to individual cultures. Rituals are part of every culture in some way; while some are constant across all cultures, others are very specific to individual ones. This makes them well suited for comparison and contrast. The first two parts of this dissertation describe two cognitive psychology studies I conducted. The first examines how people understand motifs: is it true that in-culture people identify motifs better than out-culture people? My study shows this to indeed be the case. The second study tests whether motifs are recognizable in texts, regardless of whether people understand their meaning. Our results confirm our hypothesis that motifs are recognizable. The third part of my work discusses the survey and data collection effort around rituals. I collected data about rituals from people from various national groups and observed the differences in their responses. The main results from this were twofold: first, that cultural differences across groups are quantifiable, and that they are prevalent and observable with proper effort; and second, a substantial, culturally sensitive dataset, collected and curated for a wide variety of uses across various AI systems. The fourth part of the dissertation focuses on a system I built, called the motif association miner, which provides information about motifs present in input text, such as associations, sources of motifs, and connotations. This information will be highly useful because future systems can use my output as input and gain a better understanding of motifs; it also demonstrates an approach for bringing the culture-specific meanings of motifs to wider usage. As the final contribution, this thesis details my efforts to use the curated ritual data to improve an existing Question Answering system, and shows that this method helps systems perform better in situations that vary by culture. This data and approach, which will be made publicly available, will enable others in the field to use the information contained within to combat some bias in their systems.

    Mining semantics for culturomics: towards a knowledge-based approach

    The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts, and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, grows at an accelerating rate. The volumes of text available in digital form have grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information contained in them. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of large volumes of Swedish text, focusing on the theoretical and methodological advancement of extracting and correlating information from such volumes using a combination of knowledge-based and statistical methods.

    Analyzing evolution of rare events through social media data

    Recently, some researchers have attempted to find a relationship between the evolution of rare events and temporal-spatial patterns of social media activities. Their studies verify that the relationship exists in both the time and spatial domains. However, few of those studies can accurately deduce the time point when social media activities are most highly affected by a rare event, because producing an accurate temporal pattern of social media during the evolution of a rare event is very difficult. This work expands the current studies along three directions. Firstly, we focus on the intensity of information volume and propose an innovative clustering-algorithm-based data processing method to characterize the evolution of a rare event by analyzing social media data. Secondly, novel feature extraction and fuzzy logic-based classification methods are proposed to distinguish and classify event-related and unrelated messages. Lastly, since many messages do not have ground truth, we execute four existing ground-truth inference algorithms to deduce the ground truth and compare their performance. We then propose an Adaptive Majority Voting (Adaptive MV) method and compare it with two of the existing algorithms on a set of manually labeled social media data. Our case studies focus on Hurricane Sandy in 2012 and Hurricane Maria in 2017; Twitter data collected around them are used to verify the effectiveness of the proposed methods. Firstly, the results of the proposed data processing method not only verify that a rare event and social media activities have strong correlations, but also reveal a time difference between them, which makes it easier to investigate the temporal pattern of social media activities. Secondly, the fuzzy logic-based feature extraction and classification methods are effective in identifying event-related and unrelated messages. Lastly, the Adaptive MV method deduces the ground truth well and performs better on datasets with noisy labels than the other two methods, Positive Label Frequency Threshold and Majority Voting.
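The Adaptive MV algorithm itself is specific to this dissertation, but the plain Majority Voting baseline it is compared against can be sketched in a few lines. This is an illustrative sketch with made-up labels and labelers, not the authors' code:

```python
from collections import Counter

def majority_vote(labels_per_item):
    """Infer a ground-truth label per item by taking the most common
    label among the (possibly noisy) labels assigned to that item."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

# Hypothetical annotations for two tweets from three labelers
votes = [
    ["related", "related", "unrelated"],
    ["unrelated", "unrelated", "related"],
]
print(majority_vote(votes))  # → ['related', 'unrelated']
```

An adaptive variant would typically reweight labelers by their estimated reliability rather than counting each vote equally.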

    Information Extraction on Para-Relational Data

    Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly relational data that shares the important qualities of relational data but does not present itself in a relational format. Para-relational data often conveys highly valuable information and is widely used in many different areas. If we can convert para-relational data into the relational format, many existing tools can be leveraged for a variety of interesting applications, such as data analysis with relational query systems and data integration applications. This dissertation aims to convert para-relational data into a high-quality relational form with little user assistance. We have developed four standalone systems, each addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet database management system that extracts relational information from a large number of spreadsheets. Anthias is an extension of the Senbazuru system to convert a broader range of spreadsheets into a relational format. Lyretail is an extraction system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is a web-based search system that obtains a large number of diagrams automatically extracted from web-crawled PDFs. Together, these four systems demonstrate that converting para-relational data into the relational format is possible today, and also suggest directions for future systems.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
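Senbazuru's actual inference pipeline is far more sophisticated, but the core idea of converting a spreadsheet-like hierarchical layout into relational tuples can be illustrated with a toy sketch. All data and structure below are hypothetical:

```python
# Toy illustration: flatten a spreadsheet-style layout, where indented
# sub-rows inherit a section header, into relational tuples. Systems
# like Senbazuru infer this structure automatically; here it is
# hard-coded for demonstration.
rows = [
    ("Population", None, None),     # section header row (no value)
    ("  Male", "2010", 151781),     # indented data rows
    ("  Female", "2010", 156964),
]

def flatten(rows):
    tuples, section = [], None
    for label, year, value in rows:
        if value is None:
            section = label.strip()  # remember the current section header
        else:
            tuples.append((section, label.strip(), year, value))
    return tuples

print(flatten(rows))
# → [('Population', 'Male', '2010', 151781), ('Population', 'Female', '2010', 156964)]
```

Once the data is in tuple form, standard relational tooling (SQL engines, dataframe libraries) applies directly.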

    Automatic Extraction of Narrative Structure from Long Form Text

    Automatic understanding of stories is a long-time goal of the artificial intelligence and natural language processing research communities. Stories explain the human experience. Understanding our stories promotes the understanding of both individuals and groups of people: various cultures, societies, families, organizations, governments, and corporations, to name a few. People use stories to share information. Stories are told, by narrators, in linguistic bundles of words called narratives. My work has given computers awareness of narrative structure: specifically, where the boundaries of a narrative lie in a text. This is the task of determining where a narrative begins and ends, a non-trivial task because people rarely tell one story at a time. People rarely announce when they are starting or stopping their stories: we interrupt each other; we tell stories within stories. Before my work, computers had no awareness of narrative boundaries, essentially where stories begin and end. My programs can extract narrative boundaries from novels and short stories with an F1 of 0.65. Before this I worked on teaching computers to identify which paragraphs of text have story content, with an F1 of 0.75 (state of the art). Additionally, I have taught computers to identify the narrative point of view (POV: how the narrator identifies themselves) and diegesis (how involved the narrator is in the story's action), with an F1 of over 0.90 for both narrative characteristics. For the narrative POV, diegesis, and narrative level extractors I ran annotation studies, with high agreement, that allowed me to train computational models to identify structural elements of narrative through supervised machine learning. My work has given computers the ability to find where stories begin and end in raw text. This allows for further automatic analysis, such as extraction of plot, intent, event causality, and event coreference. These tasks are impossible when the computer cannot distinguish which stories are told in which spans of text. There are two key contributions in my work: 1) my identification of features that accurately extract elements of narrative structure, and 2) the gold-standard data and reports generated from running annotation studies on identifying narrative structure.
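The F1 scores quoted above are the standard harmonic mean of precision and recall. A quick reference computation (generic metric code, not the dissertation's evaluation scripts):

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 65 correct boundaries, 35 spurious, 35 missed
print(round(f1_score(65, 35, 35), 2))  # → 0.65
```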

    Translation al Mercato del Pesce: The Importance of Human Input for Machine Translation

    This thesis investigates the translation of Italian idioms and metaphors into English, and the difficulties encountered by Machine Translation in this process. I use a framework of foreign concepts to explain many of the difficulties, and draw on interviews with native Italian and English speakers to provide further context for the cultural knowledge encoded in figurative language. I conclude that, for Machine Translation, a consistent human input interface as well as continuous training on language corpora are crucial to improving the accuracy of translated metaphors and idioms, using Italian-to-English translation as a case study.