218 research outputs found

Deducing linguistic structure from the statistics of large corpora

Authors: Beatrice Santorini; David Magerman; Eric Brill; Mitchell Marcus
Publication venue: Morgan
Publication date: 01/01/1990

Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with less than 3-4% error rate, when trained on moderat

CiteSeerX

Crossref

Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)

Author: Santorini Beatrice
Publication venue: ScholarlyCommons
Publication date: 01/07/1990

This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech (tagging). Section 2 is an alphabetical list of the parts of speech encoded in the annotation systems of the Penn Treebank Project, along with their corresponding abbreviations (tags) and some information concerning their definition. This section allows you to find an unfamiliar tag by looking up a familiar part of speech. Section 3 recapitulates the information in Section 2, but this time the information is alphabetically ordered by tags. This is the section to consult in order to find out what an unfamiliar tag means. Since the parts of speech are probably familiar to you from high school English, you should have little difficulty in assimilating the tags themselves. However, it is often quite difficult to decide which tag is appropriate in a particular context. The two sections 4 and 5 therefore include examples and guidelines on how to tag problematic cases. If you are uncertain about whether a given tag is correct or not, refer to these sections in order to ensure a consistently annotated text. Section 4 discusses parts of speech that are easily confused and gives guidelines on how to tag such cases, while Section 5 contains an alphabetical list of specific problematic words and collocations. Finally, Section 6 discusses some general tagging conventions. One general rule, however, is so important that we state it here. Many texts are not models of good prose, and some contain outright errors and slips of the pen. Do not be tempted to correct a tag to what it would be if the text were correct; rather, it is the incorrect word that should be tagged correctly

ScholarlyCommons@Penn

Enhancing Online Food Delivery Systems through Comprehensive Text Analytics and Strategic Data Integration

Author: Surabani Santorini
Publication venue: Jejaring Penelitian dan Pengabdian Masyarakat
Publication date: 10/01/2024

Addressing challenges in the online food delivery system involves employing various data analytics techniques. Text Analytics, encompassing web analytics, social media analytics, stream analytics, and geospatial analytics, plays a pivotal role in managing and extracting valuable insights. The use of third-party systems by many companies to meet the demand for online food delivery presents issues related to control. Furthermore, information overload and poorly organized data contribute to observed problems. This research proposes effective data integration as a solution, facilitating strategic analytics for optimal system performance. Proper data sorting enables adaptive planning and priority shifts tailored to customer satisfaction. The framework of data integration is crucial in illustrating the comprehensive analysis of online food delivery systems. The report also delves into the challenges associated with implementing text analytics

International Journal of Information Technology and Computer Science Applications

Incremental Phrase Structure Generation and a Universal Theory of V2

Authors: Rambow Owen; Santorini Beatrice
Publication venue: ScholarWorks@UMass Amherst
Publication date: 26/09/2020

ScholarWorks@UMass Amherst

Remarks on Causatives and Passive

Authors: Heycock Caroline; Santorini Beatrice
Publication venue: ScholarlyCommons
Publication date: 01/05/1988

The investigation of causative constructions has been a topic of enduring interest among linguists, generative and non-generative alike. For one thing, the variability and sheer complexity of the relevant empirical domain, even within a group of closely related languages such as Romance, poses considerable and often daunting descriptive challenges. On the other hand, comparative work by linguists of various theoretical persuasions (Aissen 1974, Aissen 1979, Baker 1985, Comrie 1976, Marantz 1984, Zubizarreta 1982, Zubizarreta 1985, among many others) has shown that certain properties of causatives recur with striking regularity among unrelated and typologically otherwise diverse languages, in the absence of areal contact. This holds out the hope that the bewildering variety of data that we are faced with when we consider causative constructions can be understood with reference to a relatively small number of causative types. At first glance, the most salient distinction is that between syntactic and morphological causative formation. As is well known, in some languages the causative is expressed by means of syntactic complementation, as in the English example in (1), whereas in others it involves morphological affixation, as in the Japanese equivalent of (1) given in (2)

ScholarlyCommons@Penn

Data Analytics Application in Fashion Retail SMEs (A Case Study in Caracas Fashion Store)

Authors: Rodriguez Bradlow; Surabani Santorini
Publication venue: Jejaring Penelitian dan Pengabdian Masyarakat
Publication date: 14/01/2023

Data analytics plays a paramount role in maximizing productivity and profitability for businesses by deriving insights from pre-existing data to predict market trends and client habits to make better business decisions. In accordance with Industrial Revolution 4.0, most SMEs have begun to implement an e-commerce business model, thus, customer data is generated at an exponential rate, allowing SMEs to further develop their services for greater user satisfaction. However, the abundance of unsorted and ambiguous data leads to issues such as server overload and inefficient customer sales cycle tracking. This paper will explain the application of data analytics techniques and architectures to overcome these issues in a fashion retail SME, as well as the benefits and drawbacks of these solutions

International Journal of Information Technology and Computer Science Applications

Parsing Early Modern English for Linguistic Search

Authors: Kulick Seth; Ryant Neville; Santorini Beatrice
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/02/2022

This work addresses the question of whether the output of a state-of-the-art parser is accurate enough to support research in theoretical linguistics. In order to build reliable models of syntactic change, we aim to eventually parse the 1.5-billion-word Early English Books Online (EEBO) corpus. But since EEBO is not yet parsed, we begin by constructing and testing a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). In order to obtain robust results, we define an 8-fold split on PPCEME. We then evaluate the parser with evalb and, more relevantly for us, with a task-specific metric - namely, its accuracy in parsing 6 sentence types necessary to track the rise of auxiliary do (as in They did not come vs. its historical precursor They came not). Retrieving the relevant sentences from the gold and test versions with CorpusSearch queries, we find that the parser's accuracy promises to be sufficient for our purposes. A remaining concern is the variability of the output, which we plan to address with three pieces of future work sketched in the conclusion
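
The abstract does not say how the 8-fold split on PPCEME is defined. As a rough illustration only, the sketch below shows a generic cross-validation-style partition in which each item is held out for testing exactly once. The function name `k_fold_splits` and the file names are hypothetical and are not taken from the paper.

```python
def k_fold_splits(items, k=8):
    """Yield (train, test) pairs; every item lands in exactly one test fold.

    Assumption: a plain round-robin partition. The paper's actual split
    may differ (for example, it may also reserve a development section).
    """
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


# Hypothetical corpus file names, used only to exercise the function.
files = [f"file_{n:02d}" for n in range(20)]
splits = list(k_fold_splits(files, k=8))
```

Averaging a metric over all eight (train, test) pairs gives a more stable estimate than a single split, which is presumably the sense of "robust results" here.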

ScholarWorks@UMass Amherst

Parsing Early English Books Online for Linguistic Search

Authors: Kulick Seth; Ryant Neville; Santorini Beatrice
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/06/2023

This work addresses the question of how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for utility in linguistic research. Earlier work has trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of 6 specific sentence types of interest. However, significant differences between EEBO and the manually-annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, an overlap of source material in PPCEME and EEBO allows us to establish a token alignment between them and to score the POS-tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, thus allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, pointing the way to further improvements
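
The abstract does not describe the alignment procedure. The sketch below is one simple way such a token alignment could be used to score POS tags, assuming a longest-matching-block alignment from Python's standard `difflib`. It is not the authors' method; the function name `aligned_tag_accuracy` and the tags in the example are hypothetical.

```python
from difflib import SequenceMatcher


def aligned_tag_accuracy(gold, predicted):
    """Score predicted POS tags only on tokens that align with the gold text.

    gold, predicted: lists of (token, tag) pairs.
    Returns (accuracy over aligned tokens, number of aligned tokens).
    """
    gold_tokens = [tok for tok, _ in gold]
    pred_tokens = [tok for tok, _ in predicted]
    matcher = SequenceMatcher(a=gold_tokens, b=pred_tokens, autojunk=False)
    aligned = correct = 0
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            aligned += 1
            if gold[block.a + offset][1] == predicted[block.b + offset][1]:
                correct += 1
    return (correct / aligned if aligned else 0.0), aligned


# Hypothetical tags; the second text has one extra, unaligned token.
gold = [("They", "PRO"), ("came", "VBD"), ("not", "NEG")]
pred = [("They", "PRO"), ("came", "VBD"), ("indeed", "ADV"), ("not", "ADV")]
accuracy, n_aligned = aligned_tag_accuracy(gold, pred)
```

Tokens present in only one version are skipped rather than counted as errors, so the score reflects tagging quality on the shared material only.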

ScholarWorks@UMass Amherst

First Steps Towards an Annotated Database of American English

Authors: Magerman David; Marcus Mitchell P.; Santorini Beatrice
Publication venue: ScholarlyCommons
Publication date: 01/07/1990

This paper reports on one of the first steps in building a very large annotated database of American English. We present and discuss the results of an experiment comparing manual part-of-speech tagging with manual verification and correction of automatic stochastic tagging. The experiment shows that correcting is superior to tagging with respect to speed, consistency and accuracy

ScholarlyCommons@Penn