144 research outputs found

    Template-Based Metadata Extraction for Heterogeneous Collection

    With the growth of the Internet and related tools, there has been a rapid growth of online resources. In particular, high-quality OCR (Optical Character Recognition) tools have made it easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata. The lack of metadata hampers not only the discovery and dispersion of these collections over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection, and most existing automated metadata extraction approaches have focused on specific domains and homogeneous collections. Developing an approach to extract metadata automatically from a large heterogeneous legacy collection poses a number of challenges. In particular, the following issues need to be addressed: (1) Heterogeneity, i.e., how to achieve high accuracy for a heterogeneous collection; (2) Scaling, i.e., how to apply an automated metadata extraction approach to a very large collection; (3) Evolution, i.e., how to process new documents added to a collection over time; (4) Adaptability, i.e., how to apply an approach to a new document collection; (5) Complexity, i.e., how many document features can be handled, and how complex the features should be. In this dissertation, we propose a template-based metadata extraction approach to address these issues. The key idea for addressing heterogeneity is to classify documents into equivalent groups so that each group contains only similar documents. Next, for each document group we create a template that contains a set of rules instructing a template engine how to extract metadata from documents in the group. Templates are written in an XML-based language and kept in separate files. Decoupling rules from program code and representing them in an XML format makes our approach easy to adapt to another collection with documents in different styles. We developed our test bed by downloading about 10,000 documents from the DTIC (Defense Technical Information Center) collection, which consists of scanned documents in PDF (Portable Document Format). We have evaluated our approach on this test bed, and our results are encouraging. We have also demonstrated how the extracted metadata can be used to integrate our test bed with an interoperable digital library framework based on OAI (Open Archives Initiative).
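    As a rough illustration of the rule-template idea, the sketch below applies a hypothetical XML template to OCR'd text using only Python's standard library. The element names, attributes, field names, and regular expressions are illustrative assumptions, not the dissertation's actual template language.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule template for one document group; the schema shown
# here is invented for illustration, not the dissertation's actual one.
TEMPLATE = """
<template group="dtic-report">
  <rule field="title"  pattern="(?m)^Title:\\s*(.+)$"/>
  <rule field="author" pattern="(?m)^Author\\(s\\):\\s*(.+)$"/>
  <rule field="date"   pattern="(?m)^Report Date:\\s*(.+)$"/>
</template>
"""

def extract_metadata(text: str, template_xml: str) -> dict:
    """Apply each rule's regex to the OCR text; first match wins."""
    metadata = {}
    for rule in ET.fromstring(template_xml).findall("rule"):
        match = re.search(rule.get("pattern"), text)
        if match:
            metadata[rule.get("field")] = match.group(1).strip()
    return metadata

page = "Title: Sonar Signal Processing\nAuthor(s): J. Doe\nReport Date: 1998-04-01"
print(extract_metadata(page, TEMPLATE))
# {'title': 'Sonar Signal Processing', 'author': 'J. Doe', 'date': '1998-04-01'}
```

    Because the rules live in a separate XML file rather than in code, adapting to a new document group only requires writing a new template, which is the adaptability argument the abstract makes.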

    RoBoSS: A Robust, Bounded, Sparse, and Smooth Loss Function for Supervised Learning

    In the domain of machine learning algorithms, the significance of the loss function is paramount, especially in supervised learning tasks. It serves as a fundamental pillar that profoundly influences the behavior and efficacy of supervised learning algorithms. Traditional loss functions, while widely used, often struggle to handle noisy and high-dimensional data, impede model interpretability, and lead to slow convergence during training. In this paper, we address the aforementioned constraints by proposing a novel robust, bounded, sparse, and smooth (RoBoSS) loss function for supervised learning. Further, we incorporate the RoBoSS loss function within the framework of the support vector machine (SVM) and introduce a new robust algorithm named $\mathcal{L}_{rbss}$-SVM. For the theoretical analysis, the classification-calibrated property and generalization ability are also presented. These investigations are crucial for gaining deeper insights into the performance of the RoBoSS loss function in classification tasks and its potential to generalize well to unseen data. To empirically demonstrate the effectiveness of the proposed $\mathcal{L}_{rbss}$-SVM, we evaluate it on 88 real-world UCI and KEEL datasets from diverse domains. Additionally, to exemplify its effectiveness within the biomedical realm, we evaluate it on two medical datasets: the electroencephalogram (EEG) signal dataset and the breast cancer (BreaKHis) dataset. The numerical results substantiate the superiority of the proposed $\mathcal{L}_{rbss}$-SVM model, both in terms of its remarkable generalization performance and its efficiency in training time.
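    The abstract does not give the closed form of RoBoSS, so the sketch below uses an illustrative loss of our own devising that merely exhibits the four advertised properties on the SVM margin; it should not be read as the paper's actual definition.

```python
import numpy as np

def illustrative_bounded_loss(margin, a=1.0, lam=1.0):
    """Illustrative surrogate with the four advertised properties, NOT the
    paper's actual RoBoSS definition: zero for margins >= 1 (sparse),
    differentiable everywhere including at margin 1 (smooth), monotone in
    the violation, and saturating at lam for badly misclassified points
    (bounded, hence robust to outliers and label noise)."""
    u = np.maximum(0.0, 1.0 - margin)          # hinge-style violation of y*f(x)
    return lam * (1.0 - np.exp(-a * u * u))    # saturates at lam as u -> inf

margins = np.array([-3.0, 0.0, 0.5, 1.0, 2.0])
print(illustrative_bounded_loss(margins))      # large, moderate, small, 0, 0
```

    Boundedness is what distinguishes this family from the plain hinge loss: a single gross outlier can contribute at most lam to the objective instead of an arbitrarily large penalty.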

    Gaussian Processes for Text Regression

    Text Regression is the task of modelling and predicting numerical indicators or response variables from textual data. It arises in a range of different problems, from sentiment and emotion analysis to text-based forecasting. Most models in the literature apply simple text representations such as bag-of-words and predict response variables in the form of point estimates. These simplifying assumptions ignore important information coming from the data, such as the underlying uncertainty present in the outputs and the linguistic structure in the textual inputs. The former is particularly important when the response variables come from human annotations, while the latter can capture linguistic phenomena that go beyond simple lexical properties of a text. In this thesis our aim is to advance the state-of-the-art in Text Regression by improving these two aspects: better uncertainty modelling in the response variables and improved text representations. Our main workhorse to achieve these goals is Gaussian Processes (GPs), a Bayesian kernelised probabilistic framework. GP-based regression models the response variables as well-calibrated probability distributions, providing additional information in predictions which in turn can improve subsequent decision making. GPs also model the data using kernels, enabling richer representations based on similarity measures between texts. To reach our main goals we propose new kernels for text which aim at capturing richer linguistic information. These kernels are then parameterised and learned from the data using efficient model selection procedures enabled by the GP framework. Finally, we also capitalise on recent advances in the GP literature to better capture uncertainty in the response variables, such as multi-task learning and models that can incorporate non-Gaussian variables through the use of warping functions. Our proposed architectures are benchmarked in two Text Regression applications: Emotion Analysis and Machine Translation Quality Estimation. Overall we are able to obtain better results compared to baselines while also providing uncertainty estimates for predictions in the form of posterior distributions. Furthermore, we show how these models can be probed to obtain insights about the relation between the data and the response variables, and also how to apply predictive distributions in subsequent decision-making procedures.
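    A minimal sketch of the basic pipeline, assuming scikit-learn and a plain bag-of-words representation with an RBF kernel as a generic stand-in for the thesis's structural text kernels; the texts and scores below are invented. The point is only that GP regression returns a predictive distribution (mean and standard deviation), not just a point estimate.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Tiny emotion-intensity style example; texts and scores are made up.
texts  = ["great news today", "terrible awful day", "okay I guess",
          "absolutely wonderful", "really bad news"]
scores = np.array([0.9, 0.1, 0.5, 0.95, 0.15])

vec = CountVectorizer()                       # bag-of-words stand-in for
X = vec.fit_transform(texts).toarray()        # the thesis's text kernels

# WhiteKernel models annotation noise; hyperparameters are optimised
# by marginal likelihood, the model selection the abstract refers to.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, scores)

X_new = vec.transform(["wonderful day"]).toarray()
mean, std = gp.predict(X_new, return_std=True)
print(f"predicted score {mean[0]:.2f} +/- {std[0]:.2f}")
```

    The predictive standard deviation is what enables the downstream decision-making uses the thesis describes, e.g. deferring low-confidence machine-translation quality estimates to a human.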

    An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

    Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, which are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve, established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefit of improved minority-class accuracy and the cost of reduced majority-class accuracy. Next, we analyze the impact of the imbalance ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced datasets reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically as the imbalance ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.
    National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104
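    A minimal sketch of the trade-off the efficiency curve formalises: each classifier is reduced to a (majority accuracy, minority accuracy) pair, and the non-dominated pairs form the empirical frontier against which a DEA-WEI-style efficiency score would be measured. The accuracy pairs below are invented, and this is not the paper's DEA-WEI computation.

```python
import numpy as np

def class_accuracies(y_true, y_pred):
    """Per-class accuracy: TNR for the majority class (0), TPR for the
    minority class (1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    majority = np.mean(y_pred[y_true == 0] == 0)
    minority = np.mean(y_pred[y_true == 1] == 1)
    return majority, minority

def pareto_frontier(points):
    """Indices of classifiers not dominated on both axes; these form the
    empirical frontier an efficiency score is measured against."""
    frontier = []
    for i, (m1, n1) in enumerate(points):
        dominated = any(m2 >= m1 and n2 >= n1 and (m2, n2) != (m1, n1)
                        for m2, n2 in points)
        if not dominated:
            frontier.append(i)
    return frontier

# Hypothetical (majority, minority) accuracy pairs for four classifiers.
points = [(0.98, 0.55), (0.92, 0.70), (0.85, 0.80), (0.84, 0.60)]
print(pareto_frontier(points))  # [0, 1, 2]; the last classifier is dominated
```

    A classifier below the frontier gives up majority-class accuracy without a compensating minority-class gain, which is exactly the inefficiency the curve is designed to expose.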

    A comparative study of edge detection techniques

    The problem of detecting edges in gray-level digital images is considered. A literature survey of the existing methods is presented. Based on the survey, two methods that are well accepted by a majority of investigators are identified: 1) the Laplacian of Gaussian (LoG) operator, and 2) an optimal detector based on maxima in the gradient magnitude of a Gaussian-smoothed image. The latter has been proposed by Canny and will be referred to as Canny's method. The purpose of the thesis is to compare the performance of these popular methods. In order to increase the scope of such a comparison, two additional methods are considered. The first is one of the simplest methods, based on the first-order approximation of the first derivative of the image; it has the advantage of a relatively low computational cost. The second is an attempt to develop an edge-fitting method based on eigenvector least-squared-error fitting of an intensity profile, developed with the intent of keeping edge localization errors small. All four methods are coded and applied to several digital images, actual as well as synthesized. Results show that the LoG method and Canny's method perform quite well in general, which demonstrates why these methods are popular. On the other hand, even the simplest first-derivative method is found to perform well if applied properly. Based on the results of the comparative study, several critical issues related to edge detection are pointed out. Results also indicate the feasibility of the proposed method based on eigenvector fit. Improvements and recommendations for further work are made.
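    A minimal sketch of two of the four compared detectors using OpenCV, assuming an input image report.png; the Canny thresholds, Gaussian sigma, and zero-crossing magnitude cutoff are illustrative choices, not the thesis's settings.

```python
import cv2
import numpy as np

img = cv2.imread("report.png", cv2.IMREAD_GRAYSCALE)  # illustrative filename

# Canny: maxima in the gradient magnitude of a Gaussian-smoothed image,
# with hysteresis thresholding (thresholds chosen arbitrarily here).
canny_edges = cv2.Canny(img, 100, 200)

# LoG: smooth with a Gaussian, take the Laplacian, then mark zero-crossings
# whose response magnitude is large enough to suppress noise.
log = cv2.Laplacian(cv2.GaussianBlur(img, (5, 5), 1.4), cv2.CV_64F)
sign = log > 0
zero_cross = (sign[:-1, :-1] != sign[1:, :-1]) | (sign[:-1, :-1] != sign[:-1, 1:])
strong = np.abs(log[:-1, :-1]) > 0.02 * np.abs(log).max()
log_edges = np.zeros_like(img)
log_edges[:-1, :-1][zero_cross & strong] = 255

cv2.imwrite("canny.png", canny_edges)
cv2.imwrite("log.png", log_edges)
```

    The zero-crossing step is what gives LoG its closed, one-pixel-wide contours, while Canny's hysteresis stage is what keeps weak edge segments only when they connect to strong ones.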

    Parametric, Nonparametric, and Semiparametric Linear Regression in Classical and Bayesian Statistical Quality Control

    Statistical process control (SPC) is used in many fields to understand and monitor desired processes, such as manufacturing, public health, and network traffic. SPC is categorized into two phases: in Phase I, historical data are used to inform parameter estimates for a statistical model, and in Phase II, this statistical model is implemented to monitor a live, ongoing process. Within both phases, profile monitoring is a method for understanding the functional relationship between response and explanatory variables by estimating and tracking its parameters. In profile monitoring, control charts are often used as graphical tools to visually observe process behaviors. We construct a practitioner’s guide that provides a step-by-step application of parametric, nonparametric, and semiparametric methods in profile monitoring, creating an in-depth guideline for novice practitioners. We then consider the commonly used cumulative sum (CUSUM), multivariate CUSUM (mCUSUM), exponentially weighted moving average (EWMA), and multivariate EWMA (mEWMA) charts under a Bayesian framework for monitoring respiratory-disease-related hospitalizations and global suicide rates with parametric, nonparametric, and semiparametric linear models.
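    A minimal sketch of one of the charts covered, a classical (non-Bayesian) EWMA chart, assuming normally distributed observations; the smoothing weight, control-limit multiplier, and data are illustrative.

```python
import numpy as np

def ewma_chart(phase1, phase2, lam=0.2, L=3.0):
    """Classical EWMA chart. Phase I data estimate the in-control mean and
    sigma; each Phase II point is smoothed as z_t = lam*x_t + (1-lam)*z_{t-1}
    and flagged when z_t leaves
    mu +/- L*sigma*sqrt(lam/(2-lam)*(1-(1-lam)^(2t)))."""
    mu, sigma = np.mean(phase1), np.std(phase1, ddof=1)
    z, signals = mu, []
    for t, x in enumerate(phase2, start=1):
        z = lam * x + (1 - lam) * z
        half_width = L * sigma * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        signals.append(abs(z - mu) > half_width)
    return signals

rng = np.random.default_rng(0)
phase1 = rng.normal(10.0, 1.0, 100)                  # historical in-control data
phase2 = np.concatenate([rng.normal(10.0, 1.0, 20),  # stays in control...
                         rng.normal(11.5, 1.0, 10)]) # ...then the mean shifts
print(ewma_chart(phase1, phase2))
```

    The small smoothing weight makes the EWMA statistic accumulate evidence across observations, which is why EWMA (like CUSUM) detects small, sustained mean shifts faster than a Shewhart chart of individual points.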

    2023 SDSU Data Science Symposium Presentation Abstracts

    This document contains abstracts for presentations and posters at the 2023 SDSU Data Science Symposium.
