
    Estimation of semiparametric stochastic frontiers under shape constraints with application to pollution generating technologies

    A number of studies have explored the semi- and nonparametric estimation of stochastic frontier models by using kernel regression or other nonparametric smoothing techniques. In contrast to popular deterministic nonparametric estimators, these approaches do not allow one to impose any shape constraints (or regularity conditions) on the frontier function. On the other hand, as many of the previous techniques are based on the nonparametric estimation of the frontier function, the convergence rate of frontier estimators can be sensitive to the number of inputs, which is generally known as “the curse of dimensionality” problem. This paper proposes a new semiparametric approach for stochastic frontier estimation that avoids the curse of dimensionality and allows one to impose shape constraints on the frontier function. Our approach is based on the single-index model and applies both single-index estimation techniques and shape-constrained nonparametric least squares. In addition to production frontier and technical efficiency estimation, we show how the technique can be used to estimate pollution generating technologies. The new approach is illustrated by an empirical application to the environmentally adjusted performance evaluation of U.S. coal-fired electric power plants.
    Keywords: stochastic frontier analysis (SFA), nonparametric least squares, single-index model, sliced inverse regression, monotone rank correlation estimator, environmental efficiency
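
    For orientation, a single-index stochastic frontier typically takes the form sketched below, with the link function g estimated under shape constraints and the index coefficients estimated by single-index techniques such as sliced inverse regression; this is the standard formulation from the SFA literature, not necessarily the paper's exact specification.

```latex
% Standard single-index stochastic production frontier:
% the scalar index x_i'beta sidesteps the curse of dimensionality,
% g carries the shape constraints (e.g. monotonicity, concavity),
% v_i is two-sided noise and u_i >= 0 is technical inefficiency.
y_i = g(x_i^{\top}\beta) + v_i - u_i,
\qquad v_i \sim N(0, \sigma_v^2), \quad u_i \ge 0
```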

    Sentiment analysis on Twitter

    In recent years, more and more people have been connecting to social networks, and Twitter is one of the most widely used. This huge amount of information is attracting the interest of companies, since it can be used to detect public opinion about their brands and thus improve their business value. Turning the information present in social networks into knowledge requires several steps, and this project aims to describe them and to provide tools able to perform the task. The first problem is how to retrieve the data: several ways are available, each with its own pros and cons. It is then necessary to study and define proper queries in order to retrieve the information needed. Once the data is retrieved, it may need to be filtered and explored; for this task a topic model algorithm (LDA) has been studied and analyzed. LDA has shown positive results when it is tuned properly and combined with appropriate visualization techniques. The difference between a topic model algorithm and other clustering/segmentation techniques is that topic models allow each “document” (instance) to belong to more than one topic (cluster). LDA does not natively work well on Twitter because of the very short length of tweets; an investigation of the literature revealed a solution to this problem. Another problem common in clustering is how to validate the algorithm and how to choose the proper number of topics (clusters); several metrics from the literature have been explored for this purpose. Afterwards, sentiment analysis techniques can be applied in order to measure the opinion of the users. The literature presents several approaches to this problem. This work focuses on the polarity detection task with three classes, i.e., classifying whether a tweet expresses a positive, negative or neutral sentiment. Reaching accurate results here can be challenging because of the messy nature of Twitter posts. Several approaches have been tested and compared. The baseline method is the use of sentiment dictionaries; after that, since the real sentiment of the Twitter posts is not available, a sample has been manually labeled and several supervised approaches combined with various feature selection/transformation techniques have been tested. Finally, a new experimental approach, inspired by the soft labeling technique present in the literature, has been defined and tested. This method tries to avoid the costly task of manually labeling a sample in order to validate a model. In the literature this problem is solved for the two-class case, i.e., by considering only positive and negative tweets; this work extends the soft-labeling approach to the three-class problem.
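
    As a concrete illustration of the two-stage pipeline described above (topic exploration with LDA followed by supervised three-class polarity classification), here is a minimal scikit-learn sketch; the example tweets, labels, vectorizer settings and model choices are placeholder assumptions, not the exact setup used in the project.

```python
# Minimal sketch of the two-stage pipeline described above:
# (1) explore the corpus with LDA topics, (2) train a three-class
# polarity classifier on a manually labeled sample. The tweets,
# labels, vectorizer settings and model choices are placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["great service today", "worst delay ever, never again",
          "flight leaves at 9am", "love this brand", "so disappointed"]
labels = ["positive", "negative", "neutral", "positive", "negative"]

# --- Topic exploration with LDA ---------------------------------------
counts = CountVectorizer(stop_words="english")
doc_term = counts.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)          # per-tweet topic mixture

# --- Supervised three-class polarity detection ------------------------
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["the flight was awesome"]))    # one of the three classes
```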

    Novel Computational Methods for Censored Data and Regression

    This dissertation can be divided into three topics. In the first, we derived a recursive algorithm for the constrained Kaplan-Meier estimator that speeds up computation by up to fifty times compared with the current method based on the EM algorithm, and we showed how this leads to a vast improvement in empirical likelihood analysis with right-censored data. After a brief review of regularized regression, we investigated the computational problems in parametric/non-parametric hybrid accelerated failure time models and their regularization in a high-dimensional setting, and illustrated that, as the number of pieces increases, the models discussed approach a nonparametric one. In the last topic, we discussed a semi-parametric approach to the hypothesis testing problem in the binary choice model. The major tools are a Buckley-James-like algorithm and empirical likelihood; the essential idea, similar to that of the first topic, is to iteratively compute the linearly constrained empirical likelihood using optimization algorithms, including EM and the iterative convex minorant algorithm.
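
    For context, the unconstrained Kaplan-Meier estimator on which the constrained version builds is the standard product-limit formula below; the constrained variant additionally imposes a linear constraint on the estimated distribution, and the recursive algorithm derived in the dissertation is not reproduced here.

```latex
% Product-limit (Kaplan-Meier) estimator of the survival function,
% with d_i events and n_i subjects at risk at observed time t_i.
\hat{S}(t) = \prod_{i \,:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```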

    Multidimensional Scaling Using Majorization: SMACOF in R

    In this paper we present the methodology of multidimensional scaling problems (MDS) solved by means of the majorization algorithm. The objective function to be minimized is known as stress and functions which majorize stress are elaborated. This strategy to solve MDS problems is called SMACOF and it is implemented in an R package of the same name which is presented in this article. We extend the basic SMACOF theory in terms of configuration constraints, three-way data, unfolding models, and projection of the resulting configurations onto spheres and other quadratic surfaces. Various examples are presented to show the possibilities of the SMACOF approach offered by the corresponding package.
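
    For reference, the objective being majorized is the stress function in its standard form, where the δ_ij are observed dissimilarities, d_ij(X) are configuration distances and w_ij are nonnegative weights; the notation follows the usual SMACOF literature rather than the paper's exact symbols.

```latex
% Raw stress minimized by SMACOF over configurations X in R^{n x p};
% each majorization step minimizes a quadratic upper bound of
% sigma(X), which yields the Guttman transform update.
\sigma(X) = \sum_{i<j} w_{ij}\,\bigl(\delta_{ij} - d_{ij}(X)\bigr)^{2},
\qquad d_{ij}(X) = \lVert x_i - x_j \rVert
```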

    Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

    The value derivable from the use of data has been increasing continuously for some years. Both commercial and non-commercial organisations have realised the immense benefits that might be derived if all data at their disposal could be analysed and form the basis of decision making. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to be able to generate value from huge amounts of data, often in non-structured format and produced extremely frequently. However, the potential value derivable depends on the general level of data governance, and more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy in the Big Data context. This dissertation focused on investigating effective methods to enhance data quality for Big Data. The principal deliverable of this research is a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual (that is, a non-generalizable field), this research study focuses on applying the methodological approach to one use case, namely Electronic Health Records (EHR). The first main contribution to knowledge of this study is a systematic investigation of which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence, with a special focus on machine learning, could be used to improve the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Based on the experiments carried out, regression and clustering algorithms proved to be more adequate for accuracy-related and completeness-related issues, respectively. However, the limits of implementing and using machine learning algorithms for detecting data quality issues in Big Data were also revealed and discussed in this research study. It can safely be deduced from this part of the research study that the use of machine learning to enhance the detection of data quality issues is a promising area, but not yet a panacea which automates the entire process. The third important contribution is a proposed guideline for undertaking data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and treated as areas of practice on which efficient data reparation algorithms must focus. These three contributions form the nucleus of a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed in the proposed methodological approach can be transposed to other industries and use cases to a large extent. The proposed data quality methodological approach can be used by practitioners of Big Data quality who follow a data-driven strategy.
As opposed to existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific. It gives clear and proven methods for undertaking the main identified stages of a Big Data quality lifecycle and can therefore be applied by practitioners in the area. This research study provides some promising results and deliverables, and it paves the way for further research in the area. Big Data technologies are evolving rapidly, and future research should focus on new representations of Big Data, the real-time streaming aspect, and replicating the research methods used in this study on new technologies to validate the current results.
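
    As a purely illustrative sketch of the kind of machine-learning checks discussed above (a regression residual test flagging potential accuracy issues, and a clustering step grouping records by missingness pattern to surface completeness issues), the snippet below uses pandas and scikit-learn; the column names, toy values and thresholds are assumptions, not taken from the study.

```python
# Illustrative sketch (not the study's actual pipeline): a regression
# residual check flags potential accuracy issues, and clustering the
# missingness patterns surfaces groups of records with completeness
# issues. Column names, toy values and thresholds are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

ehr = pd.DataFrame({
    "age":         [34, 51, 29, 45, 62, 70, 38, 55],
    "systolic_bp": [117, 126, 114, 300, 131, 135, 119, np.nan],  # 300 is implausible
    "heart_rate":  [72, 80, np.nan, 75, 69, 88, np.nan, 77],
})

# --- Accuracy: flag records whose systolic_bp deviates strongly -------
complete = ehr.dropna(subset=["age", "systolic_bp"])
reg = LinearRegression().fit(complete[["age"]], complete["systolic_bp"])
residuals = complete["systolic_bp"] - reg.predict(complete[["age"]])
flagged = complete[np.abs(residuals) > 2 * residuals.std()]
print("possible accuracy issues:\n", flagged)

# --- Completeness: cluster records by their missingness pattern -------
missing_pattern = ehr.isna().astype(int)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(missing_pattern)
print("missingness cluster per record:", clusters)
```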

    Operational research IO 2021—analytics for a better world. XXI Congress of APDIO, Figueira da Foz, Portugal, November 7–8, 2021

    This book provides the current status of research on the application of OR methods to solve emerging and relevant operations management problems. Each chapter is a selected contribution of the IO2021 - XXI Congress of APDIO, the Portuguese Association of Operational Research, held in Figueira da Foz from 7 to 8 November 2021. Under the theme of analytics for a better world, the book presents interesting results and applications of OR cutting-edge methods and techniques to various real-world problems. Of particular importance are works applying nonlinear, multi-objective optimization, hybrid heuristics, multicriteria decision analysis, data envelopment analysis, simulation, clustering techniques and decision support systems, in different areas such as supply chain management, production planning and scheduling, logistics, energy, telecommunications, finance and health. All chapters were carefully reviewed by the members of the scientific program committee.

    Monotone Models for Prediction in Data Mining.

    This dissertation studies the incorporation of monotonicity constraints as a type of domain knowledge into a data mining process. Monotonicity constraints are enforced at two stages: data preparation and data modeling. The main contributions of the research are a novel procedure to test the degree of monotonicity of a real data set, a greedy algorithm to transform non-monotone into monotone data, and extended and novel approaches for building monotone decision models. The results from simulation and real case studies show that enforcing monotonicity can considerably improve knowledge discovery and facilitate the decision-making process for end-users by deriving more accurate, stable and plausible decision models.
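
    As an illustration of what testing the degree of monotonicity of a data set can look like, the sketch below counts how many comparable pairs of records respect the expected ordering; this is a generic pairwise consistency check, not necessarily the exact procedure proposed in the dissertation, and the function name and toy data are hypothetical.

```python
# Illustrative sketch: estimate the degree of monotonicity of a labeled
# data set as the fraction of comparable pairs (x_i <= x_j componentwise)
# whose labels are consistently ordered (y_i <= y_j). This is a generic
# pairwise check, not necessarily the test proposed in the dissertation.
import numpy as np

def degree_of_monotonicity(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    comparable = consistent = 0
    for i in range(len(y)):
        for j in range(len(y)):
            if i != j and np.all(X[i] <= X[j]):   # x_i dominated by x_j
                comparable += 1
                consistent += y[i] <= y[j]        # labels respect the order
    return consistent / comparable if comparable else 1.0

# Toy example: the target rises with both features except for one violation.
X = [[1, 1], [2, 1], [2, 3], [3, 3]]
y = [10, 12, 11, 15]                              # the pair (2,1)->(2,3) is violated
print(degree_of_monotonicity(X, y))               # prints 0.833...
```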