1,492 research outputs found

    Global Entropy Based Greedy Algorithm for discretization

    Discretization is a crucial step, both for summarizing continuous attributes and for improving the performance of classifiers that require discrete input values. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on Information Entropy Minimization. Experimental results on well-known benchmark datasets show that the proposed method outperforms state-of-the-art methods. To further improve the method, a new stopping criterion based on the rate of change of entropy was also explored. The experimental analysis indicates that a threshold on the rate of decrease of entropy can be more effective than a fixed number of intervals for classifiers such as C5.0.
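As a rough illustration of the entropy-minimization idea this abstract builds on, the sketch below picks the single cut point that minimizes the weighted class entropy of a binary split. It is not the thesis's Global Entropy Based Greedy algorithm, whose details the abstract does not give; the function names `entropy` and `best_cut` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split.

    Hypothetical helper: candidate cuts are midpoints between consecutive
    distinct sorted values; the score is the size-weighted class entropy
    of the two resulting intervals (lower is better).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Toy data (assumed): two well-separated classes.
cut, score = best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(cut, score)
```

A greedy discretizer would apply such a step repeatedly, and a stopping rule of the kind the abstract mentions could compare the entropy decrease of each new cut against a threshold.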

    A distributional and syntactic approach to fine-grained opinion mining

    This thesis contributes to a larger social science research program of analyzing the diffusion of IT innovations. We show how to automatically identify portions of text that express opinions about innovations by finding {source, target, opinion} triples in text. In this context, we can obtain a list of innovations as targets from the domain itself. We can then use this list as an anchor for finding the other two members of the triple at a “fine-grained” level, that is, paragraph contexts or less. We first demonstrate a vector space model for finding opinionated contexts in which the innovation targets are mentioned. We can find paragraph-level contexts by searching for an “expresses-an-opinion-about” relation between sources and targets using a supervised SVM model whose features are derived from a general-purpose subjectivity lexicon and a corpus indexing tool. We show that our algorithm correctly filters the domain-relevant subset of subjectivity terms so that they are more highly valued. We then turn to identifying the opinion. Typically, opinions in opinion mining are taken to be positive or negative. We discuss a crowdsourcing technique developed to create the seed data, describing human perception of opinion-bearing language, that our supervised learning algorithm needs. Our user interface successfully limited the meta-subjectivity inherent in the task (“What is an opinion?”) while reliably retrieving relevant opinionated words using workers who are not domain experts. Finally, we developed a new data structure and modeling technique for connecting targets with the correct within-sentence opinionated language. Syntactic relatedness tries (SRTs) contain all paths from a dependency graph of a sentence that connect a target expression to a candidate opinionated word. We use factor graphs to model how far a path through the SRT must be followed in order to connect the right targets to the right words. It turns out that we can correctly label significant portions of these tries with very rudimentary features, such as part-of-speech tags and dependency labels, with minimal processing. This technique uses the data from the crowdsourcing technique we developed as training data. We conclude by placing our work in the context of a larger sentiment classification pipeline and by describing a model for learning from the data structures produced by our work. This work contributes to computational linguistics by proposing and verifying new data gathering techniques and by applying recent developments in machine learning to inference over grammatical structures for highly subjective purposes. It applies a suffix tree-based data structure to model opinion in a specific domain by imposing a restriction on the order in which the data is stored in the structure.

    Appearance of Corporate Innovation in Financial Reports : A Text-Based Analysis

    Innovations are important drivers of economic growth and firm profitability. Firms need funding to generate profitable innovations, which is why it is important to distinguish innovative firms reliably. Innovation indicators are used to measure this innovativeness, so it is important that the chosen indicator is reliable and measures innovation as intended. Patents, research and development expenditure, and innovation surveys are examples of popular innovation indicators in the research literature. However, these indicators have weaknesses, which is why new innovation indicators have been developed. This thesis studies the text-based innovation indicator developed by Bellstam et al. (2019) with a new type of data. Bellstam et al. (2019) created a text-based innovation indicator that compares corporations’ analyst reports with an innovation textbook; the similarity between these texts serves as the measure of innovativeness. Analyst reports are usually subject to charge. The 10-K reports used as data for this study, however, are publicly available, so if they work as the basis of the innovation indicator, the indicator would be widely available. The study begins by training a Latent Dirichlet allocation (LDA) model on a sample of 10-K documents from 2008-2018. LDA is an unsupervised machine learning method that finds topics in text documents based on the probabilities of different words. The LDA model was trained to find 15 topics in the data, and its output is the distribution of these topics for each document. The same topic distributions were also computed for eight samples from innovation textbooks. Once the topic distributions were available, the Kullback-Leibler divergence (KL divergence) was calculated between each text sample and each 10-K document. The KL divergence is thus lowest for those reports that are most similar to the innovation text, and it serves as the text-based innovation indicator. Finally, the text-based innovation indicator was validated with regression analysis; in other words, it was confirmed that the indicator measures innovation. The text-based indicator was compared with research and development costs and with the balance sheet value of brands and patents in different linear regressions. Of the eight innovation measurements, most had a statistically significant correlation with one or both of the other innovation indicators. The ability of the text-based indicator to predict the development of sales in the following year was also studied with regression analysis, and all of the measurements had a significant effect on it. The most significant findings of this thesis are the relationship between the text-based innovation indicator and other indicators, and the indicator's ability to predict firms’ sales.
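The comparison step in this abstract, scoring each report by the KL divergence between its topic distribution and that of a textbook sample, can be sketched as follows. This is a minimal illustration, not the thesis's code: the three-topic distributions are made-up toy values (the study used 15 topics), the direction of the divergence is an assumption because the abstract does not state it, and `kl_divergence` is a hypothetical helper name.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Zero entries of q are floored at eps (an assumed smoothing choice)
    so the ratio stays finite; terms with p_i == 0 contribute nothing.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy topic distributions (assumed values for illustration only).
textbook = [0.7, 0.2, 0.1]
report_a = [0.6, 0.3, 0.1]  # topic mix close to the textbook sample
report_b = [0.1, 0.2, 0.7]  # topic mix far from the textbook sample

score_a = kl_divergence(textbook, report_a)
score_b = kl_divergence(textbook, report_b)
print(score_a < score_b)  # the closer report gets the lower divergence
```

Under this reading, a lower divergence marks a report whose topic mix is closer to the innovation text, which is how the abstract describes the indicator.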

    Optimal Institutional Mechanisms for Funding Generic Advertising: An Experimental Analysis

    NICPRE 04-05; R.B. 2004-12. Given the uncertain legal status of generic advertising programs for agricultural commodities, alternative voluntary funding institutions are investigated that could provide a high level of benefits to producers. This experimental study simulates key economic and psychological factors that affect producer contributions to generic advertising. The results suggest that producer referenda play a critical role in increasing contributions and that producer surplus is maximized by a Provision Point Mechanism instituted by producer referendum, with thresholds ranging from 68% to 90% and expected funding from 47% to 77% of the time, depending on the level of advertising effectiveness.

    Novel statistical approaches to text classification, machine translation and computer-assisted translation

    This thesis presents several contributions to the fields of automatic text classification, machine translation, and computer-assisted translation within the statistical framework. In automatic text classification, a new application called bilingual text classification is proposed, together with a series of models designed to capture this bilingual information. Two approaches to this application are presented: the first is based on a naive assumption of independence between the two languages involved, while the second, more sophisticated, considers the existence of a correlation between words in different languages. The first approach led to the development of five models based on unigram models and smoothed n-gram models. These models were evaluated on three tasks of increasing complexity, the most complex of which was analyzed from the point of view of a document indexing aid system. The second approach is characterized by translation models capable of capturing correlation between words in different languages. In our case, the chosen translation model was the M1 model together with a unigram model. This model was evaluated on two of the simplest tasks, where it outperformed the naive approach, which assumes independence between words in different languages drawn from bilingual texts. In machine translation, the word-based statistical translation models M1, M2, and HMM are extended within the mixture modeling framework, with the aim of defining context-dependent translation models. In addition, an iterative search algorithm based on dynamic programming, originally designed for the M2 model, is extended to the case of mixtures of M2 models. This search algorithm ...
    Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502

    AICPA Technical Practice Aids, as of June 1, 1995

    https://egrove.olemiss.edu/aicpa_guides/2332/thumbnail.jp

    Utilizing Consumer Health Posts for Pharmacovigilance: Identifying Underlying Factors Associated with Patients’ Attitudes Towards Antidepressants

    Non-adherence to antidepressants is a major obstacle to the therapeutic benefits of antidepressants, resulting in an increased risk of relapse, emergency visits, and a significant burden on individuals and the healthcare system. Several studies have shown that non-adherence is weakly associated with personal and clinical variables, but strongly associated with patients’ beliefs and attitudes towards medications. The traditional methods for identifying the key dimensions of patients’ attitudes towards antidepressants have methodological limitations, such as concerns about the confidentiality of personal information. This study attempts to address these limitations by using patients’ self-reported experiences in online healthcare forums to identify underlying factors affecting patients’ attitudes towards antidepressants. The data source of the study was a healthcare forum called “askapatients.com”. A total of 892 patient reviews were randomly collected from the forum for the four most commonly prescribed antidepressants: sertraline (Zoloft) and escitalopram (Lexapro) from the SSRI class, and venlafaxine (Effexor) and duloxetine (Cymbalta) from the SNRI class. The methodology of this study consists of two main phases: I) generating structured data from unstructured patient drug reviews and testing hypotheses concerning attitude; II) identifying and normalizing Adverse Drug Reactions (ADRs), Withdrawal Symptoms (WDs), and Drug Indications (DIs) from the posts, and mapping them to both UMLS and SNOMED CT concepts. Phase II also includes testing the association between ADRs and attitude. The results of the first phase showed that “experience of adverse drug reactions”, “perceived distress received from ADRs”, “lack of knowledge about medication’s mechanism”, “withdrawal experience”, “duration of usage”, and “drug effectiveness” are strongly associated with patients’ attitudes. However, demographic variables including “age” and “gender” are not associated with attitude. Analysis of the data in the second phase showed that of the 6,534 identified entities, 73% are ADRs, 12% are WDs, and 15% are drug indications. In addition, psychological and cognitive expressions have higher variability than physiological expressions. All three types of entities were mapped to 811 UMLS and SNOMED CT concepts. Testing the association between ADRs and attitude showed that, of the twenty-one physiological ADRs specified in the ASEC questionnaire, “dry mouth”, “increased appetite”, “disorientation”, “yawning”, “weight gain”, and “problem with sexual dysfunction” are associated with attitude. A set of psychological and cognitive ADRs, such as “emotional indifference” and “memory problem”, was also tested and showed a significant association with attitude. The findings of this study have important implications for designing clinical interventions that aim to improve patients’ adherence to antidepressants. In addition, the dataset generated in this study has significant implications for improving the performance of text-mining algorithms that aim to identify health-related information in consumer health posts. Moreover, the dataset can be used for generating and testing hypotheses related to ADRs associated with psychiatric medications, and for identifying factors associated with discontinuation of antidepressants. The dataset and guidelines of this study are available at https://sites.google.com/view/pharmacovigilanceinpsychiatry/hom

    AICPA Technical Practice Aids, as of June 1, 1997

    https://egrove.olemiss.edu/aicpa_guides/2334/thumbnail.jp

    AICPA Technical Practice Aids, as of June 1, 1996

    https://egrove.olemiss.edu/aicpa_guides/2333/thumbnail.jp