Global Entropy Based Greedy Algorithm for discretization
Discretization is a crucial step, not only for summarizing continuous attributes but also for improving the performance of classifiers that require discrete input values. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on Information Entropy Minimization. Experimental results on well-known benchmark datasets show that the proposed method outperforms state-of-the-art methods. To further improve the method, a new stopping criterion based on the rate of change of entropy was also explored. The experimental analysis indicates that a threshold on the rate of decrease of entropy can be more effective than a fixed number of intervals for classifiers such as C5.0.
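The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way an entropy-minimizing greedy discretizer with a rate-based stopping rule could look. The function names, the midpoint candidate cuts, and the `min_rate` threshold are assumptions made for illustration, not the thesis's actual procedure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_entropy(pairs, cuts):
    """Class entropy of the intervals induced by `cuts`, weighted by interval size."""
    bounds = sorted(cuts)
    bins = [[] for _ in range(len(bounds) + 1)]
    for value, label in pairs:
        idx = sum(value > b for b in bounds)  # index of the interval holding value
        bins[idx].append(label)
    n = len(pairs)
    return sum(len(b) / n * entropy(b) for b in bins)

def greedy_discretize(values, labels, min_rate=0.05):
    """Greedily add the cut that lowers the global weighted entropy the most.

    Stops when the relative entropy decrease falls below `min_rate`
    (a hypothetical stand-in for the rate-based stopping criterion).
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    # Candidate cuts: midpoints between consecutive distinct values (assumption).
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts = []
    current = partition_entropy(pairs, cuts)
    while candidates and current > 0:
        best = min(candidates, key=lambda c: partition_entropy(pairs, cuts + [c]))
        new = partition_entropy(pairs, cuts + [best])
        if (current - new) / current < min_rate:
            break
        cuts.append(best)
        candidates.remove(best)
        current = new
    return sorted(cuts)

example_cuts = greedy_discretize([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

In this toy example a single cut between 3 and 10 separates the two classes completely, so the loop stops after one iteration. Every candidate is re-scored against the whole partition on each pass, which is simple but quadratic in the number of candidates.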
A distributional and syntactic approach to fine-grained opinion mining
This thesis contributes to a larger social science research program of
analyzing the diffusion of IT innovations. We show how to
automatically discriminate portions of text dealing with opinions
about innovations by finding {source, target, opinion} triples in text.
In this context, we can discern a list of innovations as targets from
the domain itself. We can then use this list as an anchor for finding
the other two members of the triple at a "fine-grained"
level, that is, paragraph contexts or less.
We first demonstrate a vector space model for finding opinionated
contexts in which the innovation targets are mentioned. We can find
paragraph-level contexts by searching for an
"expresses-an-opinion-about" relation between sources and targets
using a supervised model with an SVM that uses features derived from a
general-purpose subjectivity lexicon and a corpus indexing tool. We
show that our algorithm correctly filters the domain-relevant subset
of subjectivity terms so that they are more highly valued.
We then turn to identifying the opinion. Typically, opinions in
opinion mining are taken to be positive or negative. We discuss a
crowdsourcing technique developed to create the seed data describing
human perception of opinion-bearing language needed for our supervised
learning algorithm. Our user interface successfully limited the
meta-subjectivity inherent in the task ("What is an opinion?") while
reliably retrieving relevant opinionated words using labour that is not
expert in the domain.
Finally, we developed a new data structure and modeling technique for
connecting targets with the correct within-sentence opinionated
language. Syntactic relatedness tries (SRTs) contain all paths from a
dependency graph of a sentence that connect a target expression to a
candidate opinionated word. We use factor graphs to model how far a
path through the SRT must be followed in order to connect the right
targets to the right words. It turns out that we can correctly label
significant portions of these tries with very rudimentary features
such as part-of-speech tags and dependency labels with minimal
processing. This technique uses the data from the crowdsourcing
technique we developed as training data.
We conclude by placing our work in the context of a larger sentiment
classification pipeline and by describing a model for learning from
the data structures produced by our work. This work contributes to
computational linguistics by proposing and verifying new data
gathering techniques and applying recent developments in machine
learning to inference over grammatical structures for highly
subjective purposes. It applies a suffix tree-based data structure to
model opinion in a specific domain by imposing a restriction on the
order in which the data is stored in the structure.
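As a rough illustration of the syntactic relatedness tries described in this abstract, the sketch below stores every cycle-free dependency path from a target word to each candidate opinion word in a nested-dictionary trie. The edge format, the `^-1` marker for walking an edge upward, the `"$end"` key, and the use of word strings as node identifiers are all assumptions; the thesis's actual construction may differ.

```python
from collections import defaultdict

def build_graph(edges):
    """Turn (head, label, dependent) triples into an undirected adjacency map."""
    graph = defaultdict(list)
    for head, label, dep in edges:
        graph[head].append((label, dep))
        graph[dep].append((label + "^-1", head))  # same edge, walked upward
    return graph

def simple_paths(graph, start, goal, path=None, seen=None):
    """Yield every cycle-free path from start to goal as a list of (label, word) steps."""
    path = path or []
    seen = seen or {start}
    if start == goal:
        yield path
        return
    for label, nxt in graph[start]:
        if nxt not in seen:
            yield from simple_paths(graph, nxt, goal, path + [(label, nxt)], seen | {nxt})

def build_trie(edges, target, candidates):
    """Insert each target-to-candidate path into a trie keyed by (label, word) steps."""
    graph = build_graph(edges)
    trie = {}
    for word in candidates:
        for path in simple_paths(graph, target, word):
            node = trie
            for step in path:
                node = node.setdefault(step, {})
            node["$end"] = word  # marks that a candidate word ends here
    return trie

# Hypothetical parse of "analysts praise innovative platform"; words double as node ids.
edges = [
    ("praise", "nsubj", "analysts"),
    ("praise", "dobj", "platform"),
    ("platform", "amod", "innovative"),
]
trie = build_trie(edges, "platform", ["praise", "innovative", "analysts"])
```

Paths that share a prefix share trie nodes: here the path to "analysts" extends the path to "praise". A model that decides how far along a path to go, as the abstract describes, could then label each trie node.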
Appearance of Corporate Innovation in Financial Reports: A Text-Based Analysis
Innovations are important drivers of economic growth and firm profitability. Firms need funding to generate profitable innovations, which is why it is important to identify innovative firms reliably. Innovation indicators are used to measure innovativeness, so the indicator used must be reliable and must measure innovation as intended.
Patents, research and development expenditure and innovation surveys are examples of popular innovation indicators in the research literature. However, these indicators have weaknesses, which is why new innovation indicators have been developed. This thesis studies the text-based innovation indicator developed by Bellstam et al. (2019) using a new type of data. Bellstam et al. (2019) built their indicator by comparing analyst reports on corporations with an innovation textbook; the similarity between these texts served as the measure of innovativeness. Analyst reports are usually available only for a fee. The 10-K reports used as data in this study, by contrast, are publicly available, so if they work as the basis of the innovation indicator, the indicator would be widely available.
The study begins by training a Latent Dirichlet allocation (LDA) model on a sample of 10-K documents from 2008-2018. LDA is an unsupervised machine learning method that finds topics in text documents based on the probabilities of different words. The model was trained to find 15 topics in the data, and its output is the distribution of these topics in each document. The same topic distributions were also computed for eight samples from innovation textbooks. Once the topic distributions were available, the Kullback-Leibler divergence (KL divergence) was calculated between each textbook sample and each 10-K document. The KL divergence is thus lowest for the reports that are most similar to the innovation text, and it serves as the text-based innovation indicator.
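The scoring step can be sketched in a few lines. The example below assumes topic distributions are already available as plain lists of probabilities, and it assumes the divergence is taken from the textbook sample to the report, since the abstract does not state the direction. The smoothing constant `eps`, the function names, and the three-topic toy data are illustrative only.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return D_KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Smooth and renormalize so zero-probability topics do not cause log(0).
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    return sum((a / p_sum) * math.log((a / p_sum) / (b / q_sum)) for a, b in zip(p, q))

def rank_reports(textbook_topics, report_topics):
    """Order report names from most to least similar to the textbook sample."""
    scores = {name: kl_divergence(textbook_topics, dist)
              for name, dist in report_topics.items()}
    return sorted(scores, key=scores.get)

# Toy three-topic distributions (the study used 15 topics); names are hypothetical.
textbook = [0.7, 0.2, 0.1]
reports = {"firm_A": [0.6, 0.3, 0.1], "firm_B": [0.1, 0.2, 0.7]}
ranking = rank_reports(textbook, reports)
```

Because the KL divergence is not symmetric, swapping the two arguments can change the ranking, so the direction is a modelling choice that should be fixed in advance.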
Finally, the text-based innovation indicator was validated with regression analysis, that is, it was confirmed that the indicator measures innovation. The text-based indicator was compared with research and development costs and with the balance sheet value of brands and patents in separate linear regressions. Of the eight innovation measurements, most had a statistically significant association with one or both of the other innovation indicators. The ability of the text-based indicator to predict the development of sales in the following year was also studied with regression analysis, and all of the measurements had a significant effect on it. The most significant findings of this thesis are the relationship between the text-based innovation indicator and other indicators, and the indicator's ability to predict firms’ sales.
Optimal Institutional Mechanisms for Funding Generic Advertising: An Experimental Analysis
NICPRE 04-05; R.B. 2004-12. Given the uncertain legal status of generic advertising programs for agricultural commodities, alternative voluntary funding institutions are investigated that could provide a high level of benefits to producers. This experimental study simulates key economic and psychological factors that affect producer contributions to generic advertising. The results suggest that producer referenda play a critical role in increasing contributions and that producer surplus is maximized by a Provision Point Mechanism instituted by producer referendum, with thresholds ranging from 68% to 90% and expected funding from 47% to 77% of the time, depending on the level of advertising effectiveness.
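For readers unfamiliar with provision point mechanisms, the sketch below shows the basic rule: the program is funded only if pooled contributions reach a threshold. The money-back rule on failure, the function name, and the example figures are assumptions for illustration; the abstract does not describe the experiment's exact design.

```python
def provision_point_outcome(contributions, provision_point, refund_on_failure=True):
    """Apply a provision point rule to voluntary contributions.

    contributions: mapping of producer name to offered contribution.
    provision_point: minimum pooled amount needed to fund the program.
    Returns (funded, payments), where payments is what each producer actually pays.
    """
    total = sum(contributions.values())
    funded = total >= provision_point
    if funded or not refund_on_failure:
        payments = dict(contributions)
    else:
        # Assumed money-back guarantee: nobody pays if the threshold is missed.
        payments = {name: 0 for name in contributions}
    return funded, payments

# Hypothetical case: three producers, threshold set at 68 units.
offers = {"producer_1": 40, "producer_2": 30, "producer_3": 10}
funded, payments = provision_point_outcome(offers, provision_point=68)
shortfall = provision_point_outcome({"producer_1": 40, "producer_2": 20}, provision_point=68)
```

Raising the threshold makes funding less likely but secures a larger budget when funding does succeed, which is consistent with the trade-off between threshold level and funding frequency reported in the abstract.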
Novel statistical approaches to text classification, machine translation and computer-assisted translation
This thesis presents several contributions to the fields of automatic
text classification, machine translation and computer-assisted
translation under the statistical framework.
In automatic text classification, a new application called bilingual
text classification is proposed, together with a series of models
aimed at capturing this bilingual information. To this end, two
approaches to this application are presented: the first is based on a
naive assumption of independence between the two languages involved,
while the second, more sophisticated, considers the existence of a
correlation between words in different languages. The first approach
led to the development of five models based on unigram models and
smoothed n-gram models. These models were evaluated on three tasks of
increasing complexity, the most complex of which was analyzed from
the point of view of a document indexing assistance system. The second
approach is characterized by translation models capable of capturing
correlation between words in different languages. In our case, the
chosen translation model was the M1 model together with a unigram
model. This model was evaluated on the two simplest tasks, where it
outperformed the naive approach, which assumes independence between
words in different languages coming from bilingual texts.
In machine translation, the word-based statistical translation models
M1, M2 and HMM are extended under the mixture modelling framework, with
the aim of defining context-dependent translation models. Likewise, an
iterative search algorithm based on dynamic programming, originally
designed for the M2 model, is extended to the case of mixtures of M2
models. This search algorithm …
Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502
AICPA Technical Practice Aids, as of June 1, 1995
https://egrove.olemiss.edu/aicpa_guides/2332/thumbnail.jp
Utilizing Consumer Health Posts for Pharmacovigilance: Identifying Underlying Factors Associated with Patients’ Attitudes Towards Antidepressants
Non-adherence to antidepressants is a major obstacle to realizing their therapeutic benefits, resulting in an increased risk of relapse, emergency visits, and a significant burden on individuals and the healthcare system. Several studies have shown that non-adherence is weakly associated with personal and clinical variables but strongly associated with patients’ beliefs and attitudes towards medications. Traditional methods for identifying the key dimensions of patients’ attitudes towards antidepressants have methodological limitations, such as concerns about the confidentiality of personal information. This study addresses these limitations by using patients’ self-reported experiences in online healthcare forums to identify the underlying factors affecting patients’ attitudes towards antidepressants. The data source was a healthcare forum called “askapatients.com”. A total of 892 patient reviews were randomly collected from the forum for the four most commonly prescribed antidepressants: Sertraline (Zoloft) and Escitalopram (Lexapro) from the SSRI class, and Venlafaxine (Effexor) and Duloxetine (Cymbalta) from the SNRI class. The methodology comprises two main phases: I) generating structured data from unstructured patient drug reviews and testing hypotheses concerning attitude, and II) identifying and normalizing Adverse Drug Reactions (ADRs), Withdrawal Symptoms (WDs) and Drug Indications (DIs) in the posts and mapping them to both UMLS and SNOMED CT concepts. Phase II also includes testing the association between ADRs and attitude. The first phase showed that “experience of adverse drug reactions”, “perceived distress received from ADRs”, “lack of knowledge about medication’s mechanism”, “withdrawal experience”, “duration of usage”, and “drug effectiveness” are strongly associated with patients’ attitudes.
However, demographic variables including “age” and “gender” are not associated with attitude. Analysis of the data in the second phase showed that of the 6,534 identified entities, 73% are ADRs, 12% are WDs, and 15% are drug indications. In addition, psychological and cognitive expressions have higher variability than physiological expressions. All three types of entities were mapped to 811 UMLS and SNOMED CT concepts. Testing the association between ADRs and attitude showed that, of the twenty-one physiological ADRs specified in the ASEC questionnaire, “dry mouth”, “increased appetite”, “disorientation”, “yawning”, “weight gain”, and “problem with sexual dysfunction” are associated with attitude. A set of psychological and cognitive ADRs, such as “emotional indifference” and “memory problem”, was also tested and showed a significant association with attitude. The findings of this study have important implications for designing clinical interventions that aim to improve patients’ adherence to antidepressants. In addition, the dataset generated in this study has significant implications for improving the performance of text-mining algorithms that aim to identify health-related information in consumer health posts. Moreover, the dataset can be used for generating and testing hypotheses related to ADRs associated with psychiatric medications and for identifying factors associated with discontinuation of antidepressants. The dataset and guidelines of this study are available at https://sites.google.com/view/pharmacovigilanceinpsychiatry/hom
AICPA Technical Practice Aids, as of June 1, 1997
https://egrove.olemiss.edu/aicpa_guides/2334/thumbnail.jp
AICPA Technical Practice Aids, as of June 1, 1996
https://egrove.olemiss.edu/aicpa_guides/2333/thumbnail.jp