31 research outputs found

    COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

    COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when the data is too large to fit on a single machine. For the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably, in both accuracy and training time, to learning on a subsample of the data with a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more.
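
The abstract does not spell out the stopping rule behind lazy ensemble evaluation, so the following is only a minimal sketch of the general idea, under stated assumptions: members cast binary votes (0 or 1), and evaluation stops once a normal-approximation interval around the smoothed mean vote excludes 0.5. The function name `lazy_vote` and the parameters `z` and `min_members` are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Iterable, Tuple


def lazy_vote(members: Iterable[Callable[[object], int]], x: object,
              z: float = 2.58, min_members: int = 10) -> Tuple[int, int]:
    """Return (majority label, number of members evaluated) for a binary ensemble.

    Assumed stopping rule (not taken from the paper): stop once a normal
    approximation interval around the smoothed mean vote excludes 0.5.
    """
    positives = 0
    used = 0
    for member in members:
        positives += member(x)  # each member votes 0 or 1
        used += 1
        if used < min_members:
            continue
        # Laplace-smoothed vote fraction keeps the standard error above zero
        p = (positives + 1) / (used + 2)
        se = math.sqrt(p * (1 - p) / used)
        if abs(p - 0.5) > z * se:
            break  # the remaining members are unlikely to flip the majority
    label = 1 if 2 * positives >= used else 0
    return label, used


# Usage: a unanimous ensemble is cut off after min_members evaluations.
unanimous = [lambda x: 1] * 1000
label, used = lazy_vote(unanimous, x=None)
```

Points on which the members agree are settled after a handful of evaluations, while points with a near-even vote consume the whole ensemble, which is where the per-point savings would come from.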

    Model-based classification for subcellular localization prediction of proteins

    Combining classification algorithms

    Doctoral dissertation in Computer Science submitted to the Faculdade de Ciências da Universidade do Porto. The ability of a learning algorithm to induce a good generalization for a given problem depends on the representation language used to generalize the examples. Because different algorithms use different representation languages and search strategies, they explore different spaces and obtain different results. Finding the most suitable representation for the problem at hand is a very active research area. In this dissertation, instead of looking for methods that fit the data using a single representation language, we present a family of algorithms, under the generic name Cascade Generalization, in which the search space contains models that use different representation languages. The basic idea of the method is to use the learning algorithms in sequence. Each iteration is a two-step process. In the first step, a classifier builds a model. In the second step, the space defined by the attributes is extended by inserting new attributes generated using this model. This construction process builds attributes in the representation language of the classifier used to build the model. If a classifier later in the sequence uses one of these new attributes to build its model, its representational capacity has been extended. In this way, the restrictions of the representation language of the classifiers used at higher levels of the sequence are relaxed by incorporating terms from the representation language of the base classifiers. This is the basic methodology underlying the Ltree system and the Cascade Generalization architecture. The method is presented from two perspectives. In the first part, it is presented as a strategy for building multivariate decision trees. The Ltree system is presented, which uses a linear discriminant as the operator for constructing attributes. ..
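
The two-step loop described in this abstract (a classifier builds a model, then the attribute space is extended with attributes generated by that model) can be sketched as follows. This is an illustrative toy under stated assumptions, not the Ltree system: the base learner is a nearest-centroid classifier standing in for the linear discriminant, the higher-level learner is a single-split decision stump, and all function names and the six-point dataset are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def fit_centroids(X: List[Row], y: List[int]) -> List[List[float]]:
    """Step 1 (base learner): one mean vector per class, labels 0 and 1."""
    centroids = []
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])
    return centroids


def centroid_score(centroids: List[List[float]], x: Row) -> float:
    """New attribute: positive when x is closer to the class-1 centroid."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return d[0] - d[1]


def extend(centroids: List[List[float]], X: List[Row]) -> List[List[float]]:
    """Step 2: append the base model's output as an extra attribute."""
    return [list(x) + [centroid_score(centroids, x)] for x in X]


def fit_stump(X: List[Row], y: List[int]) -> Tuple[int, float, int]:
    """Higher-level learner: best single axis-parallel split."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            for sign in (1, -1):
                errors = sum((1 if sign * (x[j] - thr) > 0 else 0) != t
                             for x, t in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, thr, sign)
    return best[1:]


def accuracy(stump: Tuple[int, float, int], X: List[Row], y: List[int]) -> float:
    j, thr, sign = stump
    hits = sum((1 if sign * (x[j] - thr) > 0 else 0) == t for x, t in zip(X, y))
    return hits / len(y)


# Toy data: class 1 when x0 + x1 > 1, which no single axis-parallel split captures.
X = [(0.0, 0.0), (0.2, 0.5), (0.5, 0.2), (1.0, 1.0), (0.9, 0.4), (0.4, 0.9)]
y = [0, 0, 0, 1, 1, 1]

raw_stump = fit_stump(X, y)               # stump on the original attributes
centroids = fit_centroids(X, y)           # step 1: base model
X_ext = extend(centroids, X)              # step 2: extended attribute space
cascade_stump = fit_stump(X_ext, y)       # stump on the extended attributes
```

On this toy data the stump alone cannot separate the classes, because the boundary is oblique, while the stump trained on the extended space splits on the appended attribute. That mirrors the abstract's point that a later classifier's representation language is relaxed by terms from the base classifier's language.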

    決策樹形式知識之線上預測系統架構 | An On-Line Decision Tree-Based Predictive System Architecture

    Pages: 60-76
    This study proposes an on-line predictive system architecture for knowledge in decision tree form. Its main purpose is to provide a Web-based knowledge discovery (KD) and on-line prediction system with which users can inductively learn knowledge in decision tree form and use that knowledge on line for classification and prediction. Its components comprise three subsystems (the knowledge learning subsystem, the merging optional decision trees subsystem, and the on-line prediction subsystem); three repositories (the decision tree knowledge rule base, the example database, and the historical knowledge rule base); and three interfaces for importing knowledge rules (an interface for uploading example sets, an interface for entering decision tree knowledge rules, and a module for converting decision tree PMML (Predictive Model Markup Language) documents). In terms of the overall workflow, for knowledge learning we first upload an example set and then use the knowledge learning subsystem to discover knowledge, which is stored directly in the knowledge rule base. For knowledge use, the on-line prediction subsystem accesses the knowledge in the rule base to perform classification and prediction. For knowledge exchange, the system provides a module for converting PMML documents, which makes it easy to import decision tree knowledge induced by other mining tools. For knowledge integration, the system uses the merging optional decision trees subsystem to merge multiple decision trees into a single tree. This subsystem helps maintain the knowledge in the rule base: knowledge rules can be extended while the knowledge keeps a simple tree structure, and that simple structure helps the on-line prediction subsystem explain its predictions. As future work, this study plans to implement the components of this architecture, to propose pruning strategies that improve the predictive accuracy of merged decision trees, and to address how to maintain the knowledge in the decision tree rule base effectively.
    This paper presents an on-line decision tree-based predictive system architecture. The architecture contains nine components: a database of examples, a learning system for decision trees, a knowledge base, a historical knowledge base, a maintenance interface for the decision trees, an interface for uploading training and testing examples, a PMML (Predictive Model Markup Language) translator, an on-line predictive system, and a merging optional decision trees system. There are three channels for importing knowledge into the architecture: developers can upload examples to the learning system to induce a decision tree, enter decision tree information directly through the user interface, or import decision trees in PMML format. To integrate the knowledge of the decision trees, we added the merging optional decision trees system to the architecture; it combines multiple decision trees into a single decision tree. In future research, we will implement this architecture as a real system on a web-based platform in order to carry out empirical analyses. To improve the performance of the merged decision trees, we will also develop pruning strategies for the merging optional decision trees system.

    Quantitative Assessment of Factors in Sentiment Analysis

    Sentiment can be defined as a tendency to experience certain emotions in relation to a particular object or person. Sentiment may be expressed in writing, in which case determining that sentiment algorithmically is known as sentiment analysis. Sentiment analysis is often applied to Internet texts such as product reviews, websites, blogs, or tweets, where automatically determining published feeling towards a product or service is very useful to marketers or opinion analysts. The main goal of sentiment analysis is to identify the polarity of natural language text. This thesis sets out to examine quantitatively the factors that have an effect on sentiment analysis. The factors that are commonly used in sentiment analysis are text features, sentiment lexica or resources, and the machine learning algorithms employed. The main aim of this thesis is to investigate systematically the interaction between sentiment analysis factors and machine learning algorithms in order to improve sentiment analysis performance as compared to the opinions of human assessors. A software system known as TJP was designed and developed to support this investigation. The research reported here has three main parts. Firstly, the role of data pre-processing was investigated with TJP using a combination of features together with publicly available datasets. This considers the relationship and relative importance of superficial text features such as emoticons, n-grams, negations, hashtags, repeated letters, special characters, slang, and stopwords. The resulting statistical analysis suggests that a combination of all of these features achieves better accuracy on the dataset and has a considerable effect on system performance. Secondly, the effect of human marked-up training data was considered, since this is required by supervised machine learning algorithms. The results gained from TJP suggest that training data greatly augments sentiment analysis performance. However, the combination of training data and sentiment lexica seems to provide optimal performance. Nevertheless, one particular sentiment lexicon, AFINN, contributed more than the others in the absence of training data, and would therefore be appropriate for unsupervised approaches to sentiment analysis. Finally, the performance of two sophisticated ensemble machine learning algorithms was investigated. Both the Arbiter Tree and the Combiner Tree were chosen since neither of them has previously been used with sentiment analysis. The objective here was to demonstrate their applicability and effectiveness compared to that of the leading single machine learning algorithms, Naïve Bayes and Support Vector Machines. The results showed that whilst either can be applied to sentiment analysis, the Arbiter Tree ensemble algorithm achieved better accuracy than either the Combiner Tree or any single machine learning algorithm.