8 research outputs found

    Extraction of Key Phrases from Biomedical Full Text with Supervised Learning Techniques

    Get PDF
    Key-phrase extraction plays a useful role in Information Systems (IS) research areas such as digital libraries. Short metadata such as key phrases can help searchers grasp the concepts of a document. This paper evaluates the effectiveness of different supervised learning techniques on biomedical full text: Naïve Bayes, linear regression, and support vector machines (SVMreg-1 and SVMreg-2), all of which could be embedded inside an IS for document search. We use these techniques to extract key phrases from PubMed documents and evaluate the resulting systems using the well-established holdout validation method. The contributions of this paper are a comparison among different classifier techniques and a comparison of performance between full text and abstracts. Our experiments found that SVMreg-1 yields the best key-phrase extraction performance on full text, while Naïve Bayes performs best on abstracts. These techniques should be considered for use in information-system search functionality. Additional research issues are also identified.
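As a rough illustration of the classification step, the sketch below trains a hand-rolled Gaussian Naïve Bayes model on two toy candidate-phrase features (normalized term frequency and relative position of first occurrence). The features, values, and labels are invented for illustration and are not the feature set used in the paper.

```python
import math
from collections import defaultdict

# Toy training data: each candidate phrase is a pair of features
# (normalized term frequency, relative position of first occurrence),
# labelled 1 if it is a gold-standard key phrase. Values are illustrative.
train = [
    ((0.08, 0.02), 1), ((0.06, 0.05), 1), ((0.07, 0.10), 1),
    ((0.01, 0.60), 0), ((0.02, 0.80), 0), ((0.01, 0.45), 0),
]

def fit_gaussian_nb(data):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append(x)
    model, n = {}, len(data)
    for y, rows in by_class.items():
        prior = len(rows) / n
        params = []
        for j in range(len(rows[0])):
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6
            params.append((mu, var))
        model[y] = (prior, params)
    return model

def log_gauss(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    # Pick the class maximizing log prior + summed log likelihoods.
    scores = {}
    for y, (prior, params) in model.items():
        scores[y] = math.log(prior) + sum(
            log_gauss(xj, mu, var) for xj, (mu, var) in zip(x, params))
    return max(scores, key=scores.get)

model = fit_gaussian_nb(train)
# A frequent, early-occurring candidate is classified as a key phrase.
print(predict(model, (0.07, 0.04)))  # 1
print(predict(model, (0.01, 0.70)))  # 0
```

In practice, the classifier would be applied to every candidate phrase of a document and the top-scoring candidates returned as key phrases.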

    Comprehensible and Robust Knowledge Discovery from Small Datasets

    Get PDF
    Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) comes in a well-understandable form. Decision trees and subgroup-discovery methods deliver knowledge summaries in the form of hyperrectangles, which are considered easy to understand. To demonstrate the importance of a comprehensible data summary, we study Decentral Smart Grid Control ("Dezentrale intelligente Netzsteuerung"), a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system carried out so far was limited to considering identical participants and therefore did not reflect reality sufficiently well. We run many simulations with different input values and apply decision trees to the resulting data. The resulting comprehensible data summaries allowed us to gain new insights into the behavior of Decentral Smart Grid Control. Decision trees make it possible to describe the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but in finding regions that lead to certain outputs (so-called subgroups). Existing subgroup-discovery algorithms usually require large amounts of data to produce stable and accurate output. The data-collection process, however, is often costly.
Our main contribution is improving subgroup discovery from datasets with few observations. Subgroup discovery in simulated data is known as scenario discovery. A frequently used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS performs much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average. With simulated data, one has perfect knowledge of the input distribution, a prerequisite of REDS. To make REDS applicable to real-world measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We experimentally evaluated the resulting method in combination with different data-generation methods. We did this for PRIM and for BestInterval, another representative subgroup-discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
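The REDS procedure described above (train an intermediate model on a few expensive simulation runs, use it to label many cheap samples drawn from the known input distribution, then run PRIM on the generated data) can be sketched roughly as follows. The simulation function, the 1-nearest-neighbour intermediate model, and the simplified single-box peeling are illustrative assumptions, not the thesis's actual implementation.

```python
import random
random.seed(0)

# Hypothetical simulation: the output is 1 ("interesting") when both inputs
# are high. In reality each call would be an expensive simulation run.
def simulate(x):
    return 1 if x[0] > 0.6 and x[1] > 0.5 else 0

# Step 1: few expensive simulation runs.
X = [(random.random(), random.random()) for _ in range(40)]
y = [simulate(x) for x in X]

# Step 2: intermediate model -- here a 1-nearest-neighbour predictor.
def predict(x):
    return min(zip(X, y),
               key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)[1]

# Step 3: many cheap labelled points from the known input distribution.
Xg = [(random.random(), random.random()) for _ in range(4000)]
yg = [predict(x) for x in Xg]

# Step 4: PRIM-style peeling -- repeatedly trim the 5% slice (per dimension,
# per side) whose removal most increases the mean target value in the box.
box = [[0.0, 1.0], [0.0, 1.0]]

def inside(x, b):
    return all(b[j][0] <= x[j] <= b[j][1] for j in range(2))

def box_mean(b):
    pts = [t for x, t in zip(Xg, yg) if inside(x, b)]
    return sum(pts) / len(pts) if pts else 0.0

for _ in range(40):
    best, best_mean = None, box_mean(box)
    for j in range(2):
        width = box[j][1] - box[j][0]
        for side, new in ((0, box[j][0] + 0.05 * width),
                          (1, box[j][1] - 0.05 * width)):
            cand = [list(b) for b in box]
            cand[j][side] = new
            m = box_mean(cand)
            if m > best_mean:
                best, best_mean = cand, m
    if best is None:
        break  # no peel improves the mean any further
    box = best

print(box)  # approaches the true subgroup x0 > 0.6, x1 > 0.5
```

The point of the intermediate model is that step 4 runs on thousands of cheap labelled points instead of the 40 expensive simulation runs.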

    Recognition of Human Activities for Teaching Robots

    Get PDF
    For use within the programming-by-demonstration paradigm for programming robots, an approach for the classification and interpretation of human movements was developed. To this end, extended methods for observing movements were investigated, and a processing chain was developed that, drawing on background knowledge, maps movement sequences onto features suitable for the activity at hand and uses these features to recognize activities.
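A minimal sketch of the idea of mapping movement sequences to features and recognizing activities from them; the trajectories, features, and activity labels here are invented for illustration and are far simpler than the approach described above.

```python
def features(seq):
    """Map a 1-D trajectory (e.g., hand height over time) to two simple
    features: total displacement and number of direction changes."""
    disp = abs(seq[-1] - seq[0])
    changes = sum(1 for a, b, c in zip(seq, seq[1:], seq[2:])
                  if (b - a) * (c - b) < 0)
    return (disp, changes)

# Toy labelled sequences: a "reach" moves steadily, a "wave" oscillates.
train = {
    "reach": [[0, 1, 2, 3, 4, 5], [0, 2, 3, 5, 6, 8]],
    "wave":  [[0, 2, 0, 2, 0, 2], [1, 3, 1, 3, 1, 3]],
}

# Nearest-centroid classifier in feature space.
centroids = {}
for label, seqs in train.items():
    fs = [features(s) for s in seqs]
    centroids[label] = tuple(sum(v) / len(v) for v in zip(*fs))

def classify(seq):
    f = features(seq)
    return min(centroids,
               key=lambda l: sum((a - b) ** 2
                                 for a, b in zip(f, centroids[l])))

print(classify([0, 1, 3, 4, 6, 7]))  # reach
print(classify([0, 3, 0, 3, 0, 3]))  # wave
```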

    The role of classifiers in feature selection: number vs nature

    Get PDF
    Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, wrappers suffer from the fact that they use only a single classifier when selecting the features. The problem with using a single classifier is that each classifier is of a different nature and has its own biases, so each classifier will select different feature subsets. To address this problem, this thesis investigates the effects of using different classifiers for wrapper feature selection; more specifically, it investigates the effects of using different numbers of classifiers and classifiers of different natures. This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method can combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features, and can visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of the number of classifiers on feature selection results; (2) the effects of the nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to help answer these three research questions. The results of the investigation revealed that both the number of classifiers and the nature of classifiers greatly affect feature selection results. In terms of the number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets.
In terms of the nature of classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding the number of classifiers and the nature of classifiers revealed that the former has more of an effect on feature selection than the latter. The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT, which integrates the use of multiple classifiers for feature selection with decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number of classifiers and the nature of classifiers can truly affect the feature selection process. The results, and the suggestions based on them, can provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users.
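A toy sketch of wrapper feature selection with more than one classifier, in the spirit of (but much simpler than) the WDT method: two classifiers of different natures (1-nearest-neighbour and nearest-centroid) score each feature subset by leave-one-out accuracy, and the scores are averaged. The dataset and classifiers are illustrative assumptions, not those used in the thesis.

```python
import itertools

# Toy dataset: 4 features; the label depends only on features 0 and 1
# (their conjunction), while features 2 and 3 are noise.
data = [
    ((0, 0, 1, 7), 0), ((0, 1, 0, 3), 0), ((1, 0, 1, 9), 0), ((0, 0, 0, 1), 0),
    ((1, 1, 0, 2), 1), ((1, 1, 1, 8), 1), ((1, 1, 0, 5), 1), ((1, 1, 1, 0), 1),
]

def project(x, fs):
    return tuple(x[j] for j in fs)

def knn_predict(train, x, fs):
    # 1-nearest-neighbour on the selected features.
    return min(train, key=lambda p: sum(
        (a - b) ** 2
        for a, b in zip(project(p[0], fs), project(x, fs))))[1]

def centroid_predict(train, x, fs):
    # Nearest class centroid on the selected features.
    cents = {}
    for lbl in {y for _, y in train}:
        rows = [project(p, fs) for p, y in train if y == lbl]
        cents[lbl] = tuple(sum(v) / len(v) for v in zip(*rows))
    xf = project(x, fs)
    return min(cents, key=lambda l: sum(
        (a - b) ** 2 for a, b in zip(xf, cents[l])))

def loo_accuracy(predict, fs):
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, x, fs) == y
    return hits / len(data)

def score(fs):
    # Wrapper score: average leave-one-out accuracy over both classifiers.
    return (loo_accuracy(knn_predict, fs)
            + loo_accuracy(centroid_predict, fs)) / 2

# Exhaustive search over 2-feature subsets (a greedy search scales better).
best = max(itertools.combinations(range(4), 2), key=score)
print(best)  # (0, 1) -- the two genuinely relevant features
```

Averaging over classifiers of different natures damps the bias any single classifier would impose on the selected subset.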

    Bridging the Gap between Distance and Generalization

    Full text link
    Distance-based and generalization-based methods are two families of artificial intelligence techniques that have been successfully used over a wide range of real-world problems. In the first case, general algorithms can be applied to any data representation by just changing the distance. The metric space sets the search and learning space, which is generally instance-oriented. In the second case, models can be obtained for a given pattern language, which can be comprehensible. The generality-ordered space sets the search and learning space, which is generally model-oriented. However, the concepts of distance and generalization clash in many different ways, especially when knowledge representation is complex (e.g., structured data). This work establishes a framework where these two fields can be integrated in a consistent way. We introduce the concept of distance-based generalization, which connects all the generalized examples in such a way that all of them are reachable inside the generalization by using straight paths in the metric space. This makes the metric space and the generality-ordered space coherent (or even dual). Additionally, we also introduce a definition of minimal distance-based generalization that can be seen as the first formulation of the Minimum Description Length (MDL)/Minimum Message Length (MML) principle in terms of a distance function. We instantiate and develop the framework for the most common data representations and distances, where we show that consistent instances can be found for numerical data, nominal data, sets, lists, tuples, graphs, first-order atoms, and clauses. As a result, general learning methods that integrate the best from distance-based and generalization-based methods can be defined and adapted to any specific problem by appropriately choosing the distance, the pattern language and the generalization operator.
We would like to thank the anonymous reviewers for their insightful comments.
This work has been partially supported by the EU (FEDER) and the Spanish MICINN under grant TIN2010-21062-C02-02, the Spanish project "Agreement Technologies" (Consolider Ingenio CSD2007-00022), and the GVA project PROMETEO/2008/051.
Estruch Gregori, V.; Ferri Ramírez, C.; Hernández-Orallo, J.; Ramírez Quintana, M. J. (2012). Bridging the Gap between Distance and Generalization. Computational Intelligence. https://doi.org/10.1111/coin.12004
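For numerical data with the Euclidean metric, a generalization is distance-based in the sense above when every straight path between covered examples stays inside it; convex patterns such as axis-aligned hyperrectangles satisfy this. A small sketch with invented example points:

```python
import itertools

# Examples in R^2 to be generalized.
examples = [(1.0, 2.0), (3.0, 1.0), (2.0, 4.0)]

# A hyperrectangle pattern: the minimal axis-aligned box covering the examples.
lo = tuple(min(p[j] for p in examples) for j in range(2))
hi = tuple(max(p[j] for p in examples) for j in range(2))

def covers(p):
    return all(lo[j] <= p[j] <= hi[j] for j in range(2))

def on_segment(a, b, t):
    # Point at parameter t in [0, 1] on the straight path from a to b.
    return tuple(a[j] + t * (b[j] - a[j]) for j in range(2))

# Distance-based check: for the Euclidean metric, straight paths are line
# segments, so every sampled point between two covered examples must itself
# be covered. Boxes are convex, so this always holds.
distance_based = all(
    covers(on_segment(a, b, t))
    for a, b in itertools.combinations(examples, 2)
    for t in (0.25, 0.5, 0.75))
print(distance_based)  # True
```

Minimality then corresponds to choosing the smallest such pattern that still covers all the examples, which is what links the framework to the MDL/MML principle.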