Extraction of Key Phrases from Biomedical Full Text with Supervised Learning Techniques
Key-phrase extraction plays a useful role in Information Systems (IS) research areas such as digital libraries. Short metadata such as key phrases can help searchers grasp the concepts of documents. This paper evaluates the effectiveness of different supervised learning techniques on biomedical full text: Naïve Bayes, linear regression, and SVM regression (SVMreg-1/2), all of which could be embedded in an IS for document search. We use these techniques to extract key phrases from PubMed documents and evaluate the resulting systems using the well-established holdout validation method. The contributions of the paper are a comparison among different classifier techniques and a comparison of performance between full text and abstracts. Our experiments show that SVMreg-1 performs best for key-phrase extraction from full text, while Naïve Bayes performs best on abstracts. These techniques should be considered for use in information-system search functionality. Additional research issues are also identified.
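The paper does not include an implementation; as a minimal, self-contained illustration of the supervised key-phrase setup, the sketch below trains a Bernoulli Naïve Bayes classifier on invented binary phrase features (appears-in-title, occurs more than once, at most two tokens). All feature choices and training rows are hypothetical, not taken from the paper.

```python
import math

# Hypothetical binary features per candidate phrase:
# (appears in title, occurs more than once, at most two tokens).
# Label 1 = key phrase. The rows are invented for illustration.
train = [
    ((1, 1, 1), 1),
    ((1, 0, 1), 1),
    ((0, 1, 0), 0),
    ((0, 0, 1), 0),
    ((0, 0, 0), 0),
]

def fit(data):
    # counts[c][i]: how many class-c examples have feature i set
    counts = {0: [0, 0, 0], 1: [0, 0, 0]}
    n = {0: 0, 1: 0}
    for x, y in data:
        n[y] += 1
        for i, v in enumerate(x):
            counts[y][i] += v
    return counts, n

def predict(x, counts, n):
    total = n[0] + n[1]
    best_c, best_lp = None, -math.inf
    for c in (0, 1):
        lp = math.log(n[c] / total)              # class prior
        for i, v in enumerate(x):
            p = (counts[c][i] + 1) / (n[c] + 2)  # Laplace-smoothed P(f_i=1|c)
            lp += math.log(p if v else 1.0 - p)
        if lp > best_lp:
            best_c, best_lp = c, lp
    return best_c

counts, n = fit(train)
print(predict((1, 1, 0), counts, n))  # → 1 (phrase in title, frequent)
```

A real system would extract such features from the PubMed full text or abstract and rank candidate phrases by posterior probability rather than hard-classifying them.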
Comprehensible and Robust Knowledge Discovery from Small Datasets
Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) comes in a well-comprehensible form. Decision trees and subgroup discovery methods deliver knowledge summaries in the form of hyperrectangles, which are considered easy to understand.
To demonstrate the importance of a comprehensible data summary, we study Decentralized Smart Grid Control, a new system that implements demand response in power grids without major changes to the infrastructure. The conventional analysis of this system carried out so far was limited to identical participants and therefore did not reflect reality sufficiently well. We run many simulations with different input values and apply decision trees to the resulting data. With the resulting comprehensible data summaries, we gained new insights into the behaviour of Decentralized Smart Grid Control.
Decision trees can describe the system behaviour for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but in finding regions that lead to particular outputs (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output. The data collection process, however, is often costly. Our main contribution is the improvement of subgroup discovery from datasets with few observations.
Subgroup discovery in simulated data is called scenario discovery. A commonly used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS works much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average.
With simulated data, one has perfect knowledge of the input distribution, a prerequisite of REDS. To make REDS applicable to real measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We experimentally evaluated the resulting method in combination with different data-generating methods, both for PRIM and for BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
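The two-stage idea behind REDS (train an intermediate model on few simulation runs, then generate a large artificial dataset for PRIM) can be sketched as follows. The simulation function, the 1-nearest-neighbour metamodel, and all constants here are illustrative stand-ins, not the thesis's actual choices; the peeling loop is a simplified PRIM-style procedure.

```python
import random

random.seed(0)

def simulate(x):
    # Stand-in for an expensive simulation model: the output is "interesting"
    # (1.0) inside a rectangular input region, else 0.0.
    return 1.0 if 0.6 <= x[0] <= 0.9 and x[1] <= 0.5 else 0.0

# Step 1: a small number of expensive simulation runs.
X = [(random.random(), random.random()) for _ in range(80)]
y = [simulate(x) for x in X]

# Step 2: intermediate statistical model (here a 1-nearest-neighbour stand-in).
def metamodel(q):
    nearest = min(range(len(X)),
                  key=lambda i: (X[i][0] - q[0])**2 + (X[i][1] - q[1])**2)
    return y[nearest]

# Step 3: cheaply generate a large dataset from the known input distribution.
XL = [(random.random(), random.random()) for _ in range(2000)]
pts = [(q, metamodel(q)) for q in XL]

# Step 4: PRIM-style peeling on the large dataset: repeatedly remove the
# 5% slice of the box whose removal most increases the mean output inside.
box = [[0.0, 1.0], [0.0, 1.0]]
for _ in range(25):
    best = None
    for d in range(2):
        for side in range(2):
            lo, hi = box[d]
            cut = lo + 0.05 * (hi - lo) if side == 0 else hi - 0.05 * (hi - lo)
            keep = [(q, v) for q, v in pts
                    if (q[d] >= cut if side == 0 else q[d] <= cut)]
            if keep:
                mean = sum(v for _, v in keep) / len(keep)
                if best is None or mean > best[0]:
                    best = (mean, d, side, cut, keep)
    _, d, side, cut, pts = best
    box[d][side] = cut
print(box)  # a box that should roughly enclose the interesting region
```

The point of the construction is that PRIM peels on the 2000 cheap metamodel-labelled points instead of requiring thousands of expensive simulation runs.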
The role of classifiers in feature selection: Number vs nature
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, Wrappers suffer from the fact that they only use a single classifier when selecting the features. The problem of using a single classifier is that each classifier is of a different nature and will have its own biases. This means that each classifier will select different feature subsets. To address this problem, this thesis aims to investigate the effects of using different classifiers for Wrapper feature selection. More specifically, it aims to investigate the effects of using different numbers of classifiers and classifiers of different nature.
This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method has the ability to combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features and visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of number of classifiers on feature selection results; (2) the effects of nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to assist in answering these three research questions.
The results from the investigation revealed that the number of classifiers and nature of classifiers greatly affect feature selection results. In terms of number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets. In terms of nature of classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding number of classifiers and nature of classifiers revealed that the former has more of an effect on feature selection than the latter.
The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT which integrates the use of multiple classifiers for feature selection and decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number of classifiers and nature of classifiers can truly affect the feature selection process. The results and suggestions based on the results can provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users.
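As a rough, hypothetical sketch of the multi-classifier wrapper idea (not the WDT method itself): two classifiers of different nature, 1-nearest-neighbour and nearest-centroid, evaluate each candidate feature by leave-one-out accuracy, and a feature is kept only when both classifiers agree it improves on the current subset. The dataset and the agreement rule are invented for this illustration.

```python
import random

random.seed(1)

# Toy dataset: features 0 and 1 are informative, feature 2 is pure noise.
def make_point(label):
    base = 0.0 if label == 0 else 1.0
    return ([base + random.uniform(-0.3, 0.3),
             base + random.uniform(-0.3, 0.3),
             random.uniform(0.0, 1.0)], label)

data = [make_point(i % 2) for i in range(40)]

def loo_nn(feats):
    # Leave-one-out accuracy of 1-NN restricted to the features in `feats`.
    hits = 0
    for i, (x, y) in enumerate(data):
        j = min((k for k in range(len(data)) if k != i),
                key=lambda k: sum((x[f] - data[k][0][f])**2 for f in feats))
        hits += data[j][1] == y
    return hits / len(data)

def loo_centroid(feats):
    # Leave-one-out accuracy of a nearest-centroid classifier on `feats`.
    hits = 0
    for i, (x, y) in enumerate(data):
        cents = {}
        for c in (0, 1):
            pts = [p for j, (p, l) in enumerate(data) if l == c and j != i]
            cents[c] = [sum(p[f] for p in pts) / len(pts) for f in feats]
        pred = min((0, 1), key=lambda c: sum((x[f] - m)**2
                                             for f, m in zip(feats, cents[c])))
        hits += pred == y
    return hits / len(data)

# Wrapper over two classifiers of different nature: a feature is added only
# if BOTH classifiers strictly improve over the current subset.
selected, base_nn, base_cent = [], 0.5, 0.5   # 0.5 = majority baseline
for f in range(3):
    cand = selected + [f]
    a_nn, a_cent = loo_nn(cand), loo_centroid(cand)
    if a_nn > base_nn and a_cent > base_cent:
        selected, base_nn, base_cent = cand, a_nn, a_cent
print(selected)
```

Requiring agreement between classifiers of different nature is one way to counter the single-classifier bias the abstract describes; WDT additionally visualises the selected features with decision trees, which this sketch omits.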
Recognition of Human Activities for Teaching Robots
For use within the programming-by-demonstration paradigm for robot programming, an approach to the classification and interpretation of human motions was developed. To this end, extended methods for observing motions were investigated, and a processing chain was developed that, drawing on background knowledge, maps motion sequences onto activity-dependent features and uses these to recognize activities.
Bridging the Gap between Distance and Generalization
Distance-based and generalization-based methods are two families of artificial intelligence techniques that have been successfully used over a wide range of real-world problems. In the first case, general algorithms can be applied to any data representation by just changing the distance. The metric space sets the search and learning space, which is generally instance-oriented. In the second case, models can be obtained for a given pattern language, which can be comprehensible. The generality-ordered space sets the search and learning space, which is generally model-oriented. However, the concepts of distance and generalization clash in many different ways, especially when knowledge representation is complex (e.g., structured data). This work establishes a framework where these two fields can be integrated in a consistent way. We introduce the concept of distance-based generalization, which connects all the generalized examples in such a way that all of them are reachable inside the generalization by using straight paths in the metric space. This makes the metric space and the generality-ordered space coherent (or even dual). Additionally, we also introduce a definition of minimal distance-based generalization that can be seen as the first formulation of the Minimum Description Length (MDL)/Minimum Message Length (MML) principle in terms of a distance function. We instantiate and develop the framework for the most common data representations and distances, where we show that consistent instances can be found for numerical data, nominal data, sets, lists, tuples, graphs, first-order atoms, and clauses. As a result, general learning methods that integrate the best from distance-based and generalization-based methods can be defined and adapted to any specific problem by appropriately choosing the distance, the pattern language and the generalization operator. We would like to thank the anonymous reviewers for their insightful comments.
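A tiny illustration of the central property for numerical data under the Euclidean distance (a sketch, not the paper's formal operators): an axis-aligned bounding box generalizing a set of examples is convex, so every straight path between covered examples stays inside the generalization, which is exactly the reachability condition the abstract describes.

```python
# Sketch: a bounding box as a distance-based generalization of numeric tuples.
def bounding_box(points):
    dims = len(points[0])
    return [(min(p[d] for p in points), max(p[d] for p in points))
            for d in range(dims)]

def contains(box, p):
    return all(lo <= x <= hi for (lo, hi), x in zip(box, p))

def on_segment(a, b, t):
    # Point at fraction t along the straight path from a to b.
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

examples = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
box = bounding_box(examples)
# Every sampled point on every straight path between examples is covered.
ok = all(contains(box, on_segment(a, b, t / 10))
         for a in examples for b in examples for t in range(11))
print(box, ok)  # → [(0.0, 2.0), (0.0, 3.0)] True
```

The minimality notion in the paper would additionally prefer, among all generalizations satisfying this path property, one with the shortest description relative to the distance, in the spirit of MDL/MML.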
This work has been partially supported by the EU (FEDER) and the Spanish MICINN under grant TIN2010-21062-C02-02, the Spanish project "Agreement Technologies" (Consolider Ingenio CSD2007-00022), and the GVA project PROMETEO/2008/051. Estruch Gregori, V.; Ferri Ramírez, C.; Hernández-Orallo, J.; Ramírez Quintana, M. J. (2012). Bridging the Gap between Distance and Generalization. Computational Intelligence. https://doi.org/10.1111/coin.12004