Categorical and Fuzzy Ensemble-Based Algorithms for Cluster Analysis
This dissertation focuses on improving multivariate methods of cluster analysis. Chapter 3 discusses methods for the categorical clustering of tertiary data, while Chapter 4 considers the clustering of quantitative data using ensemble algorithms. Lastly, Chapter 5 outlines future research plans for investigating the clustering of spatial binary data.
Cluster analysis is an unsupervised methodology whose results may be influenced by the types of variables recorded on observations. When clustering categorical data, the solutions produced may not accurately reflect the structure of the process that generated them. Increased variability within the latent structure of the data and the presence of noisy observations are two issues that may be obscured within the categories, and their presence may make clustering solutions for categorical data less accurate. To remedy this, Chapter 3 proposes a method that uses statistical smoothing to improve the accuracy of clustering solutions for tertiary data objects. By pre-smoothing the dissimilarities used in traditional clustering algorithms, we show it is possible to produce clustering solutions more reflective of the latent process from which the observations arose. To do this, the Fienberg-Holland estimator, a shrinkage-based statistical smoother, is used along with three choices of smoothing. We show via simulation and an application to diabetes data that the method results in more accurate clusters.
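The pre-smoothing idea can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure: the uniform-prior pseudo-Bayes form of the shrinkage, the smoothing weight `w`, and the L1 profile distance are all illustrative assumptions.

```python
import numpy as np

def fienberg_holland_smooth(counts, w=1.0):
    """Pseudo-Bayes shrinkage of raw category proportions toward the
    uniform distribution (Fienberg-Holland style).  The smoothing
    weight `w` here is an illustrative choice."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), counts.size
    p_hat = counts / n
    # Convex combination of the empirical profile and the uniform prior.
    return (n / (n + w)) * p_hat + (w / (n + w)) * (1.0 / k)

def smoothed_dissimilarity(counts_a, counts_b, w=1.0):
    """L1 distance between pre-smoothed category profiles; such
    pre-smoothed dissimilarities can then feed any standard
    dissimilarity-based clustering algorithm."""
    pa = fienberg_holland_smooth(counts_a, w)
    pb = fienberg_holland_smooth(counts_b, w)
    return float(np.abs(pa - pb).sum())
```

Shrinking each profile toward the uniform distribution damps the extra variability that noisy categorical observations inject into the raw dissimilarities, which is the mechanism the chapter exploits.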
Solutions produced by clustering algorithms may vary regardless of the types of variables observed. Such variation may be due to the clustering algorithm used, its initial starting point, or the type of algorithm used to produce the solutions. Furthermore, it may sometimes be of interest to produce clustering solutions that allow observations to share similarities with more than one cluster. One method proposed to combat these problems and add flexibility to clustering solutions is fuzzy ensemble-based clustering. In Chapter 4, three fuzzy ensemble-based clustering algorithms are introduced for the clustering of quantitative data objects and compared with the traditional Fuzzy C-Means algorithm. The ensembles proposed here differ from traditional ensemble-based clustering methods in that the solutions combined in the generation process come from supervised classifiers rather than from clustering algorithms. A simulation study and two data applications suggest that in certain settings the proposed fuzzy ensemble-based algorithms produce more accurate clusters than Fuzzy C-Means.
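The baseline the chapter compares against, standard Fuzzy C-Means, can be sketched in a few lines of numpy. This is a generic FCM implementation (fuzzifier m = 2 by default), not the proposed ensemble algorithms:

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Standard Fuzzy C-Means: alternate between weighted-centroid
    updates and the closed-form membership update, so every point
    carries a degree of membership in every cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                       # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)
    return centers, U
```

The membership matrix `U` is what makes the method "fuzzy": a point near a cluster boundary gets intermediate memberships rather than a hard label.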
In both of the aforementioned cases, only the types of variables recorded on each object were of importance in the clustering process. In Chapter 5, both the types of variables recorded and their spatial nature are of importance. An idea is presented that combines geodesics with categorical cluster analysis to deal with the spatial and categorical nature of observations. The focus of this chapter is on producing an accurate method of clustering the binary and spatial data objects found in the Global Terrorism Database.
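One way to make the geodesic-plus-categorical idea concrete is to blend a great-circle distance with a Hamming dissimilarity over the binary attributes. The blend weight `alpha`, the normalization scale, and the record layout below are hypothetical illustrations, not the dissertation's method:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle (geodesic) distance in kilometres via the
    haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatial_binary_dissim(a, b, alpha=0.5, scale_km=20000.0):
    """Convex blend of normalized geodesic distance and Hamming
    dissimilarity between two records of the form
    (lat, lon, binary_features)."""
    geo = haversine_km(a[0], a[1], b[0], b[1]) / scale_km
    ham = sum(x != y for x, y in zip(a[2], b[2])) / len(a[2])
    return alpha * geo + (1 - alpha) * ham
```

A pairwise matrix of such blended dissimilarities could then be handed to any dissimilarity-based clustering algorithm.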
Can biological quantum networks solve NP-hard problems?
There is a widespread view that the human brain is so complex that it cannot
be efficiently simulated by universal Turing machines. During the last decades
the question has therefore been raised whether we need to consider quantum
effects to explain the imagined cognitive power of a conscious mind.
This paper presents a personal view of several fields of philosophy and
computational neurobiology in an attempt to suggest a realistic picture of how
the brain might work as a basis for perception, consciousness and cognition.
The purpose is to be able to identify and evaluate instances where quantum
effects might play a significant role in cognitive processes.
Not surprisingly, the conclusion is that quantum-enhanced cognition and
intelligence are very unlikely to be found in biological brains. Quantum
effects may certainly influence the functionality of various components and
signalling pathways at the molecular level in the brain network, like ion
ports, synapses, sensors, and enzymes. This might evidently influence the
functionality of some nodes and perhaps even the overall intelligence of the
brain network, but hardly give it any dramatically enhanced functionality. So,
the conclusion is that biological quantum networks can only approximately solve
small instances of NP-hard problems.
On the other hand, artificial intelligence and machine learning implemented
in complex dynamical systems based on genuine quantum networks can certainly be
expected to show enhanced performance and quantum advantage compared with
classical networks. Nevertheless, even quantum networks can only be expected to
efficiently solve NP-hard problems approximately. In the end it is a question
of precision - Nature is approximate.
SVMAUD: Using textual information to predict the audience level of written works using support vector machines
Information retrieval systems should seek to match resources with the reading ability of the individual user; similarly, an author must choose vocabulary and sentence structures appropriate for his or her audience. Traditional readability formulas, including the popular Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score, rely on numerical representations of text characteristics, including syllable counts and sentence lengths, to suggest the audience level of resources. However, the author’s chosen vocabulary, sentence structure, and even the page formatting can alter the predicted audience level by several levels, especially in the case of digital library resources. For these reasons, the performance of readability formulas when predicting the audience level of digital library resources is very low.
Rather than relying on these inputs, machine learning methods, including cosine, Naïve Bayes, and Support Vector Machines (SVM), can suggest the grade level of an essay based on the vocabulary chosen by the author. The audience level prediction and essay grading problems share the same inputs (expert-labeled documents) and outputs (a numerical score representing quality or audience level). After a human expert labels a representative sample of resources with audience level, the proposed SVM-based audience level prediction program, SVMAUD, constructs a vocabulary for each audience level; then, the text in an unlabeled resource is compared with this predefined vocabulary to suggest the most appropriate audience level.
Two readability formulas and four machine learning programs are evaluated with respect to predicting human-expert entered audience levels based on the text contained in an unlabeled resource. In a collection containing 10,238 expert-labeled HTML-based digital library resources, the Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score predict the specific audience level with F-measures of 0.10 and 0.05, respectively. Conversely, cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD improve these F-measures to 0.57, 0.61, 0.68, and 0.78, respectively. When a term’s weight is adjusted based on the HTML tag in which it occurs, the specific audience level prediction performance of cosine, Naïve Bayes, the Collins-Thompson and Callan method, and SVMAUD improves to 0.68, 0.70, 0.75, and 0.84, respectively. When title, keyword, and abstract metadata is used for training, cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD specific audience level prediction F-measures are found to be 0.61, 0.68, 0.75, and 0.86, respectively. When cosine, Naïve Bayes, the Collins-Thompson and Callan method, and SVMAUD are trained and tested using resources from a single subject category, the specific audience level prediction F-measure performance improves to 0.63, 0.70, 0.77, and 0.87, respectively. SVMAUD experiences the highest audience level prediction performance among all methods under evaluation in this study. After SVMAUD is properly trained, it can be used to predict the audience level of any written work.
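The core training step, fitting a linear SVM on labeled bag-of-words vectors, can be sketched with the Pegasos subgradient method. This is a generic stand-in, not SVMAUD itself: the four-word vocabulary, the two audience levels, and the toy documents below are all hypothetical.

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, epochs=200):
    """Train a linear SVM (hinge loss, L2 regularization) with the
    Pegasos subgradient method.  Labels y must be +1 / -1."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in range(n):               # deterministic sweep over samples
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (X[i] @ w) < 1:    # margin violated: decay + push
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                        # margin satisfied: decay only
                w = (1 - eta * lam) * w
    return w
```

In SVMAUD's setting, each row of `X` would hold term counts over the per-audience-level vocabulary, with one-vs-rest classifiers covering the multiple levels.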
Improved integration of information to reduce subsurface model bias
Subsurface modeling deals with data-related issues like cognitive and sampling biases, and model-related challenges including statistical assumptions, misspecification, and algorithmic biases. These challenges introduce four critical implications during subsurface modeling. Firstly, subsurface sampling is subject to sampling bias, which compromises statistical representativeness. Secondly, analog selection methodologies rely on multivariate statistics and expert judgment that overlook spatial information and data dimensionality. Thirdly, subsurface inferential workflows that utilize dimensionality reduction seldom provide repeatable frameworks that maintain model stability and are invariant to Euclidean transformations. Lastly, deep learning methods for dimensionality reduction, characterized as black-box models, lack interpretability and robust evaluation metrics, increasing susceptibility to algorithmic bias. Consequently, neglecting these challenges in subsurface modeling could lead to erroneous predictions, inconsistent inferences, diminished model reliability, and suboptimal decision-making that impacts project economics.
This dissertation integrates information within subsurface models to reduce model bias and significantly improve their accuracy, robustness, and generalizability. First, I create spatial declustering methods to debias spatial datasets with single and multiscale preferential sampling in stationary populations. Second, I introduce a novel geostatistics-based machine learning method for identifying subsurface resource analogs that integrate spatial information in subsurface datasets with high dimensionality. Next, I efficiently combine machine learning and computational geometry methods to stabilize lower dimensional spaces for uncertainty quantification and interpretation. Finally, I create a methodology to assess, evaluate, and interpret the stability of deep learning latent feature spaces.
These novel methodologies demonstrate the importance of improved techniques for information integration in subsurface modeling and show better results than naïve methods. The result is objective sampling debiasing in stationary spatial populations with single or multiple data scales, improving statistical representativeness. The results also show better generalization and accurate identification of spatial analogs in high-dimensional datasets. Moreover, the methods yield Euclidean transformation-invariant lower-dimensional spaces, ensuring unique and repeatable solutions that improve model reliability and interpretability and enable rational comparisons. Finally, the results indicate that deep learning models for dimensionality reduction exhibit algorithmic biases and instabilities, including sample, structural, and inferential instability, affecting their reliability and interpretability. Together, these innovations reduce model bias and significantly improve subsurface modeling.
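Spatial declustering, the first of the contributions above, is commonly done by cell weighting: down-weight samples that share a grid cell so that preferentially sampled areas do not dominate summary statistics. The sketch below is the classic single-scale cell method, an assumed baseline rather than the dissertation's multiscale approach:

```python
import numpy as np
from collections import Counter

def cell_declustering_weights(x, y, cell_size=1.0):
    """Cell declustering: each sample is down-weighted by the number
    of samples sharing its grid cell, then weights are rescaled to
    sum to the sample count."""
    ix = np.floor(np.asarray(x) / cell_size).astype(int)
    iy = np.floor(np.asarray(y) / cell_size).astype(int)
    cells = list(zip(ix.tolist(), iy.tolist()))
    counts = Counter(cells)
    w = np.array([1.0 / counts[c] for c in cells])
    return w * len(cells) / w.sum()
```

A weighted mean computed with these weights approximates the mean of the underlying stationary population instead of the mean of the biased sample.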
Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics
A growing number of studies report interesting insights gained from existing data resources. Among those, there are analyses on textual data, giving reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions.
This thesis aims to shed some light on the potentials of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility to repurpose existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods.
After introducing strategies and methods that have already been used on language data, discussing how they can assist corpus linguistic analysis, and pointing to available toolkits and software as well as to state-of-the-art research and further references, the thesis applies the introduced methodological toolset in two differently shaped corpus studies that utilize readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings on student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field. While both studies yield linguistic insights that integrate into the current understanding of the investigated phenomena in German, they also systematically test the methodological toolset introduced beforehand, allowing a detailed closing discussion of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics.
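Interpreting a prediction model to answer linguistic questions, as the second study does, typically means fitting an inspectable model and reading its weights. A minimal sketch with gradient-descent logistic regression follows; the two features (an emoji rate and a word-length measure) and their planted relationship to writer age are purely hypothetical:

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression.  The learned
    weights are directly inspectable, which matters when the model
    doubles as an analysis instrument rather than a black box."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = np.clip(X @ w, -30, 30)          # guard against overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X.T @ (p - y) / len(y)
    return w
```

A large positive weight on a feature then reads as "this feature is associated with the predicted class", which is how prediction models turn into corpus-linguistic evidence.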
OCM 2015 - 2nd International Conference on Optical Characterization of Materials: March 18th - 19th, 2015, Karlsruhe, Germany
Each material has its own specific spectral signature, independent of whether it is food, plastic, or mineral. During the conference we will discuss new trends and developments in material characterization. You will also be informed about the latest highlights in identifying spectral footprints and their realization in industry.
Short Text Categorization using World Knowledge
The content of the World Wide Web is growing rapidly, and thus the amount of available online text data increases every day.
Today, many users contribute to this massive global network via online platforms by sharing information in the form of a short text. Such an immense amount of data covers subjects from all existing domains (e.g., sports, economy, or biology). Further, manually processing such data is beyond human capabilities. As a result, Natural Language Processing (NLP) tasks, which aim to automatically analyze and process natural-language documents, have gained significant attention. Among these tasks, due to its application in various domains, text categorization has become one of the most fundamental and crucial tasks.
However, standard text categorization models face major challenges when performing short text categorization, due to the unique characteristics of short texts: insufficient text length, sparsity, and ambiguity. In other words, conventional approaches perform poorly when applied directly to the short text categorization task. Furthermore, in the case of short text, standard feature extraction techniques such as bag-of-words suffer from limited contextual information. Hence, it is essential to enhance the text representations with an external knowledge source. Moreover, traditional models require a significant amount of manually labeled data, and obtaining labeled data is a costly and time-consuming task. Therefore, although recently proposed supervised methods, especially deep neural network approaches, have demonstrated notable performance, the requirement for labeled data remains the main bottleneck of these approaches.
In this thesis, we investigate the main research question of how to perform \textit{short text categorization} effectively \textit{without requiring any labeled data}, using knowledge bases as an external source. In this regard, novel short text categorization models, namely Knowledge-Based Short Text Categorization (KBSTC) and Weakly Supervised Short Text Categorization using World Knowledge (WESSTEC), are introduced and evaluated. The models do not require any hand-labeled data to perform short text categorization; instead, they leverage the semantic similarity between short texts and the predefined categories. To quantify this semantic similarity, low-dimensional representations of entities and categories are learned by exploiting a large knowledge base. To achieve that, a novel entity and category embedding model is also proposed in this thesis. Extensive experiments have been conducted to assess the performance of the proposed short text categorization models and the embedding model on several standard benchmark datasets.
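The label-free categorization step can be sketched as follows: embed the short text as the mean of its known entity vectors and pick the category with the highest cosine similarity. The two-dimensional toy vectors and category names below are illustrative stand-ins for embeddings learned from a real knowledge base:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def categorize(text_tokens, entity_vecs, category_vecs):
    """Dataless categorization: represent the short text as the mean
    of its recognized entity vectors, then choose the category whose
    embedding is most cosine-similar.  No labeled training data is
    needed; the embeddings carry all the supervision."""
    vecs = [entity_vecs[t] for t in text_tokens if t in entity_vecs]
    doc = np.mean(vecs, axis=0)
    return max(category_vecs, key=lambda c: cosine(doc, category_vecs[c]))
```

Because category assignment reduces to a nearest-embedding lookup, the quality of the entity and category embeddings, the thesis's other contribution, directly determines accuracy.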
Algorithms for Multiclass Classification and Regularized Regression
Multiclass classification and regularized regression problems are very common in modern statistical and machine learning applications. On the one hand, multiclass classification problems require the prediction of class labels: given observations of objects that belong to certain classes, can we predict to which class a new object belongs? On the other hand, the reg
Tomato Maturity Recognition with Convolutional Transformers
Tomatoes are a major crop worldwide, and accurately classifying their
maturity is important for many agricultural applications, such as harvesting,
grading, and quality control. In this paper, the authors propose a novel method
for tomato maturity classification using a convolutional transformer. The
convolutional transformer is a hybrid architecture that combines the strengths
of convolutional neural networks (CNNs) and transformers. Additionally, this
study introduces a new tomato dataset named KUTomaData, explicitly designed to
train deep-learning models for tomato segmentation and classification.
KUTomaData is a compilation of images sourced from a greenhouse in the UAE,
with approximately 700 images available for training and testing. The dataset
is prepared under various lighting conditions and viewing perspectives and
employs different mobile camera sensors, distinguishing it from existing
datasets. The contributions of this paper are threefold: Firstly, the authors
propose a novel method for tomato maturity classification using a modular
convolutional transformer. Secondly, the authors introduce a new tomato image
dataset that contains images of tomatoes at different maturity levels. Lastly,
the authors show that the convolutional transformer outperforms
state-of-the-art methods for tomato maturity classification. The effectiveness
of the proposed framework in handling cluttered and occluded tomato instances
was evaluated using two additional public datasets, Laboro Tomato and Rob2Pheno
Annotated Tomato, as benchmarks. The evaluation results across these three
datasets demonstrate the exceptional performance of our proposed framework,
surpassing the state-of-the-art by 58.14%, 65.42%, and 66.39% in terms of mean
average precision scores for KUTomaData, Laboro Tomato, and Rob2Pheno Annotated
Tomato, respectively.
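The hybrid architecture's two halves, a convolution-style patch embedding feeding a self-attention layer, can be sketched in plain numpy. This is a generic illustration of the CNN-plus-transformer pattern, not the paper's model; the patch size, embedding width, and random projection are assumptions.

```python
import numpy as np

def patch_embed(img, k=4, d=8, seed=0):
    """Convolution-style front end: split the image into
    non-overlapping k x k patches and project each linearly into a
    d-dimensional token (a random projection stands in for learned
    convolution weights)."""
    H, W = img.shape
    patches = (img.reshape(H // k, k, W // k, k)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, k * k))
    Wp = np.random.default_rng(seed).standard_normal((k * k, d)) * 0.1
    return patches @ Wp

def self_attention(X):
    """Transformer half: single-head scaled dot-product
    self-attention over the patch tokens (Q = K = V = X for
    brevity)."""
    d = X.shape[1]
    s = X @ X.T / np.sqrt(d)
    A = np.exp(s - s.max(axis=1, keepdims=True))   # row-wise softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ X
```

The convolutional embedding keeps local texture cues (useful for ripeness color gradients), while attention lets every patch condition on every other, which helps with the cluttered and occluded instances the paper evaluates.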