MAINTENANCE OF DATA RICHNESS IN BUSINESS COMMUNICATION DATA
Business negotiations, whether face-to-face or electronic, are conducted through communication that enables the declaration of negotiation objectives, the active implementation of negotiation strategies to achieve pre-defined goals, and the declaration of a successful or unsuccessful end of the negotiation. Processing the exchanged textual communication enables the automatic transformation of unstructured data into processable structured datasets and, subsequently, the analysis of textual content without losing the data richness of the exchanged messages. For this purpose, the paper presents Text Mining-based pre-processing approaches and dimensionality reduction algorithms from Feature Extraction and Feature Selection in a research framework and evaluates them to counteract common dimensionality problems in text processing. In doing so, the maintenance of data richness in communication data is treated as the overall goal, in order to determine the dataset with minimal information loss. To this end, various pre-processed and transformed communication datasets derived from dimensionality reduction are used as input data for selected classification models, and the prediction performance regarding the final negotiation outcome is measured with ROC analysis. The central results of the ROC analysis show that quantified business communication generated by Optimized Selection delivers the best data based on Lovins’ stemming algorithm compared to stemming variations of Forward Selection and SVD.
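As a rough illustration of the ROC-based comparison described in this abstract, the sketch below computes the area under the ROC curve for two sets of classifier scores. The outcome labels and scores are hypothetical and are not taken from the paper; the pairwise formulation is one standard way to compute AUC, not necessarily the procedure the authors used.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = successful negotiation, 0 = unsuccessful.
outcomes = [1, 1, 0, 1, 0, 0]
# Hypothetical scores from classifiers trained on two reduced datasets.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_b = [0.4, 0.9, 0.8, 0.3, 0.7, 0.1]

auc_a = roc_auc(outcomes, scores_a)
auc_b = roc_auc(outcomes, scores_b)
print(f"AUC dataset A: {auc_a:.3f}, AUC dataset B: {auc_b:.3f}")
```

Under this measure, the dataset whose classifier scores yield the higher AUC would be the one that preserved more outcome-relevant information.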
Taming Wild High Dimensional Text Data with a Fuzzy Lash
The bag of words (BOW) represents a corpus in a matrix whose elements are the
frequency of words. However, each row in the matrix is a very high-dimensional
sparse vector. Dimension reduction (DR) is a popular method to address sparsity
and high-dimensionality issues. Among different strategies to develop DR
methods, Unsupervised Feature Transformation (UFT) is a popular strategy to map
all words on a new basis to represent BOW. The recent increase of text data and
its challenges imply that the DR area still needs new perspectives. Although a wide
range of methods based on the UFT strategy has been developed, the fuzzy
approach has not been considered for DR based on this strategy. This research
investigates the application of fuzzy clustering as a DR method based on the
UFT strategy to collapse the BOW matrix to provide a lower-dimensional
representation of documents instead of the words in a corpus. The quantitative
evaluation shows that fuzzy clustering produces superior performance and
features to Principal Components Analysis (PCA) and Singular Value
Decomposition (SVD), two popular DR methods based on the UFT strategy.
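The idea of collapsing a BOW matrix with fuzzy clustering can be sketched as follows. This is a minimal illustration using textbook fuzzy c-means, not the authors' implementation: the toy documents, the deterministic choice of initial centers, the fuzzifier m = 2 and the iteration count are all assumptions. Each document ends up represented by its membership degrees in c clusters instead of by one count per word.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(toks).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix


def fuzzy_c_means(rows, c=2, m=2.0, iters=20):
    """Return one membership vector of length c per row (each sums to 1)."""
    n, dim = len(rows), len(rows[0])
    # Assumption: deterministic start from evenly spaced rows, not random.
    centers = [[float(v) for v in rows[(k * n) // c]] for k in range(c)]
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iters):
        # Membership update: closer centers receive higher membership.
        u = []
        for x in rows:
            d = [math.dist(x, ck) for ck in centers]
            if min(d) == 0.0:
                zeros = [k for k, dk in enumerate(d) if dk == 0.0]
                u.append([1.0 / len(zeros) if k in zeros else 0.0
                          for k in range(c)])
            else:
                u.append([1.0 / sum((d[k] / d[j]) ** power for j in range(c))
                          for k in range(c)])
        # Center update: mean of the rows weighted by membership ** m.
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers[k] = [sum(w[i] * rows[i][j] for i in range(n)) / total
                          for j in range(dim)]
    return u


# Hypothetical toy corpus with two loose topics.
docs = [
    "price offer price contract",
    "offer contract price",
    "stemming tokens corpus",
    "corpus tokens stemming corpus",
]
vocab, bow = bag_of_words(docs)
memberships = fuzzy_c_means(bow, c=2)
for doc, row in zip(docs, memberships):
    print([round(v, 3) for v in row], doc)
```

Here the six-column BOW rows are reduced to two membership values per document; with a real corpus, c would be chosen much smaller than the vocabulary size.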
Sentiment Analysis using an ensemble of Feature Selection Algorithms
Sentiment analysis, an active research area in text mining, is commonly used to determine the opinion of people who use a service or buy a product. It is the process of computationally identifying and categorizing opinions expressed in a piece of text. Individuals post their opinions in reviews, tweets, comments or discussions, all of which are unstructured information. Sentiment analysis gives a general summary of reviews, which benefits clients, individuals or organizations in decision making. The primary aim of this paper is to apply an ensemble approach to feature reduction methods from natural language processing and to perform the analysis based on the results. An ensemble approach is a process of combining two or more methodologies. The feature reduction methods used are Principal Component Analysis (PCA) for feature extraction and the Pearson chi-squared statistical test for feature selection. The main contribution of this paper is to test experimentally whether the combined use of careful feature selection and existing classification methodologies can yield better accuracy.
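The feature selection half of such an ensemble can be sketched with the Pearson chi-squared statistic computed from a 2x2 contingency table of term presence against class. The reviews and labels below are hypothetical, the function names are illustrative, and the PCA step mentioned in the abstract is omitted here; this is a sketch under those assumptions rather than the paper's pipeline.

```python
def chi_squared(docs, labels, term):
    """Chi-squared score of a term against a binary label.

    a: term present, label 1    b: term present, label 0
    c: term absent,  label 1    d: term absent,  label 0
    """
    a = b = c = d = 0
    for tokens, y in zip(docs, labels):
        present = term in tokens
        if present and y == 1:
            a += 1
        elif present:
            b += 1
        elif y == 1:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A term that appears in every document (or none) carries no signal.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest chi-squared score (ties by name)."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: (-chi_squared(docs, labels, t), t))[:k]


# Hypothetical reviews; 1 = positive sentiment, 0 = negative sentiment.
reviews = ["great service fast", "great product",
           "poor service slow", "poor product"]
sentiment = [1, 1, 0, 0]
token_sets = [set(r.split()) for r in reviews]

selected = select_top_k(token_sets, sentiment, k=2)
print(selected)
```

Terms such as "service" and "product", which occur equally often in both classes, score zero and are dropped, while the sentiment-bearing words are kept.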
EEF: Exponentially Embedded Families with Class-Specific Features for Classification
In this letter, we present a novel exponentially embedded families (EEF)
based classification method, in which the probability density function (PDF) on
raw data is estimated from the PDF on features. With the PDF construction, we
show that class-specific features can be used in the proposed classification
method, instead of a common feature subset for all classes as used in
conventional approaches. We apply the proposed EEF classifier to text
categorization as a case study and derive an optimal Bayesian classification
rule with class-specific feature selection based on the Information Gain (IG)
score. The promising performance on real-life data sets demonstrates the
effectiveness of the proposed approach and indicates its wide potential
applications. Comment: 9 pages, 3 figures, to be published in IEEE Signal Processing Letter.
IEEE Signal Processing Letter, 201
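To make the notion of class-specific feature selection with the Information Gain score concrete, the sketch below ranks terms separately for each class. It assumes a one-vs-rest reading of "class-specific" (each class against all others), which is an assumption of this sketch, and it does not implement the EEF classifier itself. Documents, labels and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))


def top_terms_per_class(docs, labels, k=1):
    """Rank terms per class by the IG of the class-vs-rest problem."""
    vocab = sorted(set().union(*docs))
    selected = {}
    for cls in sorted(set(labels)):
        binary = [1 if y == cls else 0 for y in labels]
        ranked = sorted(
            vocab, key=lambda t: (-information_gain(docs, binary, t), t))
        selected[cls] = ranked[:k]
    return selected


# Hypothetical documents (as token sets) with three classes.
documents = [
    {"match", "goal", "team"}, {"team", "goal"},
    {"stock", "market", "team"}, {"stock", "price"},
    {"chip", "market"}, {"chip", "software"},
]
classes = ["sports", "sports", "finance", "finance", "tech", "tech"]

per_class = top_terms_per_class(documents, classes, k=1)
print(per_class)
```

Each class receives its own feature subset, in contrast to a single subset shared by all classes.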
Evaluation of linear classifiers on articles containing pharmacokinetic evidence of drug-drug interactions
Background. Drug-drug interaction (DDI) is a major cause of morbidity and
mortality. [...] Biomedical literature mining can aid DDI research by
extracting relevant DDI signals from either the published literature or large
clinical databases. However, though drug interaction is an ideal area for
translational research, the inclusion of literature mining methodologies in DDI
workflows is still very preliminary. One area that can benefit from literature
mining is the automatic identification of a large number of potential DDIs,
whose pharmacological mechanisms and clinical significance can then be studied
via in vitro pharmacology and in populo pharmaco-epidemiology. Experiments. We
implemented a set of classifiers for identifying published articles relevant to
experimental pharmacokinetic DDI evidence. These documents are important for
identifying causal mechanisms behind putative drug-drug interactions, an
important step in the extraction of large numbers of potential DDIs. We
evaluate the performance of several linear classifiers on PubMed abstracts, under
different feature transformation and dimensionality reduction methods. In
addition, we investigate the performance benefits of including various
publicly-available named entity recognition features, as well as a set of
internally-developed pharmacokinetic dictionaries. Results. We found that
several classifiers performed well in distinguishing relevant and irrelevant
abstracts. We found that the combination of unigram and bigram textual features
gave better performance than unigram features alone, and also that
normalization transforms that adjusted for feature frequency and document
length improved classification. For some classifiers, such as linear
discriminant analysis (LDA), proper dimensionality reduction had a large impact
on performance. Finally, the inclusion of NER features and dictionaries was
found not to help classification. Comment: Pacific Symposium on Biocomputing, 201
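The combination of unigram and bigram features with a length-adjusting normalization, as reported in the results above, can be illustrated as follows. The example sentence is hypothetical, and L2 normalization is only one possible stand-in for the unspecified normalization transforms in the abstract; frequency-based reweighting such as IDF is left out of this sketch.

```python
import math
from collections import Counter


def ngram_features(text, max_n=2):
    """Map a text to L2-normalized counts of its 1..max_n word n-grams."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Dividing by the vector length makes long and short abstracts comparable.
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0.0:
        return {}
    return {gram: v / norm for gram, v in counts.items()}


# Hypothetical sentence in the style of a pharmacokinetic abstract.
features = ngram_features("Drug A inhibits drug B")
for gram, weight in sorted(features.items()):
    print(f"{gram!r}: {weight:.3f}")
```

The resulting dictionaries could then be fed to any linear classifier that accepts sparse feature vectors.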