108 research outputs found
Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation
Web content quality estimation is crucial to various web content processing
applications. Our previous work achieved the best results on the ECML/PKDD
Discovery Challenge 2010 using Bagging + C4.5, which is a combination of
many point-wise ranking models. In this paper, we combine multiple pair-wise
bipartite ranking learners to solve the multi-partite ranking problem of
web quality estimation. In the encoding stage, we present the ternary encoding
and the binary coding, extending each rank value to (L is the number of
different rank values). For decoding, we discuss combining the ranking
results of multiple bipartite ranking models with predefined weighting and
adaptive weighting. The experiments on the ECML/PKDD 2010 Discovery Challenge
datasets show that \textit{binary coding} + \textit{predefined weighting}
yields the highest performance among all four combinations and, furthermore,
outperforms the best results reported in the ECML/PKDD 2010 Discovery
Challenge competition.
Comment: 17 pages, 8 figures, 2 tables
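The encode-then-combine idea in this abstract can be illustrated with a minimal sketch. It assumes a threshold-style binary coding in which a rank in 1..L becomes L - 1 bits (bit k answers "is the rank above k?"), and a decoding step that takes a weighted sum of the scores from the per-bit bipartite rankers. The abstract does not spell out the exact coding or the weights, so the function names `binary_encode` and `combine_scores` and the uniform default weights are hypothetical.

```python
def binary_encode(rank, num_ranks):
    """Encode a rank in 1..num_ranks as num_ranks - 1 bits.

    Bit k (k = 1..num_ranks - 1) is 1 when rank > k, so each bit defines
    one bipartite (two-class) ranking problem. This coding is an assumption.
    """
    return [1 if rank > k else 0 for k in range(1, num_ranks)]


def combine_scores(bipartite_scores, weights=None):
    """Decode by a weighted sum of the bipartite rankers' scores.

    With weights=None every ranker gets weight 1.0 (a stand-in for a
    predefined weighting; the paper's actual weights are not given here).
    """
    if weights is None:
        weights = [1.0] * len(bipartite_scores)
    return sum(w * s for w, s in zip(weights, bipartite_scores))


# Rank 3 out of 4 possible ranks: above thresholds 1 and 2, not above 3.
code = binary_encode(3, 4)

# Hypothetical scores from three bipartite rankers for one web page.
final_score = combine_scores([0.9, 0.6, 0.1])
```

Pages would then be ordered by `final_score`; an adaptive weighting would replace the fixed `weights` with values learned from data.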
An Email Attachment is Worth a Thousand Words, or Is It?
There is an extensive body of research on Social Network Analysis (SNA) based
on email archives. The network used in the analysis is generally extracted
either from the email communication captured in the From, To, Cc and Bcc
header fields or from the entities contained in the email message. In the latter
case, the entities could be, for instance, bags of words, URLs, names,
phone numbers, etc. They could also include the textual content of attachments, for
instance Microsoft Word documents, Excel spreadsheets, or Adobe PDFs. The nodes
in this network represent users and entities. The edges represent communication
between users and relations to the entities. We suggest taking a different
approach to network extraction and using attachments shared between users as
the edges. The motivation for this is two-fold. First, attachments are a
manifestation of "intimacy" that reflects the strength of the relation. Second, the
statistical analysis of private email archives that we collected and of the Enron
email corpus shows that attachments contribute on average around 80-90% of
the archive's disk-space usage, which means that most of the data is presently
ignored in the SNA of email archives. Consequently, we hypothesize that this
approach might provide more insight into the social structure of the email
archive. We extract the communication and shared-attachment networks from
the Enron email corpus. We further analyze degree, betweenness, closeness, and
eigenvector centrality measures in both networks and review the differences and
what can be learned from them. We use a nearest neighbor algorithm to generate
similarity groups for five Enron employees. The groups are consistent with
Enron's organizational chart, which validates our approach.
Comment: 12 pages, 4 figures, 7 tables, IML'17, Liverpool, U
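A shared-attachment network of the kind described above might be extracted roughly as follows. This is a sketch under stated assumptions: each message is a dictionary with hypothetical keys `sender`, `recipients` and `attachments` (the latter holding content digests), two users are linked when they took part in messages carrying the same attachment, and only degree centrality is computed. The paper's actual extraction pipeline is not given in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def shared_attachment_edges(messages):
    """Return {(user_a, user_b): shared_count} for users linked by attachments."""
    users_by_attachment = defaultdict(set)
    for msg in messages:
        participants = {msg["sender"], *msg["recipients"]}
        for digest in msg["attachments"]:
            users_by_attachment[digest].update(participants)

    edges = defaultdict(int)
    for users in users_by_attachment.values():
        # Every pair of users who saw the same attachment gets one more link.
        for a, b in combinations(sorted(users), 2):
            edges[(a, b)] += 1
    return dict(edges)


def degree_centrality(edges):
    """Normalized degree: number of neighbours divided by (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {user: 0.0 for user in neighbours}
    return {user: len(nbrs) / (n - 1) for user, nbrs in neighbours.items()}


# Toy archive; names and digests are made up for illustration.
messages = [
    {"sender": "alice", "recipients": ["bob"], "attachments": ["h1"]},
    {"sender": "bob", "recipients": ["carol"], "attachments": ["h1", "h2"]},
    {"sender": "dave", "recipients": ["alice"], "attachments": []},
]
edges = shared_attachment_edges(messages)
centrality = degree_centrality(edges)
```

In this toy archive `dave` exchanges mail but shares no attachment, so he appears in a header-based communication network and not in the shared-attachment one, which is the kind of difference the two networks can expose.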
Visual Scene Understanding by Deep Fisher Discriminant Learning
Modern deep learning has recently revolutionized
several fields of classic machine learning and computer vision,
such as scene understanding, natural language processing and
machine translation. The substitution of feature hand-crafting
with automatic feature learning provides an excellent
opportunity for gaining an in-depth understanding of large-scale
data statistics. Deep neural networks generally train models with
huge numbers of parameters, facilitating efficient search for
optimal and sub-optimal spaces of highly non-convex objective
functions. On the other hand, Fisher discriminant analysis has
been widely employed to impose class discrepancy, for the sake of
segmentation, classification, and recognition tasks. This thesis
bridges contemporary deep learning and classic
discriminant analysis, to address some important challenges
in visual scene understanding, i.e. semantic segmentation,
texture classification, and object recognition. The aim is to
accomplish specific tasks in some new high-dimensional spaces,
covered by the statistical information of the datasets under
study. Inspired by a new formulation of Fisher discriminant
analysis, this thesis introduces some novel arrangements of
well-known deep learning architectures, to achieve better
performances on the targeted missions. The theoretical
justifications are based upon a large body of experimental work,
and consolidate the contribution of the proposed idea, Deep
Fisher Discriminant Learning, to several challenges in visual
scene understanding.
RANDOM WALK APPLIED TO HETEROGENOUS DRUG-TARGET NETWORKS FOR PREDICTING BIOLOGICAL OUTCOMES
Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016
Prediction of unknown drug-target interactions from bioassay data is critical not only for understanding these interactions but also for developing new drugs and repurposing old ones. Conventional methods for predicting such interactions can be divided into 2D-based and 3D-based methods. 3D methods are more CPU-intensive and require more manual interpretation, whereas 2D methods, such as machine learning and similarity search over chemical fingerprints, are fast. One problem with using traditional machine learning to predict drug-target pairs is that it requires labeled true and false interactions. A major problem of supervised learning methods is the selection of negative samples: unknown drug-target interactions are regarded as false interactions, which may influence the predictive accuracy of the model. Network-based methods have become an effective tool for predicting drug-target interactions while avoiding this negative sampling problem. In this dissertation, I will describe traditional machine learning methods and 3D methods of pharmacophore modeling for drug-target prediction and show how these methods work in a drug discovery scenario. I will then introduce a new framework for drug-target prediction based on bipartite networks of drug-target relations, known as Random Walk with Restart (RWR). RWR integrates various networks, including drug-drug similarity networks, protein-protein similarity networks and drug-target interaction networks, into a heterogeneous network that is capable of predicting novel drug-target relations. I will describe how chemical features for measuring drug-drug similarity do not affect performance in predicting interactions and further show the performance of RWR using an external dataset from the ChEMBL database.
I will describe further implementations of the RWR approach in multilayered networks consisting of biological data such as diseases, tissue-based gene expression data, protein complexes and metabolic pathways, to predict associations between human diseases and metabolic pathways, which are crucial in drug discovery. I have further developed a software package, netpredictor, in R (standalone and web) for unipartite and bipartite networks and implemented network-based predictive algorithms and network properties for drug-target prediction. This package will be described.
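For readers unfamiliar with Random Walk with Restart, the standard iteration is p_next = (1 - r) * W * p + r * p0, where W is the column-normalized adjacency matrix, p0 is the seed distribution and r is the restart probability. The sketch below applies it to a tiny made-up network of two drugs and two targets folded into a single adjacency matrix. The edge weights, the restart value 0.3 and the function names are illustrative assumptions; this is not the netpredictor implementation.

```python
def column_normalize(adj):
    """Scale each column to sum to 1 so it holds transition probabilities."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p = (1 - restart) * W p + restart * p0 until p stops changing."""
    n = len(adj)
    w = column_normalize(adj)
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Nodes 0, 1 are drugs and nodes 2, 3 are targets (all weights are made up).
adj = [
    [0.0, 0.8, 1.0, 0.0],  # drug 0: similar to drug 1, interacts with target 2
    [0.8, 0.0, 0.0, 1.0],  # drug 1: similar to drug 0, interacts with target 3
    [1.0, 0.0, 0.0, 0.5],  # target 2: similar to target 3
    [0.0, 1.0, 0.5, 0.0],  # target 3
]
scores = random_walk_with_restart(adj, seeds=[0])
```

Target 3 has no direct edge to drug 0, yet it receives a non-zero score through the similarity edges; ranking unlinked targets by such scores is how a walk of this kind suggests candidate interactions without needing explicit negative samples.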
Topic Distiller: distilling semantic topics from documents
Abstract. This thesis details the design and implementation of a system that can find relevant and latent semantic topics from textual documents. The design of this system, named Topic Distiller, is inspired by research conducted on automatic keyphrase extraction and automatic topic labeling, and it employs entity linking and knowledge bases to reduce text documents to their semantic topics.
The Topic Distiller is evaluated using methods and datasets used in information retrieval and automatic keyphrase extraction. On top of the common datasets used in the literature, three additional datasets are created to evaluate the system.
The evaluation reveals that the Topic Distiller is able to find relevant and latent topics from textual documents, beating the state-of-the-art automatic keyphrase methods in performance when used on news articles and social media posts.
Distilling semantic topics from documents. Abstract (translated from Finnish). This master's thesis examines a system that can find relevant and latent semantic topics in text, as well as the design and implementation of that system. The design of this Topic Distiller system draws inspiration from research on automatic term recognition and automatic topic labeling, and uses automatic semantic annotation and knowledge bases to find the topics of a text.
The performance of the Topic Distiller system is measured using evaluation methods and datasets for automatic term recognition that are widely used in the literature. In addition to these common datasets, we present three new datasets created for evaluating the Topic Distiller system.
The evaluation shows that Topic Distiller is able to find relevant and latent topics in text. It outperforms the most recent automatic term recognition methods in the literature when used to analyze news articles and social media posts.
A computational approach to the art of visual storytelling
For millennia, humanity has been using images to tell stories. In modern society, these
visual narratives take the center stage in many different contexts, from illustrated children’s
books to news media and comic books. They leverage the power of compounding
various images in sequence to present compelling and informative narratives, in an immediate
and impactful manner. In order to create them, many criteria are taken into account,
from the quality of the individual images to how they synergize with one another.
With the rise of the Internet, visual content with which to create these visual storylines
is now in abundance. In areas such as news media, where visual storylines are regularly
used to depict news stories, this has both advantages and disadvantages. Although content
might be available online to create a visual storyline, filtering the massive amounts
of existing images for high-quality, relevant ones is a hard and time-consuming task. Furthermore,
combining these images into visually and semantically cohesive narratives is a
highly skillful process and one that takes time.
As a first step to help solve this problem, this thesis brings state-of-the-art computational
methodologies to the age-old tradition of creating visual storylines. Leveraging
these methodologies, we define a three-part architecture to help with the creation of visual
storylines in the context of news media, using social media content. To ensure the
quality of the storylines from a human perception point of view, we deploy methods for
filtering and ranking images according to news quality standards, we resort to multimedia
retrieval techniques to find relevant content, and we propose a machine learning based
approach to organize visual content into cohesive and appealing visual narratives.