Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide a common vocabulary, reusability, and machine-readable
content, and they enable semantic search, agent interaction, and the ordering
and structuring of knowledge for Semantic Web (Web 3.0) applications. A central
challenge in ontology engineering, however, is automatic learning: there is
still no fully automatic approach that builds an ontology from a text corpus
or a dataset of diverse topics using machine learning techniques. In this
paper, two topic modeling algorithms, LSI & SVD and Mr.LDA, are explored for
learning a topic ontology. The objective is to determine the statistical
relationships between documents and terms in order to build a topic ontology
and ontology graph with minimal human intervention. Experimental analysis of
building a topic ontology and semantically retrieving the topic ontology
corresponding to a user's query demonstrates the effectiveness of the proposed
approach.
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.

Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Fuzzy Approach Topic Discovery in Health and Medical Corpora
The majority of medical documents and electronic health records (EHRs) are in
text format, which poses a challenge for data processing and for finding
relevant documents. Automatically retrieving the enormous amount of health and
medical knowledge has always been an intriguing topic, and powerful methods
have been developed in recent years to automate text processing. One popular
approach to retrieving information by discovering the themes in health and
medical corpora is topic modeling; however, this approach still needs new
perspectives. In this research we describe fuzzy latent semantic analysis
(FLSA), a novel topic modeling approach based on a fuzzy perspective. FLSA can
handle the redundancy issue in health and medical corpora and provides a new
method to estimate the number of topics. Quantitative evaluations show that
FLSA offers superior performance and features compared with latent Dirichlet
allocation (LDA), the most popular topic model.

Comment: 12 Pages, International Journal of Fuzzy Systems, 201
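The LDA baseline that the abstract compares FLSA against is widely available;
a minimal sketch with scikit-learn is below. The toy clinical snippets and the
two-topic setting are assumptions for illustration, not the paper's corpus.

```python
# LDA baseline sketch: fit a two-topic model on a toy medical corpus and
# inspect the per-document topic proportions (a soft, "fuzzy-like"
# assignment of documents to themes).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "patient diagnosed with diabetes and hypertension",
    "blood pressure medication for hypertension",
    "mri scan shows tumor in brain tissue",
    "brain tumor treated with radiation therapy",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # per-document topic proportions

# Each row sums to 1, so a document can belong partly to several themes.
print(theta.round(2))
```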
A Probabilistic Embedding Clustering Method for Urban Structure Detection
Urban structure detection is a basic task in urban geography, and clustering
is a core technology for detecting patterns such as urban spatial structure
and urban functional regions. In the big data era, diverse urban sensing
datasets that record information such as human behaviour and social activity
suffer from high dimensionality and high noise, and unfortunately the
state-of-the-art clustering methods do not handle both problems concurrently.
In this paper, a probabilistic embedding clustering method is proposed. First,
we introduce a Probabilistic Embedding Model (PEM) that finds latent features
in high-dimensional urban sensing data by learning a probabilistic model. The
latent features capture the essential patterns hidden in the high-dimensional
data, while the probabilistic model reduces the uncertainty caused by high
noise. Second, by tuning its parameters, our model can discover two kinds of
urban structure, homophily and structural equivalence, that is, communities
with intensive interaction or communities playing the same roles in the urban
structure. Experiments on real-world data from Shanghai (China) confirm that
our method discovers both kinds of structure.

Comment: 6 pages, 7 figures, ICSDM201
On the Semantic Interpretability of Artificial Intelligence Models
Artificial Intelligence models are becoming increasingly more powerful and
accurate, supporting or even replacing humans' decision making. But with
increased power and accuracy also comes higher complexity, making it hard for
users to understand how the model works and what the reasons behind its
predictions are. Humans must explain and justify their decisions, and so do the
AI models supporting them in this process, making semantic interpretability an
emerging field of study. In this work, we look at interpretability from a
broader point of view, going beyond the machine learning scope and covering
different AI fields such as distributional semantics and fuzzy logic, among
others. We examine and classify the models according to their nature and also
based on how they introduce interpretability features, analyzing how each
approach affects the final users and pointing to gaps that still need to be
addressed to provide more human-centered interpretability solutions.

Comment: 17 pages, 4 figures. Submitted to AI Magazine on August, 201
Clustering and its Application in Requirements Engineering
Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately, many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process.
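The theme-clustering step the report surveys can be sketched with a standard
pipeline: vectorize requirement statements with TF-IDF and group them with
k-means. The four requirements and the choice of two clusters below are
invented for illustration.

```python
# Sketch of clustering stakeholder requirements into themes: TF-IDF
# vectors plus k-means. Cluster labels then serve as candidate "themes"
# for elicitation and traceability.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

requirements = [
    "The system shall encrypt all stored user data",
    "User passwords must be hashed before storage",
    "The UI shall display search results within two seconds",
    "Search results must load in under one second",
]

X = TfidfVectorizer().fit_transform(requirements)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for req, lab in zip(requirements, labels):
    print(lab, req)   # ideally a security theme and a performance theme
```

In practice the number of clusters is unknown in advance, which is one of the
nuances of requirements clustering the report highlights.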
Numeric Input Relations for Relational Learning with Applications to Community Structure Analysis
Most work in the area of statistical relational learning (SRL) is focused on
discrete data, even though a few approaches for hybrid SRL models have been
proposed that combine numerical and discrete variables. In this paper we
distinguish numerical random variables for which a probability distribution is
defined by the model from numerical input variables that are only used for
conditioning the distribution of discrete response variables. We show how
numerical input relations can very easily be used in the Relational Bayesian
Network framework, and that existing inference and learning methods need only
minor adjustments to be applied in this generalized setting. The resulting
framework provides natural relational extensions of classical probabilistic
models for categorical data. We demonstrate the usefulness of RBN models with
numeric input relations by several examples.
In particular, we use the augmented RBN framework to define probabilistic
models for multi-relational (social) networks in which the probability of a
link between two nodes depends on numeric latent feature vectors associated
with the nodes. A generic learning procedure can be used to obtain a
maximum-likelihood fit of model parameters and latent feature values for a
variety of models that can be expressed in the high-level RBN representation.
Specifically, we propose a model that allows us to interpret learned latent
feature values as community centrality degrees by which we can identify nodes
that are central for one community, that are hubs between communities, or that
are isolated nodes. In a multi-relational setting, the model also provides a
characterization of how different relations are associated with each community.
Machine Learning Techniques and Applications For Ground-based Image Analysis
Ground-based whole sky cameras have opened up new opportunities for
monitoring the earth's atmosphere. These cameras are an important complement to
satellite images by providing geoscientists with cheaper, faster, and more
localized data. The images captured by whole sky imagers can have high spatial
and temporal resolution, which is an important prerequisite for applications
such as solar energy modeling, cloud attenuation analysis, local weather
prediction, etc.
Extracting valuable information from the huge amount of image data by
detecting and analyzing the various entities in these images is challenging.
However, powerful machine learning techniques have become available to aid with
the image analysis. This article provides a detailed walk-through of recent
developments in these techniques and their applications in ground-based
imaging. We aim to bridge the gap between computer vision and remote sensing
with the help of illustrative examples. We demonstrate the advantages of using
machine learning techniques in ground-based image analysis via three primary
applications -- segmentation, classification, and denoising.
Linking Datasets on Organizations Using Half A Billion Open Collaborated Records
Scholars studying organizations often work with multiple datasets lacking
shared unique identifiers or covariates. In such situations, researchers may
turn to approximate string matching methods to combine datasets. String
matching, although useful, faces fundamental challenges. Even when two strings
appear similar to humans, fuzzy matching often does not work because it fails
to adapt to the informativeness of the character combinations presented. Worse,
many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and
"Federal National Mortgage Association"), a case where string matching has
little hope of succeeding. This paper introduces data from a prominent
employment-related networking site (LinkedIn) as a tool to address these
problems. We propose interconnected approaches to leveraging the massive amount
of information from LinkedIn regarding organizational name-to-name links. The
first approach builds a machine learning model for predicting matches from
character strings, treating the trillions of user-contributed organizational
name pairs as a training corpus: this approach constructs a string matching
metric that explicitly maximizes match probabilities. A second approach
identifies relationships between organization names using network
representations of the LinkedIn data. A third approach combines the first and
second. We document substantial improvements over fuzzy matching in
applications, making all methods accessible in open-source software
("LinkOrgs").
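The failure mode that motivates the paper is easy to reproduce with ordinary
character-level fuzzy matching. The sketch below uses only the standard
library's difflib and the abstract's own alias example; the "Fannie May"
counter-example is an added illustration, not from the paper.

```python
# Character-level fuzzy matching scores the true alias pair very low,
# while a superficially similar but wrong pair scores high -- exactly the
# problem the paper's LinkedIn-trained metric is designed to fix.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

alias = sim("Fannie Mae", "Federal National Mortgage Association")
near = sim("Fannie Mae", "Fannie May")   # a different entity entirely

print(f"true alias pair:  {alias:.2f}")   # low, despite being a match
print(f"false near-match: {near:.2f}")    # high, despite being wrong
```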
Scribe: A Clustering Approach To Semantic Information Retrieval
Information retrieval is the process of fulfilling a user's need for information by locating items in a data collection that are similar to a complex query that is often posed in natural language. Latent Semantic Indexing (LSI) was the predominant technique employed at the National Institute of Standards and Technology's Text Retrieval Conference for many years until limitations of its scalability to large data sets were discovered. This thesis describes SCRIBE, a modification of LSI with improved scalability. SCRIBE clusters its semantic index into discrete volumes described by high-dimensional extensions to computer graphics data structures. SCRIBE's clustering strategy limits the number of items that must be searched and provides for sub-linear time complexity in the number of documents. Experimental results with a large, natural language document collection demonstrate that SCRIBE achieves retrieval accuracy similar to LSI but requires 1/10 the time.
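The LSI retrieval step that SCRIBE builds on can be sketched briefly: project
documents and a query into a low-rank space and rank documents by cosine
similarity; SCRIBE's contribution is to partition that space so only one
cluster's documents need scoring. The three-document corpus and query below
are assumptions for illustration.

```python
# LSI-style retrieval sketch: low-rank projection plus cosine ranking.
# SCRIBE would restrict the similarity scan to a single index cluster
# instead of scoring every document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval with latent semantic indexing",
    "clustering documents for faster search",
    "graphics data structures for spatial queries",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
D = svd.fit_transform(X)                       # documents in LSI space

# Project the query into the same space and rank by cosine similarity.
q = svd.transform(vec.transform(["semantic indexing for retrieval"]))
scores = cosine_similarity(q, D)[0]
print(docs[scores.argmax()])
```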