Infinite Author Topic Model based on Mixed Gamma-Negative Binomial Process.

Author: Lu J
Luo X
Xu RYD
Xuan J
Zhang G
Publication date: 01/01/2015

Incorporating the side information of a text corpus, i.e., authors, time stamps, and emotional tags, into traditional text mining models has gained significant interest in the areas of information retrieval, statistical natural language processing, and machine learning. One branch of this work is the so-called Author Topic Model (ATM), which incorporates the authors' interests as side information into the classical topic model. However, the existing ATM needs the number of topics to be predefined, which is difficult and inappropriate in many real-world settings. In this paper, we propose an Infinite Author Topic (IAT) model to resolve this issue. Instead of assigning a discrete probability on a fixed number of topics, we use a stochastic process to determine the number of topics from the data itself. To be specific, we extend a gamma-negative binomial process to three levels in order to capture the author-document-keyword hierarchical structure. Furthermore, each document is assigned a mixed gamma process that accounts for the contributions of multiple authors to the document. An efficient Gibbs sampling inference algorithm, with each conditional distribution in closed form, is developed for the IAT model. Experiments on several real-world datasets show the capability of our IAT model to learn the hidden topics, the authors' interests in these topics, and the number of topics simultaneously.
Comment: 10 pages, 5 figures, submitted to KDD conference

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney
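
The gamma-negative binomial construction named in the abstract above rests on a standard identity: a Poisson count whose rate is gamma distributed is marginally negative binomial. The block below writes out that single-level identity only. The symbols r, p, λ and n are notation chosen here for illustration, not taken from the paper, and the three-level hierarchy and mixed gamma process of the IAT model are not reproduced.

```latex
% Assumed notation: shape r > 0, probability 0 < p < 1, rate \lambda, count n.
\lambda \sim \mathrm{Gamma}\!\left(r,\ \tfrac{p}{1-p}\right) \ \text{(shape, scale)},
\qquad
n \mid \lambda \sim \mathrm{Poisson}(\lambda)

% Integrating out \lambda gives the negative binomial marginal:
P(n) = \int_0^\infty \frac{\lambda^{n} e^{-\lambda}}{n!}\,
       \frac{\lambda^{r-1} e^{-\lambda (1-p)/p}}{\Gamma(r)\,\bigl(\tfrac{p}{1-p}\bigr)^{r}}\, d\lambda
     = \frac{\Gamma(n+r)}{n!\,\Gamma(r)}\; p^{n} (1-p)^{r},
\qquad n = 0, 1, 2, \dots
```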

GarlicESTdb: an online database and mining tool for garlic EST sequences

Author: Chae Sung-Hwa
Choi Sang-Haeng
Jung Tae-Sung
Kim Aeri
Kim Dae-Won
Kim Dong-Wook
Kim Ryong Nam
Kwon Hyuk-Ryul
Nam Seong-Hyeuk
Park Hong-Seog
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Allium sativum, commonly known as garlic, is a species in the onion genus (Allium), a large and diverse genus containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary and medicinal purposes and for its health benefits. Interest in garlic is currently increasing because of its nutritional and pharmaceutical value in conditions including high blood pressure, high cholesterol, atherosclerosis and cancer. Nevertheless, no comprehensive database of garlic Expressed Sequence Tags (ESTs) is available for gene discovery and future genome annotation efforts. We therefore developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression.

Description: GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in Java and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information in MySQL databases, and graphic display of all processed data. A web application was implemented with J2EE (Java 2 Platform Enterprise Edition) technology (JSP/EJB/Java Servlet) for browsing and querying the database, for creating dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways; the AJAX framework was also used in part. The online resources, such as putative annotation, single nucleotide polymorphism (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The GarlicESTdb web application is freely available at http://garlicdb.kribb.re.kr.

Conclusion: GarlicESTdb is the first integrated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary statistics of functional annotation. Consequently, the development of GarlicESTdb will make an important contribution to data mining and more efficient experimental studies by biologists.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Drug prescription support in dental clinics through drug corpus mining

Author: Goh Wee Pheng
Tao Xiaohui
Xie Haoran
Yong Jianming
Zhang Ji
Zhang Wenping
Publication venue: Springer Science and Business Media LLC
Publication date: 01/12/2018

The rapid increase in the volume and variety of data poses a challenge to safe drug prescription for the dentist. The increasing number of patients who take multiple drugs puts further pressure on the dentist to make the right decision at the point of care. Hence, a robust decision support system will enable dentists to make decisions on drug prescription quickly and accurately. Based on the assumption that similar drug pairs have a higher similarity ratio, this paper suggests an innovative approach to obtain the similarity ratio between the drug that the dentist is going to prescribe and the drug that the patient is currently taking. We conducted experiments to obtain the similarity ratios of both positive and negative drug pairs, using feature vectors generated from term similarities and word embeddings of a biomedical text corpus. This model can be easily adapted and implemented for use in a dental clinic to assist the dentist in deciding whether a drug is suitable for prescription, taking into consideration the medical profile of the patient. Experimental evaluation of our model's association of the similarity ratio between two drugs yielded an F score of 89%. Hence, such an approach, when integrated within the clinical workflow, will reduce prescription errors and thereby improve the health outcomes of patients.

University of Southern Queensland ePrints
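
The abstract above compares a drug to be prescribed with a drug the patient already takes by computing a similarity ratio over feature vectors. As a rough illustration only, the sketch below uses cosine similarity with a fixed threshold. The abstract does not state which similarity measure or threshold the authors use, so `cosine_similarity`, `is_similar_pair`, the threshold value and the example vectors are all assumptions made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar_pair(vec_a, vec_b, threshold=0.8):
    """Flag a drug pair as similar when the ratio reaches a hypothetical threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold


# Made-up feature vectors standing in for embeddings of two drug descriptions.
prescribed = [0.9, 0.1, 0.3]
current = [0.8, 0.2, 0.4]
print(is_similar_pair(prescribed, current))
```

In practice the vectors would come from the term-similarity and word-embedding features the abstract mentions, and the threshold would be tuned on labelled positive and negative drug pairs.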

Drug prescription support in dental clinics through drug corpus mining

Author: Goh Wee Pheng
Tao Xiaohui
Zhang Ji
Yong Jianming
Zhang Wenping
Xie Haoran
Publication venue: Springer
Publication date: 01/01/2018

Crossref

UND Scholarly Commons (University of North Dakota)

University of Southern Queensland ePrints

Assessing Crash Risks on Curves

Author: Chen Samantha
Krishnaswamy Shonali
Loke Seng
Rakotonirainy Andry
Sheehan Mary
Publication venue: Able Video & Multimedia Pty Ltd
Publication date: 01/01/2006

In Queensland, curve-related crashes contributed to 63.44% of fatalities, and 25.17% required hospitalisation. In addition, 51.1% of run-off-road crashes occurred on obscured or open-view road curves (Queensland Transport, 2006). This paper presents a conceptual framework for an in-vehicle system that assesses crash risk when a driver is manoeuvring on a curve. Our approach uses Intelligent Transport Systems (ITS) to collect information about the driving context, that is, information about the environment, driver, and vehicle gathered from sensor technology. Sensors are useful for detecting high-risk situations such as curves, fog, driver fatigue or slippery roads. However, sensors can be unreliable, and the information gathered from them can therefore be incomplete or inaccurate. To improve accuracy, a system is built to fuse past and current driving information. The integrated information is analysed using ubiquitous data mining techniques, and the results are then used in a Coupled Hidden Markov Model (CHMM) to learn and classify the information into different risk categories. The CHMM is used to predict the probability of a crash on curves. Based on the risk assessment, our system provides an appropriate intervention to the driver. This approach could give the driver sufficient time to react. Hence, it could potentially promote safe driving and decrease curve-related injuries and fatalities.

Queensland University of Technology ePrints Archive
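
The abstract above classifies fused sensor information into risk categories with a Coupled Hidden Markov Model. The sketch below is a deliberate simplification: it runs forward filtering on a single, uncoupled HMM to show how a belief over hidden risk states can be updated from a sequence of observations. The state names, observation symbols and all probabilities are invented for illustration and are not taken from the paper.

```python
def forward_filter(observations, start, trans, emit):
    """Return P(state | all observations so far) for a plain discrete HMM."""
    states = list(start)
    # Initial belief: prior times the likelihood of the first observation.
    belief = {s: start[s] * emit[s][observations[0]] for s in states}
    total = sum(belief.values())
    belief = {s: v / total for s, v in belief.items()}
    for obs in observations[1:]:
        # Predict the next state, then weight by the new observation.
        predicted = {s: sum(belief[p] * trans[p][s] for p in states) for s in states}
        belief = {s: predicted[s] * emit[s][obs] for s in states}
        total = sum(belief.values())
        belief = {s: v / total for s, v in belief.items()}
    return belief


# Hypothetical parameters: two risk states and two observable driving patterns.
START = {"low_risk": 0.8, "high_risk": 0.2}
TRANS = {
    "low_risk": {"low_risk": 0.9, "high_risk": 0.1},
    "high_risk": {"low_risk": 0.3, "high_risk": 0.7},
}
EMIT = {
    "low_risk": {"steady": 0.9, "swerving": 0.1},
    "high_risk": {"steady": 0.3, "swerving": 0.7},
}

print(forward_filter(["steady", "swerving", "swerving"], START, TRANS, EMIT))
```

A coupled model would run several such chains (for example driver, vehicle and environment) whose transitions depend on each other's states; that coupling is omitted here.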

A pattern mining approach for information filtering systems

Author: Algarni Abdulmohsen
Li Yuefeng
Xu Yue
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

It is a big challenge to clearly identify the boundary between positive and negative streams in information filtering systems. Several attempts have used negative feedback to address this challenge; however, using negative relevance feedback to improve the effectiveness of information filtering raises two issues. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern-mining-based approach that selects some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on the RCV1 data collection, and substantial experiments show that the proposed approach achieves encouraging performance that is also consistent for adaptive filtering.

Queensland University of Technology ePrints Archive

The Ideal Candidate. Analysis of Professional Competences through Text Mining of Job Offers

Author: E. DI MEGLIO
GRASSIA MARIA GABRIELLA
M. MISURACA
Publication venue: place:HEIDELBERG
Publication date: 01/01/2007

The aim of this paper is to propose analytical tools for identifying distinctive aspects of the job market for graduates. We propose a strategy for dealing with data that have different sources and natures.

Archivio della ricerca - Università degli studi di Napoli Federico II

Adaptive text mining: Inferring structure from sequences

Author: Witten Ian H.
Publication venue: Elsevier BV
Publication date: 01/01/2004

Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.

Research Commons@Waikato
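
The abstract above lists text categorization among the tasks that compression can address. One way to see the idea is to assign a document to the category whose training text lets it be encoded in the fewest additional bytes. The sketch below uses Python's `zlib` (DEFLATE) as a stand-in compressor; the abstract does not name a compressor, and the function names and tiny example corpora are assumptions made here.

```python
import zlib


def compressed_size(text):
    """Number of bytes zlib needs for the UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, document):
    """Additional compressed bytes needed when document is appended to corpus."""
    return compressed_size(corpus + " " + document) - compressed_size(corpus)


def classify(document, corpora):
    """Pick the category whose corpus makes the document cheapest to encode."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], document))


# Tiny made-up training corpora, one string per category.
CORPORA = {
    "sports": "the team won the match after scoring a late goal. "
              "the striker scored twice and the keeper saved a penalty.",
    "cooking": "chop the onion and garlic, then simmer the sauce. "
               "season the soup with salt and fresh basil.",
}

print(classify("the keeper saved a penalty and the team won the match", CORPORA))
```

With realistic data each corpus would be far larger, and a stronger adaptive model than DEFLATE would usually separate categories more reliably.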