20,354 research outputs found
Chi-square-based scoring function for categorization of MEDLINE citations
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood of MEDLINE citations containing genetic relevant topic. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between two
corpora applying chi-square test. A MeSH descriptor was considered to be a
positive indicator if its relative observed frequency in the genetic domain
corpus was greater than its relative observed frequency in the nongenetic
domain corpus. The output of the proposed method is a list of scores for all
the citations, with the highest score given to those citations containing MeSH
descriptors typical for the genetic domain. Results: Validation was done on a
set of 734 manually annotated MEDLINE citations. It achieved predictive
accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method
by comparing it to three machine learning algorithms (support vector machines,
decision trees, na\"ive Bayes). Although the differences were not statistically
significantly different, results showed that our chi-square scoring performs as
good as compared machine learning algorithms. Conclusions: We suggest that the
chi-square scoring is an effective solution to help categorize MEDLINE
citations. The algorithm is implemented in the BITOLA literature-based
discovery support system as a preprocessor for gene symbol disambiguation
process.Comment: 34 pages, 2 figure
Coherent Keyphrase Extraction via Web Mining
Keyphrases are useful for a variety of purposes, including summarizing,
indexing, labeling, categorizing, clustering, highlighting, browsing, and
searching. The task of automatic keyphrase extraction is to select keyphrases
from within the text of a given document. Automatic keyphrase extraction makes
it feasible to generate keyphrases for the huge number of documents that do not
have manually assigned keyphrases. A limitation of previous keyphrase
extraction algorithms is that the selected keyphrases are occasionally
incoherent. That is, the majority of the output keyphrases may fit together
well, but there may be a minority that appear to be outliers, with no clear
semantic relation to the majority or to each other. This paper presents
enhancements to the Kea keyphrase extraction algorithm that are designed to
increase the coherence of the extracted keyphrases. The approach is to use the
degree of statistical association among candidate keyphrases as evidence that
they may be semantically related. The statistical association is measured using
web mining. Experiments demonstrate that the enhancements improve the quality
of the extracted keyphrases. Furthermore, the enhancements are not
domain-specific: the algorithm generalizes well when it is trained on one
domain (computer science documents) and tested on another (physics documents).Comment: 6 pages, related work available at http://purl.org/peter.turney
Towards the Automatic Classification of Documents in User-generated Classifications
There is a huge amount of information scattered on the World Wide Web. As the information flow occurs at a high speed in the WWW, there is a need to organize it in the right manner so that a user can access it very easily. Previously the organization of information was generally done manually, by matching the document contents to some pre-defined categories. There are two approaches for this text-based categorization: manual and automatic. In the manual approach, a human expert performs the classification task, and in the second case supervised classifiers are used to automatically classify resources. In a supervised classification, manual interaction is required to create some training data before the automatic classification task takes place. In our new approach, we intend to propose automatic classification of documents through semantic keywords and building the formulas generation by these keywords. Thus we can reduce this human participation by combining the knowledge of a given classification and the knowledge extracted from the data. The main focus of this PhD thesis, supervised by Prof. Fausto Giunchiglia, is the automatic classification of documents into user-generated classifications. The key benefits foreseen from this automatic document classification is not only related to search engines, but also to many other fields like, document organization, text filtering, semantic index managing
- …