Data Mining Revision Controlled Document History Metadata for Automatic Classification
Version-controlled documents provide a complete history of the changes to a document, from what was changed to who made the change, and much more. Using cluster analysis and several sets of manipulated data, this research examines the revision history of Wikipedia in an attempt to find language-independent patterns that could assist automatic page classification software. Applying cluster analysis to two sample data sets, we found no conclusive evidence that such patterns exist. Our work on the software, however, provides a foundation for additional types of data manipulation and refined clustering algorithms to be used in further research into finding such patterns.
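As a rough illustration of what cluster analysis over revision metadata can look like, the sketch below runs a plain k-means on a few pages. It is not the authors' software: the three features (revision count, distinct editors, mean bytes changed), the sample values, and the function name `kmeans` are all hypothetical and were chosen only because they do not depend on the page's language.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids,
    so the result is deterministic for a given input order."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical per-page features, pre-scaled to [0, 1]:
# (revision count, distinct editors, mean bytes changed per revision)
pages = [
    (0.10, 0.05, 0.20),
    (0.90, 0.80, 0.70),
    (0.12, 0.07, 0.25),
    (0.85, 0.90, 0.65),
]

labels, centroids = kmeans(pages, k=2)
print(labels)
```

Whether clusters found this way line up with meaningful page categories is exactly the open question the abstract reports on; the sketch only shows the mechanics.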
A Review on Web Page Classification
With the growing number of digital documents on the World Wide Web, including the webpages and blogs that commonly provide users with news about current events, aggregating and categorizing information from these sources is a daunting task, as the volume of digital documents available online is growing exponentially. Several benefits can nonetheless accrue from accurately classifying such documents into their respective categories, such as tools that help people find, filter and analyze digital information on the web. Accurate classification depends on the quality of the training dataset, which in turn depends on the preprocessing techniques. Existing literature on web page classification has identified that better document representation techniques reduce training and testing time and improve the accuracy, precision and recall of the classifier. In this paper, we give an overview of web page classification with an in-depth study of the web classification process, while raising awareness of the need for an adequate document representation technique, as this helps capture the semantics of a document and also helps reduce the problem of high dimensionality.
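To make the link between document representation and dimensionality concrete, here is a minimal sketch of one common representation, TF-IDF with rare-term pruning. The abstract does not name a specific technique, so TF-IDF, the `min_df` threshold, the helper names, and the sample documents are all assumptions used for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs, min_df=2):
    """Return (vocab, vectors); terms in fewer than min_df documents are dropped."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Pruning rare terms is what shrinks the feature space.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) for t in vocab
        ])
    return vocab, vectors

# Hypothetical page texts.
docs = [
    "Football match results and football news",
    "Election news and election results",
    "Football transfer news",
]

vocab, vectors = tfidf_vectors(docs)
print(vocab)  # 4 retained terms out of 7 distinct terms in the corpus
```

With this weighting, a term that occurs in every document (here "news") receives a weight of zero, so it contributes nothing to distinguishing categories even though it stays in the vocabulary.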