72,339 research outputs found
Categorization of web sites in Turkey with SVM
Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2004Includes bibliographical references (leaves: 61-63)Text in English; Abstract: Turkish and Englishix, 70 leavesIn this study of topic .Categorization of Web Sites in Turkey with SVM. after a brief introduction to what the World Wide Web is and a more detailed description of text categorization and web site categorization concepts, categorization of web sites including all prerequisites for classification task takes part. As an information resource the web has an undeniable importance in human life. However the huge structure of the web and its uncontrolled growth led to new information retrieval research areas to be risen in last years. Web mining, the general name of these studies, investigates activities and structures on the web to automatically discover and gather meaningful information from the web documents. It consists of three subfields: .Web Structure Mining., .Web Content Mining. and .Web Usage Mining.. In this project, web content mining concept was applied on the web sites in Turkey during the categorization process. Support Vector Machine, a supervised learning method based on statistics and principle of structural risk minimization is used as the machine learning technique for web site categorization. This thesis is intended to draw a conclusion about web site distributions with respect to thematic categorization based on text. The popular web directory Yahoo.s 12 top level categories were used in this project. Beside of the main purpose, we gathered several statistical descriptive informations about web sites and contents used in html pages. Metatag usage percentages, html design structures and plug-in usage are some of these information. The processes taken through solution, start with employing a web downloader which downloads web page contents and other information such as frame content from each web site. Next, manipulating, parsing and simplifying the downloaded documents takes place. At this point, preperations for categorization task are completed. Then, by applying Support Vector Machine (SVM) package SVMLight developed by Thorsten Joachims, web sites are classified under given categories. The classification results obtained in the last section show that there are some over-lapping categories exist and accuracy and precision values are between 60-80. In addition to categorization results, we saw that almost 17 of web sites utilize html frames and 9367 web sites include metakeywords
Text Message Categorization of Collaborative Learning Skills in Online Discussion using Neural Network
This paper presents research in neural network approach for text messages categorization of collaborative learning skill in an online discussion. Although a neural network is a popular method for text categorization in the research area of machine learning, unfortunately, the use of neural network in educational settings is rare. Usually, text categorization by neural network is employed to categorize news articles, emails, product reviews, and web pages. In an online discussion, text categorization that is used to classify the message sent by the student into a certain category is often manual, requiring skilled human specialists. However, human categorization is not an effective way for a number of reasons; time- consuming, labor-intensive, lack of consistency in a category, and costly. Therefore, this paper proposes a neural network approach to code the message automatically. Results show that neural networks achieving useful classification on eight categories of collaborative learning skills in an online discussion as measured based on precision, recall, and balanced F-measure
A Large Visual, Qualitative, and Quantitative Dataset for Web Intelligence Applications
The Web is the communication platform and source of information par excellence. The volume and complexity of its content have grown enormously, with organizing, retrieving, and cleaning Web information becoming a challenge for traditional techniques. Web intelligence is a novel research area to improve Web-based services and applications using artificial intelligence and automatic learning algorithms, for which a large amount of Web-related data are essential. Current datasets are, however, limited and do not combine visual representation and attributes of Web pages. Our work provides a large dataset of 49,438 Web pages, composed of webshots, along with qualitative and quantitative attributes. This dataset covers all the countries in the world and a wide range of topics, such as art, entertainment, economics, business, education, government, news, media, science, and the environment, addressing different cultural characteristics and varied design preferences. We use this dataset to develop three Web Intelligence applications: knowledge extraction on Web design using statistical analysis, recognition of error Web pages using a customized convolutional neural network (CNN) to eliminate invalid pages, and Web categorization based solely on screenshots using a CNN with transfer learning to assist search engines, indexers, and Web directories.This work has been funded by the grant awarded by the Central University of Ecuador through budget certification No. 34 of March 25, 2022 for the development of the research project with code: DOCT-DI-2020-37
TERM WEIGHTING BASED ON INDEX OF GENRE FOR WEB PAGE GENRE CLASSIFICATION
Automating the identification of the genre of web pages becomes an important area in web pages classification, as it can be used to improve the quality of the web search result and to reduce search time. To index the terms used in classification, generally the selected type of weighting is the document-based TF-IDF. However, this method does not consider genre, whereas web page documents have a type of categorization called genre. With the existence of genre, the term appearing often in a genre should be more significant in document indexing compared to the term appearing frequently in many genres despites its high TF-IDF value. We proposed a new weighting method for web page documents indexing called inverse genre frequency (IGF). This method is based on genre, a manual categorization done semantically from previous research. Experimental results show that the term weighting based on index of genre (TF-IGF) performed better compared to term weighting based on index of document (TF-IDF), with the highest value of accuracy, precision, recall, and F-measure in case of excluding the genre-specific keywords were 78%, 80.2%, 78%, and 77.4% respectively, and in case of including the genre-specific keywords were 78.9%, 78.7%, 78.9%, and 78.1% respectively
Learning Deep Visual Object Models From Noisy Web Data: How to Make it Work
Deep networks thrive when trained on large scale data collections. This has
given ImageNet a central role in the development of deep architectures for
visual object classification. However, ImageNet was created during a specific
period in time, and as such it is prone to aging, as well as dataset bias
issues. Moving beyond fixed training datasets will lead to more robust visual
systems, especially when deployed on robots in new environments which must
train on the objects they encounter there. To make this possible, it is
important to break free from the need for manual annotators. Recent work has
begun to investigate how to use the massive amount of images available on the
Web in place of manual image annotations. We contribute to this research thread
with two findings: (1) a study correlating a given level of noisily labels to
the expected drop in accuracy, for two deep architectures, on two different
types of noise, that clearly identifies GoogLeNet as a suitable architecture
for learning from Web data; (2) a recipe for the creation of Web datasets with
minimal noise and maximum visual variability, based on a visual and natural
language processing concept expansion strategy. By combining these two results,
we obtain a method for learning powerful deep object models automatically from
the Web. We confirm the effectiveness of our approach through object
categorization experiments using our Web-derived version of ImageNet on a
popular robot vision benchmark database, and on a lifelong object discovery
task on a mobile robot.Comment: 8 pages, 7 figures, 3 table
Changes in corporate websites and business activity: automatic classification of corporate webpages
[EN] Every time a firm or institution performs an activity on the Web, this is registered, leaving a "digital footprint”. Part this digital footprint is reflected on their websites as these officially represent them on the Web. We plan to automatically monitor the changes that periodically occur in a website to relate them with the business activity. The aim of this paper is to propose a theoretical classification of corporate webpages to associate changes that occur on them with the regular activity of the firms, and to evaluate the possibility of an automatic categorization using classification models. To generate the classification of corporate webpages, a significant number of today corporate webpages were analyzed and observed, distinguishing four theoretical types of corporate webpages. To evaluate the automatic categorization of corporate webpages, a dataset of 1005 today corporate pages was generated by manually labeling them and evaluating their automatic categorization using classification models.This work was partially supported by grants PID2019-107765RB-I00 and funded by MCIN/AEI/10.13039/501100011033.Valenzuela Rubilar, JM.; Domenech, J.; Pont, A. (2022). Changes in corporate websites and business activity: automatic classification of corporate webpages. En 4th International Conference on Advanced Research Methods and Analytics (CARMA 2022). Editorial Universitat Politècnica de València. 213-220. https://doi.org/10.4995/CARMA2022.2022.1509021322
Collaborative Categorization on the Web
Collaborative categorization is an emerging direction for research and innovative
applications. Arguably, collaborative categorization on the Web is an especially
promising emerging form of collaborative Web systems because of both, the
widespread use of the conventional Web and the emergence of the Semantic Web
providing with more semantic information on Web data. This paper discusses this issue
and proposes two approaches: collaborative categorization via category merging and
collaborative categorization proper. The main advantage of the first approach is that it
can be rather easily realized and implemented using existing systems such as Web
browsers and mail clients. A prototype system for collaborative Web usage that uses
category merging for collaborative categorization is described and the results of field
experiments using it are reported. The second approach, called collaborative
categorization proper, however, is more general and scales better. The data structure and
user interface aspects of an approach to collaborative categorization proper are
discussed
- …