63 research outputs found
A pattern mining approach for information filtering systems
Clearly identifying the boundary between positive and negative streams is a major challenge for information filtering systems. Several attempts have used negative feedback to address this challenge; however, using negative relevance feedback to improve the effectiveness of information filtering raises two issues. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach that selects some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on the RCV1 data collection, and substantial experiments show that the proposed approach achieves encouraging performance that is also consistent for adaptive filtering.
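The three-way term classification mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual criterion, so the rule used here is an assumption: a term is "specific" to one side if it occurs only in documents from that side, and "general" if it occurs on both. The function name `classify_terms` and the sample documents are hypothetical.

```python
def classify_terms(positive_docs, negative_docs):
    """Split terms into positive-specific, general, and negative-specific sets.

    Assumption (not taken from the paper): a term is specific to one side
    when it appears only in that side's documents; terms seen on both
    sides are treated as general.
    """
    pos_terms = set()
    for doc in positive_docs:
        pos_terms.update(doc.lower().split())

    neg_terms = set()
    for doc in negative_docs:
        neg_terms.update(doc.lower().split())

    return {
        "positive_specific": pos_terms - neg_terms,
        "general": pos_terms & neg_terms,
        "negative_specific": neg_terms - pos_terms,
    }


# Hypothetical toy documents, for illustration only.
positive = ["stock market rises", "market rally continues"]
negative = ["football market news"]
categories = classify_terms(positive, negative)
```

Each category could then be revised with its own strategy, for example by lowering the weight of general terms that also appear in selected negative samples.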
A Classification Mechanism To Avoid Useless Data From OSN Walls
The aim of the present work is to propose and experimentally evaluate an automated system, called Filtered Wall (FW), that is able to filter unwanted messages from OSN user walls. One fundamental issue in today's Online Social Networks (OSNs) is giving users the ability to control the messages posted on their own private space, so that unwanted content is not displayed. This is achieved through a flexible rule-based system that lets users customize the filtering criteria applied to their walls, and a Machine Learning-based soft classifier that automatically labels messages in support of content-based filtering. The original set of features, derived from endogenous properties of short texts, is enlarged here to include exogenous knowledge related to the context from which the messages originate. As far as the learning model is concerned, we confirm in the current paper the use of neural learning, which is today recognized as one of the most efficient solutions in text classification. In particular, we base the overall short text classification strategy on Radial Basis Function Networks (RBFN) for their proven capabilities in acting as soft classifiers and in managing noisy data and intrinsically vague classes.
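The combination of a soft classifier and user-defined rules can be sketched as follows. This is a simplified illustration, not the system described in the abstract: it uses one Gaussian radial basis function per class with normalized activations and no trained output layer, and the feature vectors, class labels, and the names `rbf_memberships` and `blocked_by_rule` are all hypothetical.

```python
import math


def rbf_memberships(x, prototypes, width=1.0):
    """Return soft class memberships from normalized Gaussian RBF activations.

    Simplification: one prototype (center) per class and no learned output
    weights, unlike a fully trained RBF network.
    """
    activations = {}
    for label, center in prototypes.items():
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
        activations[label] = math.exp(-squared_distance / (2 * width ** 2))
    total = sum(activations.values())
    return {label: value / total for label, value in activations.items()}


def blocked_by_rule(memberships, rule):
    """A rule maps class labels to membership thresholds set by the user."""
    return any(memberships.get(label, 0.0) >= limit for label, limit in rule.items())


# Hypothetical two-feature prototypes and a user rule.
prototypes = {"neutral": (0.0, 0.0), "vulgar": (1.0, 1.0)}
user_rule = {"vulgar": 0.6}

suspicious = rbf_memberships((0.9, 1.0), prototypes)
harmless = rbf_memberships((0.1, 0.0), prototypes)
```

Because the classifier returns graded memberships rather than a hard label, each user can choose how strict the wall filter should be by changing the thresholds in the rule.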
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis
Web semantic access in specific domains calls for specialized search engines
with enhanced semantic querying and indexing capacities, which pertain both to
information retrieval (IR) and to information extraction (IE). A rich
linguistic analysis is required either to identify the relevant semantic units
to index and weight them according to linguistic specific statistical
distribution, or as the basis of an information extraction process. Recent
developments make Natural Language Processing (NLP) techniques reliable enough
to process large collections of documents and to enrich them with semantic
annotations. This paper focuses on the design and the development of a text
processing platform, Ogmios, which has been developed in the ALVIS project. The
Ogmios platform exploits existing NLP modules and resources, which may be tuned
to specific domains and produces linguistically annotated documents. We show
how the three constraints of genericity, domain semantic awareness and
performance can be handled all together.
Automatic email filtering: an adaptive, multi-level approach
This article proposes a configurable email system with several filtering levels: a simple filter based on the information contained in the email header; a Boolean filter based on the presence or absence of keywords in the email body; a vector filter based on the contribution weights of the email's keywords; and an in-depth filter based on the linguistic properties that characterize the structure and content of the email. We propose an adaptive solution that allows the system to learn from data, revise its knowledge, and adapt to changes in the user's interests and to variation in the nature of emails over time. In addition, we use a lexical network to improve the representation of the email by taking its semantic aspect into account.
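The first three filtering levels listed in the abstract (header, Boolean, vector) can be sketched as a simple cascade. This is an illustrative sketch under assumptions, not the authors' implementation: the email representation as a dictionary, the function names, the weights, and the threshold are all hypothetical, and the fourth, linguistic level and the adaptive learning step are omitted.

```python
def header_filter(email, blocked_senders):
    """Level 1: decide from header information only (here, the sender)."""
    return email["from"] in blocked_senders


def boolean_filter(email, keywords):
    """Level 2: fire if any keyword is present in the body."""
    words = set(email["body"].lower().split())
    return any(keyword in words for keyword in keywords)


def vector_score(email, weights):
    """Level 3: sum the contribution weights of the words in the body."""
    return sum(weights.get(word, 0.0) for word in email["body"].lower().split())


def filter_email(email, blocked_senders, keywords, weights, threshold):
    """Return the name of the first level that filters the email, or None."""
    if header_filter(email, blocked_senders):
        return "header"
    if boolean_filter(email, keywords):
        return "boolean"
    if vector_score(email, weights) >= threshold:
        return "vector"
    return None


# Hypothetical configuration and messages.
blocked = {"spam@example.com"}
keywords = {"lottery"}
weights = {"free": 0.6, "offer": 0.5}
config = (blocked, keywords, weights, 1.0)

r1 = filter_email({"from": "spam@example.com", "body": "hello"}, *config)
r2 = filter_email({"from": "a@example.com", "body": "You won the lottery"}, *config)
r3 = filter_email({"from": "a@example.com", "body": "free offer inside"}, *config)
r4 = filter_email({"from": "a@example.com", "body": "see you tomorrow"}, *config)
```

Ordering the levels from cheapest to most expensive means most messages are decided before the costlier checks run; an adaptive version would update `weights` and `keywords` from user feedback over time.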