2,092 research outputs found
An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method
The existence of billions of web documents has severely affected the performance and reliability of web search. Near duplicate web pages play an important role in this performance degradation when data are integrated from heterogeneous sources. Web mining faces serious problems due to the existence of such documents. These pages increase the index storage space and thereby increase the serving cost. Introducing efficient methods to detect and remove such documents from the Web not only decreases computation time but also increases the relevancy of search results. We present a novel approach for finding near duplicate web pages that can be incorporated into plagiarism detection, spam detection and focused web crawling scenarios. We propose an efficient method for finding near duplicates of an input web page in a huge repository. The proposed TDW matrix based algorithm has three phases: rendering, filtering and verification. It receives an input web page and a threshold in the first phase, applies prefix filtering and positional filtering to reduce the size of the record set in the second phase, and returns an optimal set of near duplicate web pages in the verification phase by using the Minimum Weight Overlapping (MWO) method. The experimental results show that our algorithm performs better on two benchmark measures, precision and recall, and reduces the size of the competing record set.
DOI: http://dx.doi.org/10.11591/ijece.v1i2.7
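The filtering phase described in this abstract can be illustrated with a minimal sketch of prefix filtering. It assumes Jaccard similarity over token sets as the verification measure, which the abstract does not state; the TDW matrix, positional filtering and the MWO method are not reproduced here. All function names are hypothetical.

```python
import math
from collections import Counter


def prefix_length(n_tokens, threshold):
    # Two sets can only reach Jaccard >= threshold if their prefixes of this
    # length, taken under one shared token order, have a token in common.
    # The small epsilon guards against floating-point error in the product.
    return n_tokens - math.ceil(threshold * n_tokens - 1e-9) + 1


def jaccard(a, b):
    # Jaccard similarity of two sets (assumed measure, not from the abstract).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(query_tokens, repository, threshold):
    # Document frequency of each token; rare tokens sort first so that
    # prefixes are as selective as possible.
    freq = Counter(t for tokens in repository.values() for t in set(tokens))

    def ordered(tokens):
        return sorted(set(tokens), key=lambda t: (freq[t], t))

    q = ordered(query_tokens)
    q_prefix = set(q[:prefix_length(len(q), threshold)])

    results = []
    for doc_id, tokens in repository.items():
        d = ordered(tokens)
        d_prefix = set(d[:prefix_length(len(d), threshold)])
        if not (q_prefix & d_prefix):
            continue  # filtered out without computing the full similarity
        sim = jaccard(set(q), set(d))
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results, key=lambda r: (-r[1], r[0]))


repository = {
    "a": "the quick brown fox jumps over the lazy dog".split(),
    "b": "the quick brown fox leaps over the lazy dog".split(),
    "c": "completely unrelated text about cricket bowling".split(),
}
query = "the quick brown fox jumps over the lazy dog".split()
matches = near_duplicates(query, repository, threshold=0.7)
```

In this toy repository, document "c" shares no prefix token with the query and is discarded before any similarity is computed, which is the reduction in the candidate record set that the filtering phase aims for.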
An Approach for Optimal Feature Subset Selection using a New Term Weighting Scheme and Mutual Information
With the development of the web, large numbers of documents are available on the Internet, and they are growing drastically day by day. Hence automatic text categorization becomes more and more important for dealing with massive data. However, the major problem of document categorization is the high dimensionality of the feature space. Measures that decrease the feature dimension without decreasing recognition performance are known as feature extraction or feature selection. Dealing with a reduced set of relevant features can be more efficient and effective. The objective of feature selection is to find a subset of features that has all the characteristics of the full feature set. Dependency among features is also important for classification. In past years, various metrics have been proposed to measure the dependency among different features. A popular approach to realizing dependency is maximal relevance feature selection: selecting the features with the highest relevance to the target class. The new feature weighting scheme we propose achieves a tremendous improvement in dimensionality reduction of the feature space. The experimental results clearly show that this integrated method works far better than the others.
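Maximal relevance feature selection, as mentioned in this abstract, can be sketched as ranking terms by their mutual information with the class label. The sketch below assumes binary term presence as the feature value and plain mutual information as the relevance score; the authors' new term weighting scheme is not described in the abstract and is not reproduced. Function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    # I(X; Y) in bits, estimated from empirical counts.
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px = Counter(feature_values)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def select_top_k(docs, labels, k):
    # docs is a list of token sets; each term's feature is its presence.
    vocab = sorted({t for d in docs for t in d})
    scores = {
        t: mutual_information([t in d for d in docs], labels) for t in vocab
    }
    # Highest relevance first; ties are broken alphabetically.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"vote", "team"},
    {"vote", "law"},
]
labels = ["sport", "sport", "politics", "politics"]
selected = select_top_k(docs, labels, k=2)
```

Here "goal" and "vote" each determine the class exactly and are selected, while "team" occurs equally in both classes and carries no information about the label.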
Recommended from our members
A Lagrangian analysis of ice-supersaturated air over the North Atlantic
Understanding the nature of air parcels that exhibit ice-supersaturation is important because they are the regions of potential formation of both cirrus and aircraft contrails, which affect the radiation balance. Ice-supersaturated air parcels in the upper troposphere and lower stratosphere over the North Atlantic are investigated using Lagrangian trajectories. The trajectory calculations use ERA-Interim data for three winter and three summer seasons, resulting in approximately 200,000 trajectories with ice-supersaturation for each season. For both summer and winter, the median duration of ice-supersaturation along a trajectory is less than 6 hours. Of the air that becomes ice-supersaturated, 5% in the troposphere and 23% in the stratosphere will remain ice-supersaturated for at least 24 hours. Weighting the ice-supersaturation duration with the observed frequency indicates the likely overall importance of the longer duration ice-supersaturated trajectories. Ice-supersaturated air parcels typically experience a decrease in moisture content while ice-supersaturated, suggesting that cirrus clouds eventually form in the majority of such air. A comparison is made between short-lived (less than 24 h) and long-lived (greater than 24 h) ice-supersaturated air flows. For both air flows, ice-supersaturation occurs around the northernmost part of the trajectory. Short-lived ice-supersaturated air flows show no significant differences in speed or direction of movement to subsaturated air parcels. However, long-lived ice-supersaturated air occurs in slower moving air flows, which implies that they are not associated with the fastest moving air through a jet stream.
The contribution of greenhouse gases to the recent slowdown in global-mean temperature trends
The recent slowdown in the rate of increase in global-mean surface temperature (GMST) has generated extensive discussion, but little attention has been given to the contribution of time-varying trends in greenhouse gas concentrations. We use a simple model approach to quantify this contribution. Between 1985 and 2003, greenhouse gases (including well-mixed greenhouse gases, tropospheric and stratospheric ozone, and stratospheric water vapour from methane oxidation) caused a reduction in GMST trend of around 0.03–0.05 K decade⁻¹, which is around 18%–25% of the observed trend over that period. The main contributors to this reduction are the rapid change in the growth rates of ozone-depleting gases (with this contribution slightly opposed by stratospheric ozone depletion itself) and the weakening in growth rates of methane and tropospheric ozone radiative forcing. Although CO2 is the dominant greenhouse gas contributor to GMST trends, the continued increase in CO2 concentrations offsets only about 30% of the simulated trend reduction due to these other contributors. These results emphasize that trends in non-CO2 greenhouse gas concentrations can make significant positive and negative contributions to changes in the rate of warming, and that they need to be considered more closely in analyses of the causes of such variations.
Balltracking: an highly efficient method for tracking flow fields
We present a method for tracking solar photospheric flows that is highly efficient, and demonstrate it using high resolution MDI continuum images. The method involves making a surface from the photospheric granulation data, and allowing many small floating tracers or balls to be moved around by the evolving granulation pattern. The results are tested against synthesised granulation with known flow fields and compared to the results produced by local correlation tracking (LCT). The results from this new method have similar accuracy to those produced by LCT. We also investigate the maximum spatial and temporal resolution of the velocity field that it is possible to extract, based on the statistical properties of the granulation data. We conclude that both methods produce results that are close to the maximum resolution possible from granulation data. The code runs very significantly faster than our similarly optimised LCT code, making real time applications on large data sets possible. The tracking method is not limited to photospheric flows, and will also work on any velocity field where there are visible moving features of known scale length.
THE EFFECT OF INDIVIDUALISED COACHING INTERVENTIONS ON ELITE YOUNG FAST BOWLERS’ TECHNIQUE
Fast bowling in cricket is an activity well recognised as having a high injury prevalence. Previous research has associated lower back injury with aspects of fast bowling technique. Coaching interventions that may decrease the likelihood of injury, whilst maintaining or increasing ball speed, remain a priority within the sport. Selected kinematics of the bowling action of 14 elite young fast bowlers were measured using an 18-camera Vicon Motion Analysis System. Subjects were tested before and after a two-year coaching intervention period, during which subject-specific coaching interventions were provided. Mann-Whitney tests were used to identify significant differences in the change in the selected kinematics between those bowlers who were coached or un-coached on each specific aspect. Coached athletes demonstrated a significant change in shoulder alignment at back foot contact (more side-on, P = 0.002) and shoulder counter-rotation (decreased, P = 0.001) relative to un-coached athletes. There was no difference in the amount of change in flexion angles of the front or back knee or lower trunk side-flexion between those who received coaching intervention and those who did not. This study shows that specific aspects of fast bowling technique in elite players can change over a two-year period and that the change may be attributed to coaching intervention.
Regional emission metrics for short-lived climate forcers from multiple models
For short-lived climate forcers (SLCFs), the impact of emissions depends on where and when the emissions take place. Comprehensive new calculations of various emission metrics for SLCFs are presented based on radiative forcing (RF) values calculated in four different (chemical-transport or coupled chemistry–climate) models. We distinguish between emissions during summer (May–October) and winter (November–April) for emissions in Europe and East Asia, as well as from the global shipping sector and global emissions. The species included in this study are aerosols and aerosol precursors (BC, OC, SO2, NH3), as well as ozone precursors (NOx, CO, VOCs), which also influence aerosols to a lesser degree. Emission metrics for global climate responses of these emissions, as well as for CH4, have been calculated using global warming potential (GWP) and global temperature change potential (GTP), based on dedicated RF simulations by four global models. The emission metrics include indirect cloud effects of aerosols and the semi-direct forcing for BC. In addition to the standard emission metrics for pulse and sustained emissions, we have also calculated a new emission metric designed for an emission profile consisting of a ramping period of 15 years followed by sustained emissions, which is more appropriate for a gradual implementation of mitigation policies.
For the aerosols, the emission metric values are larger in magnitude for emissions in Europe than East Asia and for summer than winter. A variation is also observed for the ozone precursors, with largest values for emissions in East Asia and winter for CO and in Europe and summer for VOCs. In general, the variations between the emission metrics derived from different models are larger than the variations between regions and seasons, but the regional and seasonal variations for the best estimate also hold for most of the models individually. Further, the estimated climate impact of an illustrative mitigation policy package is robust even when accounting for the fact that the magnitude of emission metrics for different species in a given model is correlated. For the ramping emission metrics, the values are generally larger than for pulse or sustained emissions, which holds for all SLCFs. For SLCF mitigation policies, the dependency of metric values on the region and season of emission should be considered.