Clustering: finding patterns in the darkness
Machine learning is changing the world and fuelling Industry 4.0. These statistical methods focus on identifying patterns in data to provide an intelligent response to specific requests. Although understanding data tends to require expert knowledge to supervise the decision-making process, some techniques need no supervision. These unsupervised techniques work blindly, relying only on data similarity. One of the most popular areas in this field is clustering. Clustering groups data so that elements within a cluster are strongly similar while the clusters themselves remain clearly distinct from one another. The field started with the K-means algorithm, one of the most popular algorithms in machine learning, with extensive applications. Currently, there are multiple strategies to deal with the clustering problem. This review introduces some of the classical algorithms, focusing significantly on algorithms based on evolutionary computation, and explains some current applications of clustering to large datasets.
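The K-means loop the abstract refers to can be sketched in a few lines (an illustrative Python version, not code from the review; the point data and cluster count below are made up):

```python
import random
import math

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centroids:  # assignments stable: converged
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs should yield one centroid near each.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
cents, _ = kmeans(data, 2)
```

The update alternates assignment and averaging until no centroid moves, which is the whole algorithm.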
Software testing or the bugs’ nightmare
Software development is not error-free. For decades, bugs –including physical ones– have been a significant development problem requiring major maintenance efforts, and in some cases fixing a bug has even introduced new ones. One of the main reasons for bugs' prominence is their ability to hide: finding them is difficult and costly in terms of time and resources. However, software testing has made significant progress in identifying them by using different strategies that combine knowledge from every single part of the program. This paper humbly reviews different approaches from software testing that discover bugs automatically and presents some state-of-the-art methods and tools currently used in this area. It covers three testing strategies: search-based methods, symbolic execution, and fuzzers. It also provides some insight into the application of diversity in these areas, and discusses common and future challenges in automatic test generation that still need to be addressed.
Malware: the never-ending arms race
"Antivirus is dead" -- and probably so is every detection system that focuses on a single strategy for indicators of compromise. This famous quote, which Brian Dye --Symantec's senior vice president-- stated in 2014, is the best representation of the current situation in malware detection and mitigation. Concealment strategies have evolved significantly in recent years: not just the classical ones based on polymorphic and metamorphic methodologies, which killed the signature-based detection that antiviruses use, but also fileless malware, i.e. malware resident only in volatile memory, which renders any disk analysis senseless. This review provides a historical background of the different concealment strategies introduced to protect malicious --and not necessarily malicious-- software from different detection or analysis techniques. It covers binary, static and dynamic analysis, as well as new strategies based on machine learning, from both the attackers' and the defenders' perspectives.
Inequality of Outcomes and Inequality of Opportunities in Brazil
This paper departs from John Roemer's theory of equality of opportunities. We seek to determine what part of observed outcome inequality may be attributed to differences in observed 'circumstances', including family background, and what part is due to 'personal efforts'. We use a micro-econometric technique to simulate what the distribution of outcomes would look like if 'circumstances' were the same for everybody. This technique is applied to Brazilian data from the 1996 household survey, both for earnings and for household incomes. It is shown that observed circumstances are a major source of outcome inequality in Brazil, probably more so than in other countries for which information is available. Nevertheless, the level of inequality after observed circumstances are equalized remains very high in Brazil.
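The simulation idea can be illustrated with a toy model. The log-linear earnings equation, its coefficients, and the Gini comparison below are hypothetical stand-ins for the paper's actual micro-econometric specification:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative sample (standard formula)."""
    x = np.sort(x)
    n = len(x)
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())

rng = np.random.default_rng(1)
n = 10_000
circumstances = rng.normal(0.0, 1.0, n)   # e.g. a family-background index
effort = rng.normal(0.0, 1.0, n)          # residual 'personal effort'
earnings = np.exp(1.0 * circumstances + 0.5 * effort)

# Counterfactual: give everybody the mean circumstances, keep efforts.
equalized = np.exp(1.0 * circumstances.mean() + 0.5 * effort)
```

Comparing `gini(earnings)` with `gini(equalized)` shows how much measured inequality the equalisation of circumstances removes in this toy setup.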
Mimicking anti-viruses with machine learning and entropy profiles
The quality of anti-virus software relies on simple patterns extracted from binary files. Although these patterns have proven to work for detecting specific pieces of software, they are extremely sensitive to concealment strategies such as polymorphism or metamorphism. These limitations also make anti-virus software predictable, creating a security breach: any black hat with enough information about an anti-virus engine's behaviour can make their own copy of the software, without any access to the original implementation or database. In this work, we show how this is indeed possible by combining entropy patterns with classification algorithms. Our results, applied to 57 different anti-virus engines, show that we can mimic their behaviour with an accuracy close to 98% in the best case and 75% in the worst, applied to Windows disk-resident malware.
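The entropy-pattern half of this approach can be sketched as a sliding-window Shannon entropy profile; the window and step sizes below are arbitrary choices, and the classifier trained on these feature vectors is not shown:

```python
import math
from collections import Counter

def entropy(window: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte (0.0 to 8.0)."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(window).values())

def entropy_profile(data: bytes, window=256, step=128):
    """Entropy of each overlapping window; the resulting curve is the
    kind of feature vector a classifier could be trained on."""
    return [entropy(data[i:i + window])
            for i in range(0, max(1, len(data) - window + 1), step)]

# Repetitive (low-entropy) vs pseudo-random-looking (high-entropy) input.
low = b"A" * 1024
high = bytes(range(256)) * 4
```

Packed or encrypted regions of a binary push the profile towards 8 bits per byte, while padding and plain code sit much lower, which is what makes the curve discriminative.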
Hashing fuzzing: introducing input diversity to improve crash detection
The utility of a test set of program inputs is strongly influenced by its diversity and its size. Syntax coverage has become a standard proxy for diversity. Although more sophisticated measures exist, such as the proximity of a sample to a uniform distribution, methods to use them tend to be type-dependent. We use r-wise hash functions to create a novel, semantics-preserving testability transformation for C programs that we call HashFuzz. Use of HashFuzz improves the diversity of test sets produced by instrumentation-based fuzzers. We evaluate the effect of the HashFuzz transformation on eight programs from the Google Fuzzer Test Suite using four state-of-the-art fuzzers that have been widely used in previous research. We demonstrate pronounced improvements in the performance of the test sets for the transformed programs across all the fuzzers that we used. These include strong improvements in diversity in every case; maintenance of, or small improvements in, branch coverage – up to a 4.8% improvement in the best case; and significant improvements in unique crash detection – increases of between 28% and 97% compared to test sets for the untransformed programs.
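As a rough illustration of the hashing ingredient (not HashFuzz's actual instrumentation), a pairwise-independent hash family, the degree-1 case of the r-wise construction, can fold a branch trace into a single value:

```python
import random

P = 2_147_483_647  # Mersenne prime serving as the hash field

def make_hash(m, seed=0):
    """Draw h(x) = ((a*x + b) mod P) mod m from a pairwise-independent
    family; the r-wise case replaces a*x + b with a random polynomial
    of degree r - 1."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def run_hash(branch_trace, h):
    """Fold a sequence of covered branch ids into one value, so executions
    that take different paths tend to produce distinguishable outputs."""
    acc = 0
    for branch_id in branch_trace:
        acc = h(acc * 31 + branch_id)
    return acc

h = make_hash(1 << 16)
```

Exposing such a hash as extra program output gives an instrumentation-based fuzzer a cheap, type-independent signal that two inputs exercised different behaviour.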
Designing large quantum key distribution networks via medoid-based algorithms
The current development of quantum mechanics and its applications poses a threat to modern cryptography as it was conceived. The ability of quantum computers to solve complex mathematical problems, as a strong computational novelty, is the root of that risk. However, quantum technologies can also counter this threat by leveraging quantum methods to distribute keys. This field, called Quantum Key Distribution (QKD), is growing, although it still needs further physical foundations to become a reality as popular as the Internet. This work proposes a novel methodology that leverages medoid-based clustering techniques to design quantum key distribution networks on commercial fibre-optic systems. Our methodology focuses on the current limitations of these communication systems, their error loss, and how trusted repeaters can enable proper communication with current technology. We apply our model to current data on a wide territory covering an area of almost 100,000 km2, and show that, considering physical limitations of around 45 km with 3.1 error loss, our design can provide service to the whole area. This technique is the first to extend state-of-the-art network design, which has focused on up to 10 nodes, to networks with more than 200 nodes.
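A simple covering heuristic in the spirit of the proposed methodology (not the paper's algorithm; the coordinates below are invented, and only the 45 km limit comes from the text) might look like:

```python
import math

MAX_LINK_KM = 45.0  # physical link limit quoted in the text

def greedy_repeaters(nodes, max_km=MAX_LINK_KM):
    """Greedily pick trusted-repeater sites from the nodes themselves
    (a medoid-style choice) until every node is within max_km of one."""
    uncovered = set(range(len(nodes)))
    repeaters = []
    while uncovered:
        # The candidate covering the most still-uncovered nodes wins.
        best = max(range(len(nodes)), key=lambda c: sum(
            1 for i in uncovered if math.dist(nodes[c], nodes[i]) <= max_km))
        repeaters.append(best)
        uncovered -= {i for i in uncovered
                      if math.dist(nodes[best], nodes[i]) <= max_km}
    return repeaters

# Toy chain of nodes, coordinates in km (invented for illustration).
nodes = [(0, 0), (30, 0), (60, 0), (90, 0), (120, 0)]
reps = greedy_repeaters(nodes)
```

Placing repeaters at existing nodes rather than at arbitrary coordinates is exactly what makes the problem medoid-based rather than centroid-based.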
Medoid-based clustering using ant colony optimization
The application of ACO-based algorithms in data mining has been growing over the last few years, and several supervised and unsupervised learning algorithms have been developed using this bio-inspired approach. Most recent works on unsupervised learning have focused on clustering, showing the potential of ACO-based techniques. However, there are still clustering areas that remain almost unexplored with these techniques, such as medoid-based clustering. Medoid-based clustering methods are helpful—compared to classical centroid-based techniques—when centroids cannot be easily defined. This paper proposes two medoid-based ACO clustering algorithms, where the only information needed is the distance between data points: one that uses an ACO procedure to determine an optimal medoid set (the METACOC algorithm) and another that adds an automatic selection of the number of clusters (the METACOC-K algorithm). The proposed algorithms are compared against classical clustering approaches using synthetic and real-world datasets.
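A minimal k-medoids update that needs only a distance matrix illustrates the setting METACOC operates in (this is the classical alternating update, not the ACO procedure itself; the toy data is invented):

```python
def k_medoids(dist, k, iters=50):
    """Alternating k-medoids on a precomputed distance matrix: assign each
    item to its nearest medoid, then let each cluster's new medoid be the
    member minimising total within-cluster distance. Assumes no cluster
    goes empty (true for this toy data)."""
    n = len(dist)
    medoids = list(range(k))  # deterministic start: first k items
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i in range(n):
            clusters[min(range(k),
                         key=lambda m: dist[i][medoids[m]])].append(i)
        new = [min(cl, key=lambda c: sum(dist[c][j] for j in cl))
               for cl in clusters]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters

# One-dimensional toy data with two obvious groups: {0,1,2} and {3,4}.
pts = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pts] for a in pts]
meds, cls = k_medoids(dist, 2)
```

Because only `dist` is consulted, the same code works for strings, graphs, or any data with a pairwise distance, which is the appeal of the medoid-based setting.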
MOCDroid: multi-objective evolutionary classifier for Android malware detection
Malware threats are growing while, at the same time, concealment strategies are being used to make them undetectable to current commercial anti-virus engines. Android is one of the target architectures where these problems are especially alarming, due to the platform's wide adoption in everyday devices. Detection is especially relevant for Android markets, which need to ensure that all the software they offer is clean; however, obfuscation has proven effective at evading the detection process. In this paper we leverage third-party calls to bypass the effects of these concealment strategies, since they cannot be obfuscated. We combine clustering and multi-objective optimisation to generate a classifier based on specific behaviours defined by groups of third-party calls. The optimiser ensures that these groups are related to malicious or benign behaviours, removing any non-discriminative pattern. This tool, named MOCDroid, achieves an accuracy of 94.6% in test with 2.12% false positives on real apps extracted from the wild, overcoming all commercial anti-virus engines from VirusTotal.
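The final classification step can be sketched as follows; the behaviour groups below are invented examples standing in for the call clusters MOCDroid's optimiser would produce:

```python
# Hypothetical behaviour groups (sets of third-party calls); in MOCDroid
# these come from the clustering + multi-objective optimisation step.
malicious_groups = [
    {"sendTextMessage", "getDeviceId"},
    {"DexClassLoader", "exec"},
]

def classify(app_calls, groups, threshold=1):
    """Flag an app as malicious when it exhibits at least `threshold`
    complete behaviour groups (every call in the group is present)."""
    hits = sum(1 for g in groups if g <= set(app_calls))
    return hits >= threshold

benign = ["getDeviceId", "loadUrl"]
suspect = ["sendTextMessage", "getDeviceId", "loadUrl"]
```

Matching whole groups of calls, rather than single calls, is what lets the classifier capture behaviours instead of isolated API usages.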
Extending the SACOC algorithm through the Nystrom method for dense manifold data analysis
Data analysis has become an important field over the last decades. The growing amount of data demands new analytical methodologies in order to extract relevant knowledge, and clustering is one of the most competitive techniques in this context. Starting from a dataset, these techniques aim to blindly group the data by similarity. Among the different areas, manifold identification is currently gaining importance. Spectral-based methods, the most widely used methodologies in this area, are however sensitive to metric parameters and noise. To address these problems, new bio-inspired techniques have been combined with different heuristics to improve clustering solutions and their stability, especially for dense datasets. Ant Colony Optimization (ACO) is one of these bio-inspired methodologies. This paper presents an extension of a previous algorithm named Spectral-based ACO Clustering (SACOC), a spectral-based clustering methodology used for manifold identification. This work focuses on improving this algorithm through the Nystrom extension. The new algorithm, named SACON, is able to deal with dense-data problems. We have evaluated the performance of this new approach by comparing it with online clustering algorithms and the Nystrom extension of the Spectral Clustering algorithm on several datasets.
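The Nystrom step can be sketched with a small kernel example (an illustrative RBF kernel on random data, not SACON itself): with a subset of m landmark points, the approximation C W+ C^T avoids ever building the full n x n kernel matrix.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom(X, landmarks, gamma=1.0):
    """Nystrom approximation K ~ C @ pinv(W) @ C.T built from a subset
    of 'landmark' rows, so the full n x n kernel is never required."""
    L = X[landmarks]
    C = rbf(X, L, gamma)   # n x m cross-kernel
    W = rbf(L, L, gamma)   # m x m landmark kernel
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
K_full = rbf(X, X)
# With every point used as a landmark the approximation is exact.
K_approx = nystrom(X, list(range(40)))
```

Using fewer landmarks than points trades exactness for memory and time, which is what makes the extension attractive for dense datasets.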