452 research outputs found

    Identifying high-impact sub-structures for convolution kernels in document-level sentiment classification

    Convolution kernels support the modeling of complex syntactic information in machine-learning tasks. However, such models are highly sensitive to the type and size of the syntactic structure used. It is therefore an important challenge to automatically identify high-impact sub-structures relevant to a given task. In this paper we present a systematic study investigating sequence and convolution kernels, and combinations of them, using different types of sub-structures in document-level sentiment classification. We show that minimal sub-structures extracted from constituency and dependency trees, guided by a polarity lexicon, yield a 1.45-point absolute improvement in accuracy over a bag-of-words classifier on a widely used sentiment corpus.

    XML Schema Clustering with Semantic and Hierarchical Similarity Measures

    With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual similarity of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

    Data Mining: Past, Present, and Future (Data Mining : Masa Lalu, Sekarang, dan Masa Mendatang)

    Data mining has become an established discipline within the domains of artificial intelligence (AI) and knowledge engineering (KE). It is rooted in machine learning and statistics, but extends into other areas of computer science and into other fields such as biology, the environment, finance, and networks. Data mining has received a great deal of attention over the last decade, owing to advances in hardware that provide extraordinary computational power and make it possible to process large data sets. Unlike other areas of study in AI and KE, data mining can arguably be regarded as an application rather than a technology; it is therefore expected to remain a hotly discussed topic in the future, given the exponential growth of data. This paper gives a retrospective of the history of data mining, its current state, and some views on future developments.

    Semi-supervised co-clustering on attributed heterogeneous information networks


    Fast Distributed PageRank Computation

    Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google's search engine). In distributed computing alone, the PageRank vector, or more generally random-walk-based quantities, have been used for several different applications, ranging from determining important nodes to load balancing, search, and identifying connectivity structures. Surprisingly, however, there has been little work towards designing provably efficient fully distributed algorithms for computing PageRank. The difficulty is that traditional iterative methods in the style of matrix-vector multiplication may not always adapt well to the distributed setting, owing to communication bandwidth restrictions and convergence rates. In this paper, we present fast random-walk-based distributed algorithms for computing PageRank in general graphs and prove strong bounds on the round complexity. We first present a distributed algorithm that takes $O(\log n/\epsilon)$ rounds with high probability on any graph (directed or undirected), where $n$ is the network size and $\epsilon$ is the reset probability used in the PageRank computation (typically $\epsilon$ is a fixed constant). We then present a faster algorithm that takes $O(\sqrt{\log n}/\epsilon)$ rounds in undirected graphs. Both of the above algorithms are scalable, as each node sends only a small ($\mathrm{polylog}\, n$) number of bits over each edge per round. To the best of our knowledge, these are the first fully distributed algorithms for computing the PageRank vector with provably efficient running time. Comment: 14 pages
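
The random-walk view of PageRank mentioned in the abstract above can be illustrated with a small, centralized Monte Carlo simulation. This is only a sketch under stated assumptions, not the paper's distributed algorithm: it assumes that every walk stops at each step with the reset probability (here called `eps`), that a walk also stops at a node with no outgoing edges, and that PageRank is estimated from normalized visit counts. The function name `monte_carlo_pagerank`, the parameter `walks_per_node`, and the adjacency-dictionary input format are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def monte_carlo_pagerank(graph, eps=0.15, walks_per_node=200, seed=0):
    """Estimate PageRank by simulating short random walks.

    graph: dict mapping each node to a list of its out-neighbours.
           Assumption: every node appears as a key, even if its list is empty.
    eps:   reset probability; each walk stops with this probability per step.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    visits = defaultdict(int)

    for start in graph:
        for _ in range(walks_per_node):
            current = start
            while True:
                visits[current] += 1
                neighbours = graph[current]
                # Stop on a reset, or when the walk cannot continue
                # (assumption: walks end at nodes with no out-edges).
                if not neighbours or rng.random() < eps:
                    break
                current = rng.choice(neighbours)

    # Normalize visit counts so the estimates sum to 1.
    total = sum(visits.values())
    return {node: visits[node] / total for node in graph}


if __name__ == "__main__":
    # Hypothetical toy graph: four leaves point to a hub, the hub points back.
    toy = {
        "hub": ["a", "b", "c", "d"],
        "a": ["hub"],
        "b": ["hub"],
        "c": ["hub"],
        "d": ["hub"],
    }
    for node, score in sorted(monte_carlo_pagerank(toy).items()):
        print(node, round(score, 3))
```

Each walk has expected length about $1/\epsilon$, which gives some intuition for why the reset probability appears in the denominator of the round bounds quoted above; the paper's actual contribution, running many such walks in parallel across a network under bandwidth limits, is not modeled here.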