A Hybrid Algorithm For Improvement Of XML Documents Clustering

Somaye Ghazanfari

Abstract

As Extensible Markup Language (XML) documents are now widely used on the Web, improving the speed and accuracy of search engines that operate on these documents is important. Clustering can be effective in improving search engine speed. XML document clustering algorithms can be divided into pairwise and incremental algorithms. The main challenge in the class of incremental algorithms, such as Level Structure (XCLS), XCLS+ and XCLS++, is that the order in which the XML documents arrive influences the resulting clustering. In this paper, the order sensitivity of incremental XML clustering algorithms is illustrated using a representative algorithm, XCLS+. A solution to this problem is proposed that consists of two interleaved phases: online and semi-offline. Experimental results show that, for large numbers of documents, the proposed algorithm is faster and achieves relatively higher precision than previous incremental algorithms such as XCLS+.
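The order sensitivity described above can be illustrated with a minimal sketch of a generic single-pass clustering scheme. This is not XCLS+ or the proposed hybrid algorithm: the Jaccard similarity over tag-name sets, the threshold of 0.5, the union-based cluster representative, and the names `jaccard`, `incremental_cluster`, `d1`, `d2` and `d3` are all assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tag names."""
    return len(a & b) / len(a | b)


def incremental_cluster(docs, threshold=0.5):
    """Single-pass clustering (illustrative, not XCLS+).

    Each document is compared with the representative of every existing
    cluster. It joins the most similar cluster if the similarity reaches
    the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"rep": set of tags, "members": [names]}
    for name, tags in docs:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(tags, cluster["rep"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(name)
            # Assumption: the representative is the union of member tags.
            best["rep"] = best["rep"] | tags
        else:
            clusters.append({"rep": set(tags), "members": [name]})
    return sorted(sorted(c["members"]) for c in clusters)


# Three hypothetical documents, reduced to their sets of tag names.
d1 = ("d1", {"a", "b", "c", "d"})
d2 = ("d2", {"c", "d", "e", "f"})
d3 = ("d3", {"a", "b", "c", "d", "e", "f"})

order_a = [d1, d2, d3]
order_b = [d3, d1, d2]

print(incremental_cluster(order_a))  # [['d1', 'd3'], ['d2']]
print(incremental_cluster(order_b))  # [['d1', 'd2', 'd3']]
```

The same three documents yield two clusters in one arrival order and a single cluster in the other, because each assignment is made once against whatever cluster representatives exist at that moment and is never revisited.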

Keywords

Incremental algorithms, XML clustering, XCLS+, Input of documents

ISSN: 1694-2507 (Print)

ISSN: 1694-2108 (Online)
