Parallel Ensemble Techniques for Data Mining Application

Govindarajan Muthukumarasamy

Abstract


Data mining is a powerful technique for extracting hidden predictive information from large datasets, and classification is one of its most popular applications. The goal of classification is to build a classifier with high accuracy and low cost. Since most data mining datasets are very large, working with small subsets of the data can speed up mining tasks. A family of predictors can be generated from different subsets of the training data and combined to achieve higher accuracy at low cost; this is the basic idea of ensemble techniques, whose variants include bagging, AdaBoost, and arcing. In this research work, parallel bagging and parallel AdaBoost with a support vector machine as the base classifier are implemented on parallel hardware. The NSL-KDD datasets are used to examine the essential parameters of parallel ensemble techniques: sample size, number of iterations, number of processors, and threshold. Experiments demonstrate that ensemble techniques can be effectively parallelized and that the parallel ensemble techniques yield more accurate results.
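As a concrete illustration of the parallel bagging scheme described in the abstract, the following Python sketch trains bootstrap replicates of an SVM concurrently and combines them by majority vote. This is a minimal sketch, not the paper's implementation: scikit-learn's SVC as the base learner, joblib for process-level parallelism, integer class labels, and an unweighted vote are all assumptions made for illustration.

# Minimal sketch of parallel bagging with SVM base classifiers.
# Assumed (not from the paper): scikit-learn's SVC, joblib for
# multi-processor training, integer class labels, majority voting.
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def train_one(X, y, sample_size, seed):
    # Draw one bootstrap sample (with replacement) and fit one SVM on it.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=sample_size)
    return SVC(kernel="rbf").fit(X[idx], y[idx])

def parallel_bagging(X, y, n_estimators=10, sample_size=None, n_jobs=4):
    # Train each bootstrap replicate in its own worker process.
    sample_size = sample_size or len(X)
    return Parallel(n_jobs=n_jobs)(
        delayed(train_one)(X, y, sample_size, seed)
        for seed in range(n_estimators))

def predict_majority(models, X):
    # Combine the predictors by an unweighted majority vote
    # (assumes nonnegative integer class labels).
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)

In this sketch, X and y would be the preprocessed NSL-KDD feature matrix and class labels; n_jobs controls how many processors train replicates at once, mirroring the number-of-processors parameter studied in the paper.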


Keywords


Classification; data mining; parallelism; ensemble techniques; bagging; AdaBoost.


References


Bauer, E. and Kohavi, R. (1999), “An empirical comparison of voting classification algorithms: Bagging, boosting, and variants”, Machine Learning, 36: 105-139.

Breiman, L. (1996), “Bagging predictors”, Machine Learning, 24(2): 123-140.

Breiman, L. (1998), “Arcing classifiers”, The Annals of Statistics, 26(3): 801-849.

Burges, C. J. C. (1998), “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery, 2(2): 121-167.

Cao, L. J. and Tay, F. E. H. (2003), “Support Vector Machine With Adaptive Parameters in Financial Time Series Forecasting”, IEEE Transactions on Neural Networks, 14(6).

Cherkassky, V. and Mulier, F. (1998), “Learning from Data: Concepts, Theory and Methods”, John Wiley & Sons, New York.

Freund, Y. and Schapire, R. (1996), “Experiments with a new boosting algorithm”, In Proceedings of the 13th International Conference on Machine Learning, pp. 148-156, Morgan Kaufmann.

Freund, Y. and Schapire, R. (1999), “A short introduction to boosting”, Journal of Japanese Society for Artificial Intelligence, 14(5): 771-780.

Cohen, I., Tian, Q., Zhou, X. S. and Huang, T. S. (2007), “Feature Selection Using Principal Feature Analysis”, In Proceedings of the 15th International Conference on Multimedia, Augsburg, Germany, September, pp. 25-29.

Kearns, M. and Valiant, L. G. (1994), “Cryptographic limitations on learning Boolean formulae and finite automata”, Journal of the Association for Computing Machinery, 41(1): 67-95.

KDD'99 dataset (2010), http://kdd.ics.uci.edu/databases, Irvine, CA, USA.

Vanajakshi, L. and Rilett, L. R. (2004), “A Comparison of the Performance of Artificial Neural Network and Support Vector Machines for the Prediction of Traffic Speed”, IEEE Intelligent Vehicles Symposium, University of Parma, Parma, Italy, pp. 194-199.



