# A Novel Density based Clustering Method using Nearest and Farthest Neighbor with PCA

### Abstract

Common nearest-neighbor density estimators usually perform poorly on high-dimensional datasets. Moreover, they have a time complexity of O(n^2) and require substantial memory, especially when indexing is used. To overcome these limitations, we propose a new method that computes the distances to each point's nearest and farthest neighbors in order to partition the dataset into subgroups. As a result, the time complexity is reduced to O(n log n) and the space complexity becomes constant. After the subgroups are formed, an assembling technique merges them into the final clusters. To cope with high-dimensional data, Principal Component Analysis (PCA) is applied as a preprocessing step before clustering. Extensive experiments on synthetic datasets demonstrate the feasibility of the proposed method. Furthermore, we compared the algorithm with the closely related DBSCAN algorithm on real-world datasets, and the results show significantly higher accuracy for the proposed method.
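The abstract only sketches the method at a high level. The snippet below is a minimal, hypothetical illustration of the pipeline it describes: PCA preprocessing, per-point nearest- and farthest-neighbor distances, and DBSCAN (the paper's comparison baseline) run on the reduced data. It is not the authors' algorithm, and all parameter values (`n_components`, `eps`, `min_samples`) are assumptions chosen for the toy data; the brute-force distance matrix shown here is O(n^2), whereas the paper's contribution is deriving subgroups from these distances in O(n log n).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Toy high-dimensional data: two well-separated blobs in 10-D
X, _ = make_blobs(n_samples=200, n_features=10,
                  centers=[[0] * 10, [10] * 10],
                  cluster_std=1.0, random_state=0)

# PCA preprocessing: reduce to 2 dimensions before clustering
X2 = PCA(n_components=2).fit_transform(X)

# Per-point nearest- and farthest-neighbor distances (illustrative,
# brute-force O(n^2) version; the paper uses these to form subgroups)
D = np.linalg.norm(X2[:, None, :] - X2[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)
nearest = D.min(axis=1)       # distance to nearest neighbor
np.fill_diagonal(D, -np.inf)
farthest = D.max(axis=1)      # distance to farthest neighbor

# Baseline the paper compares against: DBSCAN on the reduced data
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X2)
print(len(set(labels) - {-1}))  # number of clusters found (noise = -1)
```

On this toy data the two blobs remain well separated after projection, so the density-based baseline recovers them; on genuinely high-dimensional, less separable data the PCA step is what keeps neighbor distances meaningful.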



*International Journal of Information & Communication Technology Research*, *9*(2), 27-34. Retrieved from http://journal.itrc.ac.ir/index.php/ijictr/article/view/8

This work is licensed under a Creative Commons **Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)** License.