kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data

The k-Nearest Neighbors classifier is a simple yet effective and widely used method in data mining. Applying this model directly in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this m...

Bibliographic Details
Main Authors:	Maillo, Jesus, Ramirez, Sergio, Triguero, Isaac, Herrera, Francisco
Format:	Article
Published:	Elsevier 2016
Subjects:	K-nearest neighbors; Big data; Apache Hadoop; Apache Spark; MapReduce
Online Access:	https://eprints.nottingham.ac.uk/34013/

author	Maillo, Jesus; Ramirez, Sergio; Triguero, Isaac; Herrera, Francisco
building	Nottingham Research Data Repository
collection	Online Access
description	The k-Nearest Neighbors classifier is a simple yet effective and widely used method in data mining. Applying this model directly in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit newly arising technologies. In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large numbers of unseen cases against a big training dataset. The map phase computes the k-nearest neighbors in different training data splits. Afterwards, multiple reducers determine the definitive neighbors from the lists obtained in the map phase. The key point of this proposal lies in the management of the test set, keeping it in memory when possible. Otherwise, it is split into a minimum number of pieces, applying a MapReduce per chunk and using the caching capabilities of Spark to reuse the previously partitioned training set. In our experiments we study the differences between Hadoop and Spark implementations with datasets of up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work, an open-source Spark package is available.
first_indexed	2025-11-14T19:21:14Z
format	Article
id	nottingham-34013
institution	University of Nottingham Malaysia Campus
institution_category	Local University
last_indexed	2025-11-14T19:21:14Z
publishDate	2016
publisher	Elsevier
recordtype	eprints
repository_type	Digital Repository
citation	Maillo, Jesus, Ramirez, Sergio, Triguero, Isaac and Herrera, Francisco (2016) kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowledge-Based Systems. ISSN 1872-7409 (In Press). Elsevier, 2016-06-14. Article, peer reviewed. http://www.sciencedirect.com/science/article/pii/S0950705116301757 doi:10.1016/j.knosys.2016.06.012
title	kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data
topic	K-nearest neighbors; Big data; Apache Hadoop; Apache Spark; MapReduce
url	https://eprints.nottingham.ac.uk/34013/
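
The map and reduce phases outlined in the description can be illustrated with a minimal single-machine sketch. This is not the authors' Spark package: plain Python lists stand in for Spark partitions, and every function name here (map_phase, reduce_phase, classify, classify_in_chunks) is hypothetical. It assumes Euclidean distance and majority voting, neither of which the record specifies.

```python
import heapq
import math
from collections import Counter


def map_phase(train_split, test_points, k):
    """For one training split, collect the k nearest (distance, label)
    candidates for every test point."""
    candidates = []
    for t in test_points:
        dists = [(math.dist(t, x), label) for x, label in train_split]
        candidates.append(heapq.nsmallest(k, dists))
    return candidates


def reduce_phase(left, right, k):
    """Merge two per-split candidate lists, keeping the k closest
    neighbors for each test point."""
    return [heapq.nsmallest(k, a + b) for a, b in zip(left, right)]


def classify(train_splits, test_points, k):
    """Run the map phase on every split, reduce the partial results,
    then take a majority vote (an assumption of this sketch)."""
    partial = [map_phase(split, test_points, k) for split in train_splits]
    merged = partial[0]
    for p in partial[1:]:
        merged = reduce_phase(merged, p, k)
    return [Counter(lbl for _, lbl in c).most_common(1)[0][0] for c in merged]


def classify_in_chunks(train_splits, test_points, k, chunk_size):
    """When the test set does not fit in memory, process it chunk by
    chunk while reusing the same training splits."""
    predictions = []
    for i in range(0, len(test_points), chunk_size):
        chunk = test_points[i:i + chunk_size]
        predictions.extend(classify(train_splits, chunk, k))
    return predictions


# Toy data: two training splits of (features, label) pairs.
split_1 = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((5.0, 5.0), "b")]
split_2 = [((5.1, 5.0), "b"), ((0.0, 0.2), "a"), ((4.9, 5.0), "b")]
test_set = [(0.05, 0.05), (5.0, 5.1)]

predictions = classify_in_chunks([split_1, split_2], test_set, k=3, chunk_size=1)
```

Because each split returns its own k best candidates, merging them yields the same neighbors as a search over the whole training set, which is what makes the classification exact rather than approximate.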

Similar Items