Exact fuzzy k-Nearest neighbor classification for big datasets

The k-Nearest Neighbors (kNN) classifier is one of the most effective methods in supervised learning problems. It classifies unseen cases by comparing their similarity with the training data. However, it gives each labeled sample the same importance when classifying. There are several approaches to...

Bibliographic Details
Main Authors: Maillo, Jesus; Luengo, Julian; García, Salvador; Herrera, Francisco; Triguero, Isaac
Format: Conference or Workshop Item
Published: 2017
Online Access: https://eprints.nottingham.ac.uk/44937/
author Maillo, Jesus
Luengo, Julian
García, Salvador
Herrera, Francisco
Triguero, Isaac
building Nottingham Research Data Repository
collection Online Access
description The k-Nearest Neighbors (kNN) classifier is one of the most effective methods in supervised learning problems. It classifies unseen cases by comparing their similarity with the training data. However, it gives each labeled sample the same importance when classifying. There are several approaches to enhance its precision, with the Fuzzy k Nearest Neighbors (Fuzzy-kNN) classifier being among the most successful. Fuzzy-kNN computes a fuzzy degree of membership of each instance to the classes of the problem. As a result, it generates smoother borders between classes. Although a kNN approach for big datasets already exists, there is no fuzzy variant able to manage that volume of data. Moreover, calculating this class membership adds extra computational cost, making the method even less scalable to large datasets because of its memory needs and high runtime. In this work, we present an exact and distributed approach to run the Fuzzy-kNN classifier on big datasets based on Spark, which provides the same precision as the original algorithm. It consists of two separate stages. The first stage transforms the training set by adding the class membership degrees. The second stage classifies the test set with the kNN algorithm, using the class memberships computed previously. In our experiments, we study the scaling-up capabilities of the proposed approach with datasets of up to 11 million instances, showing promising results.
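The two stages described above can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation, and the abstract does not state any formulas. The sketch assumes the commonly used Fuzzy-kNN formulation (0.51/0.49 membership initialization and inverse-distance weighting with fuzzifier m). The function names, the toy data, and the parameter values are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, query, k, skip=None):
    """Return indices of the k training points closest to query (Euclidean)."""
    dists = [(math.dist(x, query), i)
             for i, (x, _) in enumerate(train) if i != skip]
    return [i for _, i in sorted(dists)[:k]]


def compute_memberships(train, classes, k):
    """Stage 1: attach a class-membership vector to every training sample.

    Assumed rule: a sample gets 0.51 for its own label plus 0.49 times the
    share of each class among its k nearest neighbors (itself excluded).
    """
    memberships = []
    for i, (x, label) in enumerate(train):
        counts = Counter(train[j][1] for j in k_nearest(train, x, k, skip=i))
        u = {}
        for c in classes:
            share = 0.49 * counts[c] / k
            u[c] = share + 0.51 if c == label else share
        memberships.append(u)
    return memberships


def classify(train, memberships, classes, query, k, m=2.0):
    """Stage 2: kNN search, then combine the neighbors' memberships.

    Each neighbor is weighted by 1 / distance^(2 / (m - 1)); m is the
    fuzzifier and must be greater than 1 (assumed default: 2).
    """
    neigh = k_nearest(train, query, k)
    weights = []
    for j in neigh:
        d = max(math.dist(train[j][0], query), 1e-12)  # avoid division by zero
        weights.append(1.0 / d ** (2.0 / (m - 1.0)))
    total = sum(weights)
    scores = {c: sum(w * memberships[j][c] for w, j in zip(weights, neigh)) / total
              for c in classes}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two well-separated clusters.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
classes = ["a", "b"]
memberships = compute_memberships(train, classes, k=2)
label, scores = classify(train, memberships, classes, (0.2, 0.1), k=2)
print(label, scores)
```

Stage 1 runs a kNN search of the training set against itself. This matches the abstract's remark that computing class memberships adds cost on top of plain kNN.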
first_indexed 2025-11-14T19:57:25Z
format Conference or Workshop Item
id nottingham-44937
institution University of Nottingham Malaysia Campus
institution_category Local University
last_indexed 2025-11-14T19:57:25Z
publishDate 2017
recordtype eprints
repository_type Digital Repository
spelling nottingham-44937 2020-05-04T18:55:06Z 2017-07-10 Conference or Workshop Item PeerReviewed Maillo, Jesus, Luengo, Julian, García, Salvador, Herrera, Francisco and Triguero, Isaac (2017) Exact fuzzy k-Nearest neighbor classification for big datasets. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2017), 9-12 Jul 2017, Naples, Italy.
title Exact fuzzy k-Nearest neighbor classification for big datasets