A comparison of clustering algorithms for data anonymization / Zahra Mahmoud

Bibliographic Details
Main Author: Zahra, Mahmoud
Format: Thesis
Published: 2019
Online Access: http://studentsrepo.um.edu.my/10708/
http://studentsrepo.um.edu.my/10708/2/Zahra_Mahmoud.pdf
http://studentsrepo.um.edu.my/10708/1/Zahra_Mahmoud_%E2%80%93_Dissertation.pdf
Description
Summary: Organizations today can easily store massive amounts of data because the cost of storage has fallen sharply over the years. They use this data to raise their brand's value. However, as data becomes easier to store in bulk, the security risk also increases. In the last two years alone, multiple data leaks have been reported, the latest being from the Ministry of Education in Malaysia. Over the years, there has been extensive research on data security. The literature review showed that many studies have employed methods such as data encryption or privacy protection data publishing (PPDP). This thesis focuses on the latter, as data encryption has proven to be more costly. Much of the literature also focused on using generalization and suppression to achieve the required level of anonymity. However, heavily suppressed or generalized data may paint a different picture from the original data. The objective of this thesis is to find a method of data anonymization that is efficient and produces the lowest percentage of information loss. After comparing multiple types of PPDP, the researcher determined that the clustering method is the best fit for this purpose. Next, multiple existing clustering algorithms were compared to determine which has the best performance. The researcher then created an enhanced method for a final comparison: the distance function was manipulated to show how differences in cluster distance can affect the resulting anonymized dataset.
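
Illustrative sketch (not taken from the thesis): the record does not give the algorithms or the distance function that were compared, so the Python below is only a minimal, assumed example of what clustering-based anonymization with an adjustable distance function can look like. The greedy grouping strategy, the per-attribute `weights`, the range-based information-loss measure, and all function names are hypothetical choices made for illustration. Records are assumed to be tuples of numeric quasi-identifiers.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]


def attribute_spans(records: Sequence[Record]) -> List[float]:
    """Width of each attribute over the whole dataset (1.0 if constant)."""
    return [(max(col) - min(col)) or 1.0 for col in zip(*records)]


def weighted_distance(a: Record, b: Record,
                      weights: Sequence[float],
                      spans: Sequence[float]) -> float:
    """Weighted, range-normalized Manhattan distance (assumed metric)."""
    return sum(w * abs(x - y) / s
               for x, y, w, s in zip(a, b, weights, spans))


def greedy_k_clusters(records: Sequence[Record], k: int,
                      weights: Sequence[float]) -> List[List[int]]:
    """Greedily group record indices into clusters of at least k members."""
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records and k >= 1")
    spans = attribute_spans(records)
    remaining = list(range(len(records)))
    clusters: List[List[int]] = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Order the unassigned records by their distance to the seed.
        remaining.sort(key=lambda i: weighted_distance(
            records[seed], records[i], weights, spans))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    # Fewer than k records are left: attach each to the nearest seed.
    for i in remaining:
        nearest = min(clusters, key=lambda c: weighted_distance(
            records[c[0]], records[i], weights, spans))
        nearest.append(i)
    return clusters


def generalize(records: Sequence[Record],
               clusters: Sequence[Sequence[int]]) -> list:
    """Replace each record with its cluster's (min, max) range per attribute."""
    out = [None] * len(records)
    for members in clusters:
        cols = zip(*(records[i] for i in members))
        ranges = tuple((min(c), max(c)) for c in cols)
        for i in members:
            out[i] = ranges
    return out


def information_loss(records: Sequence[Record],
                     clusters: Sequence[Sequence[int]]) -> float:
    """Mean normalized range width per record and attribute, in [0, 1]."""
    spans = attribute_spans(records)
    total = 0.0
    for members in clusters:
        cols = zip(*(records[i] for i in members))
        width = sum((max(c) - min(c)) / s for c, s in zip(cols, spans))
        total += width * len(members)
    return total / (len(records) * len(spans))
```

Under these assumptions, calling `greedy_k_clusters` with different `weights` (for example `(1.0, 1.0)` versus `(5.0, 1.0)` for age and income) and comparing the resulting `information_loss` values is one way to observe how a change in the distance function alters the anonymized output.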