Knowledge Sanitization on the Web
Vasileios Kagklis, University of Patras
Vassilios S. Verykios, Hellenic Open University
Giannis Tzimas, Technological Educational Institute of Western Greece
Athanasios K. Tsakalidis, University of Patras
The ever-increasing volumes of data, amply available from web information systems and flowing in and out of ever larger data centers, have created fertile ground for richer analyses and interrogations through big data analytics techniques, providing informed answers to a wide spectrum of data science and business problems. The ubiquitous application of these techniques does not come without a cost. As more and more key players realize the intricacies of this double-edged sword, they are eager to employ novel techniques to secure not only their data, but also the patterns induced from these data.
Privacy-preserving data mining is the research area that investigates techniques for preserving the privacy of both data and patterns. Knowledge hiding, in particular, seeks to ensure the privacy of the sensitive patterns induced from the data, while keeping the quality of the original data largely unaffected by the perturbation. The process of minimally perturbing or modifying the raw data in order to remove any sensitive knowledge is known as data sanitization. Knowledge hiding encompasses a wide variety of techniques, such as frequent pattern hiding, sequence hiding, classification rule hiding, and data stream hiding.
This tutorial presents recent approaches that sanitize binary databases so that sensitive frequent itemsets can no longer be unearthed by frequent itemset mining algorithms such as Apriori. This problem is known as the frequent itemset hiding problem, and it has been approached by a variety of techniques proposed over the last fifteen years or so. The goal of these techniques is to hide the sensitive frequent itemsets while preserving the non-sensitive ones, by minimally sanitizing the database at hand.
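To make the setting concrete, the following Python sketch enumerates the frequent itemsets of a small binary database in a level-wise, Apriori-style fashion. It is a minimal illustration only: the toy database, the item names, and the support threshold are hypothetical, and the code is not taken from any particular algorithm covered in the tutorial.

    from itertools import combinations

    # Toy transactional (binary) database: each transaction is a set of items.
    DB = [
        {"a", "b", "c"},
        {"a", "b"},
        {"a", "c"},
        {"b", "c"},
        {"a", "b", "c"},
    ]
    MIN_SUP = 3  # absolute support threshold (illustrative)

    def support(itemset, db):
        """Number of transactions that contain every item of the itemset."""
        return sum(1 for t in db if itemset <= t)

    def frequent_itemsets(db, min_sup):
        """Level-wise (Apriori-style) enumeration of frequent itemsets."""
        items = sorted(set().union(*db))
        frequent = {}
        # Start with the frequent 1-itemsets.
        level = [frozenset([i]) for i in items
                 if support(frozenset([i]), db) >= min_sup]
        k = 1
        while level:
            for s in level:
                frequent[s] = support(s, db)
            # Candidate (k+1)-itemsets are unions of frequent k-itemsets.
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            level = [c for c in candidates if support(c, db) >= min_sup]
            k += 1
        return frequent

    if __name__ == "__main__":
        result = frequent_itemsets(DB, MIN_SUP)
        for itemset, sup in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
            print(set(itemset), sup)

With the illustrative threshold of 3, the frequent itemsets of this toy database are {a}, {b}, {c}, {a, b}, {a, c}, and {b, c}. If, say, {a, b} were designated as sensitive, a sanitization algorithm would have to remove items from its supporting transactions so that its support falls below 3, while disturbing the remaining frequent itemsets as little as possible.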
More specifically, in this tutorial we provide a taxonomy of the works presented over the past few years in the area of frequent itemset hiding. The taxonomy comprises several categories: heuristic distortion-based approaches, heuristic blocking approaches, border-based approaches, database reconstruction approaches, inverse frequent itemset mining approaches, and linear programming-based approaches. We also provide representative examples of algorithms from each category to highlight their unique characteristics.
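As a flavor of the first category, the sketch below shows a generic distortion heuristic: while a sensitive itemset remains frequent, one of its items is deleted from one of its supporting transactions. The victim-selection rules (shortest supporting transaction, highest-support item), the function name greedy_hide, and the toy data are all illustrative assumptions rather than a faithful reproduction of any specific published algorithm.

    import copy

    def support(itemset, db):
        return sum(1 for t in db if itemset <= t)

    def greedy_hide(db, sensitive, min_sup):
        """Distortion heuristic: while the sensitive itemset is still frequent,
        delete one of its items from one of its supporting transactions, so
        that the transaction no longer supports it."""
        db = copy.deepcopy(db)
        while support(sensitive, db) >= min_sup:
            # Pick a supporting transaction, e.g. the shortest one, to limit damage.
            victim_tx = min((t for t in db if sensitive <= t), key=len)
            # Pick the item of the sensitive itemset with the highest support.
            victim_item = max(sensitive, key=lambda i: support(frozenset([i]), db))
            victim_tx.discard(victim_item)
        return db

    if __name__ == "__main__":
        DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
        sanitized = greedy_hide(DB, frozenset({"a", "b"}), min_sup=3)
        print(sanitized)
        print(support(frozenset({"a", "b"}), sanitized))  # now 2, i.e. infrequent

On the toy database, a single item deletion suffices to push the support of {a, b} from 3 down to 2; the published heuristics differ mainly in how they select the victim transactions and items so as to minimize the side effects on the non-sensitive itemsets.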
The tutorial focuses on a detailed overview of the linear programming-based approaches. We provide a case study to illustrate the workings of the most important linear programming-based techniques. Lastly, an experimental evaluation of these techniques is conducted in order to compare them quantitatively and qualitatively.
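To give a sense of this family of techniques, the following sketch formulates the exact, transaction-selection part of the hiding task as a small integer program, in the spirit of the Menon et al. formulation cited in the references (it is not their implementation). It assumes the open-source PuLP modelling package is available; the database, the sensitive itemsets, the threshold, and all variable names are illustrative.

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, PULP_CBC_CMD

    DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    SENSITIVE = [frozenset({"a", "b"}), frozenset({"b", "c"})]
    MIN_SUP = 3  # absolute support threshold (illustrative)

    # Only transactions supporting at least one sensitive itemset are relevant.
    supporting = {t_id for t_id, t in enumerate(DB)
                  if any(s <= t for s in SENSITIVE)}

    prob = LpProblem("itemset_hiding", LpMaximize)
    # x[t] = 1 if transaction t is left intact, 0 if it is selected for sanitization.
    x = {t_id: LpVariable(f"x_{t_id}", cat="Binary") for t_id in supporting}

    # Objective: keep as many supporting transactions intact as possible,
    # i.e. sanitize as few transactions as possible.
    prob += lpSum(x.values())

    # Each sensitive itemset must end up with support at most MIN_SUP - 1.
    for s in SENSITIVE:
        prob += lpSum(x[t_id] for t_id in supporting if s <= DB[t_id]) <= MIN_SUP - 1

    prob.solve(PULP_CBC_CMD(msg=False))
    to_sanitize = [t_id for t_id in supporting if x[t_id].value() < 0.5]
    print("transactions selected for sanitization:", sorted(to_sanitize))
    # A subsequent heuristic step removes items of the sensitive itemsets
    # from exactly these transactions.

On this toy instance the optimal solution sanitizes a single transaction (one of the two transactions containing {a, b, c}), after which both sensitive itemsets have support 2. The linear programming-based techniques covered in the tutorial differ mainly in how the objective and constraints are formulated and in how the subsequent item-removal step is carried out.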
References
- A. Gkoulalas-Divanis and V. S. Verykios. Hiding sensitive knowledge without side effects. Knowledge and Information Systems, 20(3):263-299, Aug. 2009.
- V. S. Verykios. Association rule hiding methods. WIREs Data Mining and Knowledge Discovery, 3(1):28-36, 2013.
- S. Menon, S. Sarkar, and S. Mukherjee. Maximizing accuracy of shared databases when concealing sensitive patterns. Information Systems Research, 16(3):256-270, Sept. 2005.
- E. Leloglu, T. Ayav, and B. Ergenc. Coefficient-based exact approach for frequent itemset hiding. In eKNOW 2014: The Sixth International Conference on Information, Process, and Knowledge Management, pages 124-130, 2014.
- G. Lee, Y. C. Chen, S. L. Peng, and J. H. Lin. Solving the sensitive itemset hiding problem whilst minimizing side effects on a sanitized database. Communications in Computer and Information Science, 223:104-113.