Privacy Security in Big Data and Privacy-Preserving Data Mining (PPDM)
Introduction
Big data has become such a prominent concept in recent years that the term is encountered regularly in everyday life. In this introduction, I will first define big data and outline the background to its privacy security problems, then describe and evaluate those problems, and finally introduce and evaluate the proposed goal, Privacy-Preserving Data Mining (PPDM).
Most people now benefit from big data in some way. Consumers can be offered targeted advertising and services through big data techniques, which improves the user experience (Evens and Van Damme, 2016). A more familiar example of these benefits is Apple’s Siri, a voice assistant that responds to users’ simple commands (Gandomi and Haider, 2015). Gandomi and Haider (2015) also describe other interesting applications of big data analysis, such as summarising one or several documents so that readers can obtain key information faster and more easily, or analysing a person’s personality or influence from information in social networks.
Researchers still disagree on a precise definition of big data. A widely accepted definition describes it as information assets of high volume, velocity and variety, whose value can only be found and used through specific technology and analytical methods (De Mauro et al., 2016).
Computer technology has developed rapidly and been widely adopted in recent years, making data easier to collect, store and analyse. The big data market is growing accordingly: it rose dramatically from 7.06 billion US dollars in 2011 to 49 billion in 2019, and it is expected to more than double by 2026 (Liu, 2019).
The rapid development of technology can considerably improve people’s lives. However, alongside its benefits, technology can also cause problems. Boyd and Crawford (2012) describe big data as invading personal privacy and reducing personal freedom. Indeed, people give away personal information through daily activities such as visiting websites, contacting others by phone, or even listening to music (Mai, 2016). Privacy leakage in everyday life therefore seems to be both common and unavoidable.
As Xu et al. (2014) note, the rapid development and wide application of data mining technologies (which find useful information in data) put personal privacy at even greater risk. The further the big data industry develops, the more personal privacy may be endangered.
Bharathi’s (2017) research shows that the top three risk factors brought by big data are data brokering, global exposure to personal data and lack of governance-based security design. With the development of the internet, most information on it is accessible to most users. In addition, data brokers, who collect and sell information about consumers, make the situation even harder to control. At present, secure methods of governance in this field are not well developed. Big data techniques improve quickly, but people’s awareness and ethical norms have not kept pace.
Technological progress should not be held back by these obstacles. In these circumstances, an important task is to find a way to extract the useful information needed from big data without giving away sensitive information, an approach known as Privacy-Preserving Data Mining (PPDM).
PPDM methodologies aim to protect privacy to a certain extent while preserving as much of the data’s value as possible, so that data mining remains effective when applied to the transformed data (Mendes and Vilela, 2017).
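To make the idea of mining transformed data concrete, the following is a minimal sketch of additive noise perturbation, one of the randomisation techniques commonly discussed in the PPDM literature. It is an illustration under stated assumptions rather than the specific method of Mendes and Vilela (2017): the ages are synthetic, and the function name perturb, the noise level sigma and the random seeds are hypothetical choices made for this example. Each individual value is distorted, yet an aggregate such as the mean stays close to the original.

```python
import random
import statistics


def perturb(values, sigma, seed=0):
    """Return a copy of `values` with independent zero-mean Gaussian noise added.

    `sigma` is the standard deviation of the noise (an illustrative parameter).
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Synthetic "sensitive" attribute: 10,000 made-up ages (not real data).
data_rng = random.Random(42)
ages = [data_rng.randint(18, 90) for _ in range(10_000)]

# The data holder releases only the perturbed values.
noisy_ages = perturb(ages, sigma=10.0, seed=1)

# Utility check: how far the mean of the released data is from the true mean.
mean_gap = abs(statistics.mean(ages) - statistics.mean(noisy_ages))

# Privacy check (very rough): how many records moved by more than one year.
changed = sum(1 for a, b in zip(ages, noisy_ages) if abs(a - b) > 1.0)

print(f"gap between true and released mean: {mean_gap:.3f}")
print(f"records shifted by more than one year: {changed} of {len(ages)}")
```

A larger sigma hides individual values better but makes aggregates less accurate, which is the trade-off between privacy and utility that PPDM has to manage.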
Apart from new methods and techniques, several other interesting approaches to the problem of Privacy-Preserving Data Mining have been proposed.
Zaïane (2004) points out that the priority is to establish common definitions of policy and to standardise the basic rules so that the people involved are not confused. For example, it should be settled what kind of information counts as private and what behaviour counts as a privacy leak. Determining uniform standards before the problems become more complex is indeed essential.
Xu et al. (2014) argue that data providers, data collectors, data miners (people who find useful information in data) and decision-makers (who act on the information extracted from the data) have different concerns about privacy security. For example, data providers care about how to control the sensitivity of the information they provide, while data collectors are concerned with how the form of the data can be changed to avoid privacy leakage. What matters is weighing these concerns to find the best solution. Considering more of the roles involved should make the solution more widely accepted, which makes this a notably human-centred perspective.
Various views have been put forward on how to solve the Privacy-Preserving Data Mining (PPDM) problem. Nevertheless, further exploration is still required. Researchers in this field need to concentrate on how to balance the loss of useful information against the leakage of sensitive personal information.
Annotated Bibliography
Bharathi, S.V. (2017) ‘Prioritizing and Ranking the Big Data Information Security Risk Spectrum’, Global Journal of Flexible Systems Management, 18(3), pp. 183-201.
Bharathi’s (2017) paper contributes to the assessment of data security risk, a topic that few studies have addressed. Bharathi lists twenty-five risk factors brought by big data and identifies the top three. The author describes the risks in detail so that readers can understand them clearly. In addition, the paper includes a section in which Bharathi critically evaluates other researchers’ existing work. Through this paper, readers can obtain background information, understand the privacy risks of big data and gain some understanding of the research of other scholars in this field.
Gandomi, A. and Haider, M. (2015) ‘Beyond the hype: Big data concepts, methods, and analytics’, International Journal of Information Management, 35(2), pp. 137-144.
Compared with Bharathi’s (2017) paper, this paper introduces more general concepts. Gandomi and Haider (2015) define big data, covering its origin and detailed features. The researchers then analyse how big data analysis techniques are applied to text, audio, video and social media data, and provide cases and examples of new big data analysis techniques. The paper ends by highlighting future developments in the field. A large number of simple real-life examples are used in the description and explanation, which makes the paper accessible to readers without related background knowledge. The article is peer-reviewed and has been heavily cited. It would be a good choice for readers who are new to this field and need background information.
Mai, J. (2016) ‘Big data privacy: The datafication of personal information’, The Information Society: An International Journal, 32(3), pp. 192-199.
Mai’s (2016) paper is closely related to privacy in the big data era. Compared with Mendes and Vilela’s (2017) research, Mai focuses more on the definition and features of privacy and on the information extracted from big data. This paper is helpful for understanding the definition of privacy, how it varies among individuals and how to distinguish the private from the public. Therefore, after reading this paper, the risks of privacy security problems and the points that need attention in the PPDM process can be better understood.
Mendes, R. and Vilela, J.P. (2017) ‘Privacy-Preserving Data Mining: Methods, Metrics, and Applications’, IEEE Access, 5, pp. 10562-10582.
Mendes and Vilela (2017) introduce PPDM in this paper, using the example of a typical application of PPDM in relevant fields. The researchers also discuss the challenges and problems that PPDM currently faces. The paper, which belongs more to the field of computer science, is quite up to date and offers more specific concepts and opinions. It is better suited to readers who already have background knowledge of big data and privacy security.
Xu, L. et al. (2014) ‘Information Security in Big Data: Privacy and Data Mining’, IEEE Access, 2, pp. 1149-1176.
The contribution of Xu et al.’s (2014) paper is that it provides an interesting new perspective on the PPDM problem. The different concerns of the different roles in the process are considered, and the specific privacy issues and the methods available to protect sensitive information are matched to each role. In addition, Xu et al.’s paper tries to use game theory to find the optimal solution. The paper provides useful insights into this field.
References
Boyd, D. and Crawford, K. (2012) ‘Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon’, Information, Communication & Society, 15(5), pp. 662-679.
De Mauro, A., Greco, M. and Grimaldi, M. (2016) ‘A formal definition of Big Data based on its essential features’, Library Review, 65(3), pp. 122-135.
Evens, T. and Van Damme, K. (2016) ‘Consumers’ willingness to share personal data: Implications for newspapers’ business models’, International Journal on Media Management, 18(1), pp. 25-41.
Liu, S. (2019) Forecast of Big Data market size, based on revenue, from 2011 to 2027 (in billion U.S. dollars). Available at: https://www.statista.com/statistics/254266/global-big-data-market-forecast/ (Accessed: 3 May 2019).
Zaïane, O. R. (2004) ‘Toward Standardization in Privacy-Preserving Data Mining’, Embrapa Informática Agropecuária-Artigo em anais de congresso (ALICE), pp. 7-17.