Personal Data Sensitivity in Japan: An Exploratory Study

Abstract The purpose of this study was to investigate how ordinary Japanese people perceive and understand data sensitivity and sensitive data. Although the concept of sensitive data is described in an article of Japan’s revised personal data act, following the EU Data Protection Directive and the new data protection rule, there has been little research on whether this legally defined concept conforms to the general public’s perception of sensitive data in Japan and, if not, what differences exist between them. Using empirical data acquired through a questionnaire survey and appropriate statistical methods, we sought to clarify empirically the features of data sensitivity as perceived by ordinary Japanese people. This exploratory research revealed that ordinary Japanese tended to feel relatively low sensitivity to personal data related to their civic activities, which are typically mentioned in the official explanation of sensitive data, but they tended to feel a higher degree of sensitivity regarding financial-related personal data, which were not ordinarily considered sensitive data.


Introduction
The IT Strategic Headquarters of the Japanese government formulated a new IT strategy entitled "Declaration to be the World's Most Advanced IT Nation" in June 2013 (revised in June 2014, June 2015, and May 2016). This strategy regards IT as a core component of structural reforms in Abenomics and aims to achieve the effective use of the vast quantities of personal data in various contexts such as medical care, disaster prevention, and stimulation of regional economies and business activities (Ministry of Internal Affairs and Communications, 2013). The increased need for the development of laws related to the use of personal information for business resulted in a revision to Japan's Act on the Protection of Personal Information (APPI; Act No. 57 of 2003), which was passed by the Diet in September 2015.
In the third paragraph of Article two of the revised act, sensitive data were given a unified definition for the first time in Japan. The act defines sensitive data as personal data with which special care must be taken to avoid unjust discrimination, prejudice, or other detriment. This definition also includes a non-exhaustive list of what constitutes sensitive data. Information related to race, creed, social status, medical records, criminal records, and damages suffered in crimes are included in this list.
The concepts of data sensitivity have been discussed actively, mainly in Europe, and it is obvious that the legally defined concept of data sensitivity in Japan was based on the European view, such as the definition of sensitive data in Directive 95/46/EC (Itakura, 2016). However, for ordinary Japanese people, data sensitivity or sensitive data seem to be unfamiliar concepts and, thus, it may be hard for them to understand what types of personal data will be granted this extra protection. Additionally, the perception of, and attitudes toward, the sensitivity of personal data may vary among different cultures. Given that the kind of personal data that leads to discrimination and prejudice will vary from culture to culture, it seems that cultural differences will arise as to what kinds of data should be given special consideration.
Through an analysis of the content of deliberations in the Diet, Itakura (2016) found that the revision of the APPI was undertaken with these requirements in mind and that the introduction of the unified concept of sensitive data was the government's attempt to meet them.
Given this background, it is plausible that ordinary Japanese people are not familiar with the newly introduced concept of sensitive data. Indeed, the sensitivity of data may be a matter of little concern for the general public in Japan, and many of them did not necessarily desire legislation regarding this concept. In fact, the results of a brief survey conducted by the Cabinet Public Relations Office in 2015 showed that many Japanese people did not understand the concept well (Cabinet Public Relations Office, 2015). Additionally, because a focus had been placed on whether Japan's revised data protection law complied with the EU directives and rules at the stage of law design, it is hard to say that sufficient consideration had been given to the characteristics of Japanese society and features of Japanese people's perception of data sensitivity. Even academic research on sensitive data in Japan has focused largely on the positioning and interpretation of sensitive data in the legal system, on the influence of legislation on sensitive data in relation to business activities, and on technologies and skills for protecting sensitive data. Consequently, little attention has been paid to how ordinary Japanese people understand sensitive data.

Overview of the survey
To understand how ordinary Japanese people think about data sensitivity and what type of personal data they want to protect with special care, an online questionnaire survey was conducted in March 2016. The sample size and attributes of the respondents are shown in Table 1. The total valid sample size was 931 and the survey covered a broad range of demographic groups. In each demographic group, about a hundred respondents answered the questionnaire. The questionnaire consisted of three parts. The first part included questions related to certain attributes of the respondents, including gender, age, occupation, highest completed educational level, personal yearly income level, and their level of knowledge about sensitive data or data sensitivity. In the second part, all respondents were asked to read brief explanations (263 words) of sensitive data, written by the authors, based on the relevant paragraphs of Directive 95/46/EC, Japanese Industrial Standards, and the revised APPI. In the final part, respondents were asked to evaluate the degree of sensitivity of the 91 types of personal data listed in Table 2, using a four-point Likert scale. These sensitivity scores, as variables, were subjected to a factor analysis, and the features of ordinary Japanese people's perceptions of data sensitivity are discussed based on the sensitivity scores of the factors extracted.

Procedure and results of the empirical analysis
First, the suitability of the data for factor analysis was checked using two methods: Bartlett's test of sphericity and the KMO index. Bartlett's test of sphericity tests the null hypothesis that the correlation matrix calculated from the data is an identity matrix, in which all the diagonal elements are 1 and all off-diagonal elements are 0; this test should be significant for the data to be considered appropriate for factor analysis (Field, 2013). The results of the test showed that our correlation matrix was significantly different from an identity matrix and that significant correlations were observed between the scores of the items (χ2(4095) = 79676.52, p < 0.01). The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy was also used to check the suitability of the data. This index compares the squared correlations between variables with the squared partial correlations between variables (Field, 2013). According to Kaiser (1974), the recommended level of this index is > 0.5. In the evaluation categories proposed by Hutcheson and Sofroniou (1999), > 0.9 means "marvellous". The calculated KMO index was 0.966, which was sufficient to conclude that the data were adequate for factor analysis. In sum, according to criteria adopted in many previous studies, the test and index indicated that the data were suitable for factor analysis.
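As a rough illustration of the two suitability checks described above, the following sketch computes Bartlett's statistic and the overall KMO index from a correlation matrix in plain Python. It assumes the usual textbook formulas: chi-square = -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, and KMO = (sum of r^2) / (sum of r^2 + sum of q^2) over the off-diagonal correlations r and partial correlations q. The function names and the 3 x 3 example matrix are hypothetical and are not taken from the study. For the 91 items analysed here, the degrees of freedom are 91 x 90 / 2 = 4095, which agrees with the reported test.

```python
import math

def invert_with_logdet(matrix):
    """Gauss-Jordan inversion with partial pivoting; returns (inverse, ln|det|)."""
    n = len(matrix)
    # Augment a copy of the matrix with the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    log_det = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # |det| is the product of the absolute pivots.
        log_det += math.log(abs(aug[col][col]))
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], log_det

def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom) for H0: corr is identity."""
    p = len(corr)
    _, log_det = invert_with_logdet(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * log_det
    return chi2, p * (p - 1) // 2

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    p = len(corr)
    inv, _ = invert_with_logdet(corr)
    r2 = q2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                # Partial correlation obtained from the inverse correlation matrix.
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                r2 += corr[i][j] ** 2
                q2 += partial ** 2
    return r2 / (r2 + q2)

# Hypothetical correlation matrix for three items (illustration only).
R = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
chi2, df = bartlett_sphericity(R, n_obs=100)
print(round(chi2, 2), df, round(kmo(R), 3))
```

The sketch stops at the test statistic; obtaining a p-value would additionally require the chi-square distribution function from a statistics library.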
ORBIT Journal DOI: https://doi.org/10.29297/orbit.v1i2.40
An initial factor analysis was conducted to check the factor loadings of each variable (i.e., the sensitivity scores for each personal data item). In this factor analysis, we applied a principal axis factoring method, promax rotation, and Kaiser's criterion for factor extraction. As a result of the analysis, the following six items had factor loadings below 0.4 on every factor: Educational institution attendance, Academic achievement record, Marital status/history, Employment record, Communication meta-data, and Communication content. After eliminating these items from the original list of personal information, the second factor analysis (principal axis factoring method, promax rotation, and Kaiser's criterion) was applied to the data of the remaining 85 items.
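The two decision rules used in this step can be sketched as follows. This is only an illustration under stated assumptions: Kaiser's criterion is taken in its common form (retain factors with an eigenvalue greater than 1), the 0.4 cut-off is applied to the absolute value of each loading, and the item names, loadings, and eigenvalues are hypothetical rather than values from the study.

```python
def retained_factor_count(eigenvalues):
    """Kaiser's criterion: keep factors whose eigenvalue exceeds 1."""
    return sum(1 for value in eigenvalues if value > 1.0)

def weak_items(pattern, threshold=0.4):
    """Return the items whose largest absolute loading is below the threshold."""
    return [item for item, loadings in pattern.items()
            if max(abs(x) for x in loadings) < threshold]

# Hypothetical loadings of three items on two factors (illustration only).
pattern = {
    "item_a": [0.82, 0.05],
    "item_b": [0.10, -0.67],
    "item_c": [0.31, 0.22],
}
eigenvalues = [3.9, 1.4, 0.8, 0.5]  # hypothetical eigenvalues

print(retained_factor_count(eigenvalues))  # prints 2
print(weak_items(pattern))                 # prints ['item_c']
```

Items returned by `weak_items` would be dropped before the analysis is rerun, mirroring the two-pass procedure described above.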
Finally, 13 factors were extracted. Table 3 is the pattern matrix in which the relationships between personal data items and extracted factors are described by factor loadings, calculated after the promax rotation. Interpreting the meaning of each factor based on the content of high factor-loading data items, we gave a label to each factor. Factor one had strong relevance to various medical and health-related personal data, so it was given the name "Health and well-being". Factor two was named "ID number" because of the high factor loading of the basic pension number, health insurance number, and driving licence number. The third factor had a strong correlation with personal data about political orientation, religious orientation, and sexual orientation, and was given the label "Civics". Factor four was given the label "Social categories", taking into account the high factor loadings of personal data in relation to family register, nationality, and racial/ethnic background. We named the fifth factor "Financial", based on the strong relationships with data items such as income, tax, pension, and bank account balance. Factor six had strong relationships with records of web browsing, search keywords, library borrowing, and shopping, so it was named "Data shadow". The seventh factor was named "Personal ID" because of high factor loadings with ID information such as name, date of birth, and place of birth. Factors eight and ten related highly to biometrics. Factor eight had a strong relationship with data on finger veins, palm veins, iris scans, fingerprints, and DNA. Some of these items are used to identify individuals when they try to use bank/credit cards or to enter a building/room, so we gave the name "Sensitive biometrics" to the eighth factor. Similarly, we named factor ten "Non-sensitive biometrics". Because personal data related to online life such as user ID, user name, password, and e-mail address had high factor loadings on factor nine, it was named "Digital persona". Factor eleven related highly to the data representing networks or relationships in social life and was named "Social relationships". Personal data including records of drug addiction, criminal records, and records of being a victim of crimes showed high factor loadings with factor twelve and we gave it the name "Crimes". The last factor showed strong correlations with home address, mobile/home phone number, and school/place of employment, and we named it "Contact numbers".

Table 3 (bottom rows). Cronbach's α and minimum CI-T correlation for factors 1 to 13
Factor                     1      2      3      4      5      6      7      8      9      10     11     12     13
Cronbach's α               0.974  0.958  0.961  0.949  0.956  0.912  0.852  0.946  0.907  0.934  0.897  0.939  0.907
Minimum CI-T correlation   0.674  0.635  0.743  0.72   0.802  0.659  0.476  0.775  0.664  0.763  0.619  0.825  0.644
Internal consistency and the unidimensionality of each factor were confirmed using Cronbach's α coefficient and the corrected item-total (CI-T) correlation coefficient. These coefficients are shown in the bottom rows of each factor in the pattern matrix. Generally, it is considered that the α coefficient should be > 0.7 for sufficient internal consistency (Leech et al., 2015). The CI-T correlation coefficient shows the correlation between each item included in a set of items that have high factor loadings on a factor and the total score of the items. It is considered that if the correlation is > 0.4, the item will be a good component of the factor, and if all the items related to the factor can be regarded as good components, the factor will have high unidimensionality (Field, 2013). Thus, if the minimum value of CI-T correlation on a factor is > 0.4, it can be considered that the factor has high unidimensionality. As shown in Table 3, the analyses showed that the 13 extracted factors had sufficient internal consistency and high unidimensionality. Based on these results, this 13-factor structure was adopted for further analyses.
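The two reliability measures can be computed as in the minimal sketch below. It assumes the standard formula for Cronbach's α, k/(k - 1) x (1 - sum of item variances / variance of the total score), and takes the corrected item-total correlation to be the Pearson correlation between an item and the sum of the remaining items. The function names and the two-item example are hypothetical; the final check uses the α and minimum CI-T values reported in the bottom rows of Table 3.

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """items: one list of respondent scores per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

def min_corrected_item_total(items):
    """Smallest correlation between an item and the sum of the other items."""
    totals = [sum(scores) for scores in zip(*items)]
    return min(pearson(item, [t - v for t, v in zip(totals, item)])
               for item in items)

def passes_thresholds(alpha, min_cit, alpha_cut=0.7, cit_cut=0.4):
    """Apply the cut-offs cited in the text (alpha > 0.7, CI-T > 0.4)."""
    return alpha > alpha_cut and min_cit > cit_cut

# Hypothetical scores of three respondents on two items (illustration only).
toy_items = [[1, 2, 3], [1, 3, 2]]
print(cronbach_alpha(toy_items), min_corrected_item_total(toy_items))

# Values reported in the bottom rows of Table 3, factors 1 to 13.
alphas = [0.974, 0.958, 0.961, 0.949, 0.956, 0.912, 0.852,
          0.946, 0.907, 0.934, 0.897, 0.939, 0.907]
min_cits = [0.674, 0.635, 0.743, 0.72, 0.802, 0.659, 0.476,
            0.775, 0.664, 0.763, 0.619, 0.825, 0.644]
print(all(passes_thresholds(a, c) for a, c in zip(alphas, min_cits)))
```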
Through this factor analysis, we were able to simplify various personal data into 13 basic categories in accordance with similarities of sensitivity registered by ordinary Japanese people. Of the 13 factors, health and well-being (factor 1), civics (factor 3), social categories (factor 4), and crimes (factor 12) are strongly relevant to the legal concept of sensitive data. Some of the items included in these factors are mentioned or explicitly exemplified as sensitive data in Japan's revised APPI or as special categories of data in Directive 95/46/EC (European Parliament, 1995). In Regulation (EU) 2016/679 (European Parliament, 2016), sensitive biometrics (factor 8) was added into the range of special categories of personal data (depending on the situation, non-sensitive biometrics may also be included). Data shadow (factor 6) had high correlations with data items that seem to indicate people's orientation, beliefs, interests, and needs mentioned in factor 3, Civics. The remaining factors, including ID number (factor 2), financial (factor 5), personal ID (factor 7), digital persona (factor 9), social relationships (factor 11), and contact numbers (factor 13), are personal data categories not normally regarded as sensitive data.
Figure 1 shows how much sensitivity was registered by ordinary Japanese people about these 13 factors. The degree of sensitivity for each factor was measured by the average score of all items included in each factor. As shown in the figure, factors such as ID numbers, financial, sensitive biometrics, and crimes had relatively high mean scores, while the means of civics, data shadow, personal ID, and non-sensitive biometrics were relatively low. The factors in the red box are those mentioned frequently in descriptions relating to sensitive data and those that have particular relevance to the concept. Two features regarding data sensitivity, as perceived by ordinary Japanese people, can be detected from the figure. First, ordinary Japanese people tended not to feel high data sensitivity with regard to some of the factors mentioned specifically in the legally defined concept of sensitive data. The factor of civics had high factor loadings, with data on political orientation, history of exercising political rights, voting history, history of participating in labour activities or collective bargaining, trade union membership, philosophical orientation, religious orientation, and sexual orientation, all of which are frequently used to explain what sensitive data are. However, the ordinary Japanese questioned tended to feel relatively low sensitivity in relation to this factor (M_civ = 1.787). They also felt relatively low sensitivity to the data shadow factor, through which the orientation or activities mentioned in the civics factor seem to be readily inferable and predictable (M_dat = 1.773).
Second, the ordinary Japanese people questioned tended to feel a high degree of sensitivity to financial factors or factors readily associated with economic damage. The financial factor, consisting of personal data on family/personal income, taxes paid, pension received, ownership of shares of companies, real estate owned, and bank account balances, had a relatively high average score for sensitivity (M_fin = 2.423). Additionally, ordinary Japanese tended to regard the ID number factor, consisting of data on basic pension/social security number, health insurance number, taxpayer number, other government ID number, citizen/resident identification number, driving licence number, credit card number, bank account number, and employee/student number, as highly sensitive data (M_IDn = 2.460). Most of such ID numbers are not inherently 'financial' data but, in Japan, as in other countries, they are sometimes abused to commit fraud. Although these two factors received the highest sensitivity scores among the 13 factors, generally, such financial-related factors have not been included in definitions of sensitive data.
Furthermore, to assess whether there were statistically significant differences by gender in the sensitivity to the factors above, an independent samples t-test was conducted.
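For readers unfamiliar with the test, the statistic behind an independent samples t-test can be sketched as below. The source does not state whether equal variances were assumed, so this sketch uses the pooled-variance (Student) form as an assumption; the function name and the scores are hypothetical and purely illustrative.

```python
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    standard_error = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

# Hypothetical factor scores for two groups (illustration only).
group_a = [1, 2, 3]
group_b = [2, 3, 4]
t_value, t_df = pooled_t(group_a, group_b)
print(round(t_value, 4), t_df)
```

Converting the statistic into a significance level would additionally require the t distribution, which a statistics library normally supplies.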
Figure 2 shows the mean scores of the male and female groups and the results of the t-test for each factor. The results of t-tests on the civics and data shadow factors showed no significant difference between males and females. Regarding the ID number and financial factors, females tended to have a higher sensitivity than males; the difference was statistically significant at the 5% significance level for the ID number factor and the 0.1% level for the financial factor. Thus, ordinary Japanese people felt a high degree of sensitivity to these factors, which are not typically mentioned in explanations of sensitive data, and the tendency was stronger in the female group. Figure 3 shows age-dependent differences in the degree of sensitivity to these four factors. The statistical significance of the differences was assessed using one-way ANOVA. Although gender differences were confirmed in the financial-related factors, the results of the ANOVA indicated that the high sensitivity score of these factors tended to be common in all age groups and there was no significant difference between them. However, in the civics and data shadow factors, which showed no difference between genders, there was a significant difference according to age. To assess which pairs of age groups had significant differences, multiple comparisons were conducted as a post hoc analysis after the ANOVA. Regarding the civics factor, there was a significant difference between the 30s, which was the most sensitive group, and the 20s and the over 60 age groups. For the data shadow factor, the age group who felt the lowest sensitivity was the over 60s; there was a significant difference between this group and all other age groups.
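The F statistic of a one-way ANOVA such as the age-group comparison above can be sketched as follows. This is a generic textbook computation, not the authors' procedure; the function name and the group scores are hypothetical, and the post hoc comparisons (Tukey's HSD in the study) are not reproduced here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    scores = [x for group in groups for x in group]
    grand_mean = mean(scores)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical factor scores for three age groups (illustration only).
age_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_value, df_b, df_w = one_way_anova_f(age_groups)
print(f_value, df_b, df_w)
```

As with the t-test, the p-value would come from the F distribution with the two degrees of freedom returned by the function.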

Conclusions
The results of several statistical analyses revealed gaps between ordinary Japanese people's perceptions of data sensitivity and those of law and policy makers. Ordinary people tended not to feel highly sensitive about personal data relating to their civic life, or data shadow that might indicate civic activities, and this tendency was seen commonly in both men and women. Japanese socio-cultural factors may be responsible for this. It is alleged that Japanese people believe in the myth of a "homogeneous society" and thus are indifferent to others' political and religious orientations as well as sexual orientations (Oguma, 1995). Such indifference may be related to the low sensitivity to these factors.
On the other hand, in general, the Japanese tend to regard financial-related data as highly sensitive data. This tendency has been seen in other surveys. For example, Iizuka and Ogawa (2005) investigated the degree of psychological resistance to entering various personal data into a personal computer in public spaces and showed that people felt a high degree of resistance when they entered financial-related data, such as credit card numbers, passwords for credit cards, and bank account information. The translation of "sensitive data" may also have influenced their perceptions. The English word "sensitive" was translated as "YŌ-HAIRYO" in the revised act; this means "necessary (YŌ) to pay special attention to (HAIRYO)". However, ordinary Japanese may not understand "YŌ-HAIRYO" as related to potential damage that could result from unjust discrimination and unfounded prejudices due to civics-related personal data abuse. In contrast, "YŌ-HAIRYO" seems to be readily associated with monetary and financial damage.
This research was an exploratory study on data sensitivity in Japan and is positioned as the first step of our research project. This research will be developed in two directions in the future. To deepen our understanding of Japanese people's recognition of sensitive data, we plan to continue empirical research on Japanese data sensitivity. For example, confirmatory statistical methods, like structural equation modelling, will be used to confirm the relationship between factors and to understand the high-order factor structure of data sensitivity in Japan. Additionally, follow-up interviews will be conducted to explore why the Japanese think the way they do. The research outcome would seem to be an empirical base for effective implementation of the new Japanese legislation. We also plan to conduct an international comparison study on data sensitivity. By conducting surveys on data sensitivity in various countries and regions to clarify local uniqueness of perceptions of sensitive data, and a cross-cultural study to compare such uniqueness, we will attempt to clarify the possibility of establishing a globally acceptable standard for sensitive data protection. The outcome of such a comparative study would seem to be beneficial in the field of data protection because, in the current networked world, personal data are distributed throughout the world, much like a currency.

Cronbach's α: 0.974

Figure 1. Degree of sensitivity of each factor

Figure 2. Gender-based comparison of the degree of sensitivity to the factors

Figure 3. Age-based comparison of the degree of sensitivity to four factors

Table 1. Sample size and attributes of respondents

Table 2. List of personal data items analyzed
Category name: Items of personal data
Basic personal data (number of items = 9): Family name; Given name(s); DOB; Place of birth (town); Home address; Mobile number; Home phone number; School/place of employment; Marital status/history
Minor physical illness; Cancer; Cardiac illness; Cerebral stroke; Venereal disease; Lifestyle-related disease; History of mental illness; History of injury; Physical disability; Mental disability; Degenerative conditions; Long-term infection; Genetic pre-disposition to cancer, heart disease or genetic-linked degenerative conditions
User ID; Password; Handle/nickname/user name; Email address; Contact address in other communication systems; Content of social networking services

Table 3. Pattern Matrix (after promax rotation)
* Based on Levene's test of homogeneity of variances, the test was conducted using the normal ANOVA statistics.
** Tukey's HSD was used for the multiple comparisons.