Big Data, Small Personas: How Algorithms Shape the Demographic Representation of Data-Driven User Segments
Abstract
Algorithmic bias raises the possibility that creating user segments such as personas from data over- or under-represents certain segments (FAIRNESS), fails to represent the diversity of the user population (DIVERSITY), or produces inconsistent results when hyperparameters are changed (CONSISTENCY). Using data on 363M video views from a global news and media organization, we compare personas created from these data with different algorithms. Results indicate that the algorithms fall into two groups: those that generate personas with low diversity–high fairness and those that generate personas with high diversity–low fairness. Algorithms that rank high on diversity tend to rank low on fairness (Spearman's correlation: −0.83). The algorithm that best balances diversity, fairness, and consistency is Spectral Embedding. The results imply that the choice of algorithm is a crucial step in data-driven user segmentation, because the algorithm fundamentally affects the demographic attributes of the generated personas and thus influences how decision makers view the user population. The results have implications for algorithmic bias in user segmentation and for creating user segments that consider not only commercial segmentation criteria but also criteria derived from ethical discussions in the computing community.
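For readers unfamiliar with the statistic quoted in the abstract, the following is a minimal sketch of how a Spearman rank correlation between two per-algorithm scores could be computed. The score lists and the function names `average_ranks` and `spearman` are hypothetical and chosen only for illustration; they are not the article's data, metrics, or implementation.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical scores for six unnamed algorithms (illustrative only,
# not taken from the article).
diversity = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
fairness = [0.1, 0.3, 0.2, 0.6, 0.8, 0.7]

rho = spearman(diversity, fairness)
print(f"Spearman's rho: {rho:.2f}")
```

A strongly negative value means that algorithms ranked high on one criterion tend to be ranked low on the other, which is the kind of trade-off the abstract describes.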
Copyright
Copyright 2022, Mary Ann Liebert, Inc., publishers.
History
Published online: 12 August 2022
Published in print: August 2022
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.