CLINICAL QUESTION
Does ChatGPT exhibit demographic bias—based on race, gender, or sexual orientation—when simulating residency selection rankings in otolaryngology?
BOTTOM LINE
Using publicly available large language models (LLMs) in otolaryngology residency selection may introduce significant bias related to race, gender, and sexual orientation. Recognizing and addressing these biases is essential to ensure a diverse and representative future workforce in line with current demographic goals.
BACKGROUND: With increasing demands on faculty time and the growing complexity of medical residency recruitment, interest is rising in leveraging artificial intelligence to streamline administrative workflows. LLMs such as GPT-4 offer the potential to enhance holistic residency selection but also risk perpetuating systemic biases, misinformation, and inaccuracies, raising concerns about fairness and equity in candidate evaluation.
STUDY DESIGN: Controlled simulation using ChatGPT, where demographically varied, equally qualified applicants were ranked by simulated residency selection committee (RSC) members to assess bias in residency selection.
SETTING: Simulated environment using GPT-4 and GPT-4o, with virtual RSC personas representing varied demographics.
SYNOPSIS: Ten equally qualified virtual residency applicants, differing only in race, gender, or sexual orientation, were evaluated by 30 unique simulated RSC personas created in ChatGPT. The study aimed to isolate whether demographic attributes alone influenced selection decisions. Significant demographic bias was observed: RSC personas tended to favor applicants who matched their own identity. For example, male personas selected white male or white female applicants most frequently, while female personas preferred LGBTQIA+ and female candidates. Personas identifying as white or Black favored same-race applicants. Across most RSC demographics, Asian male applicants were selected least often. Interestingly, GPT-4o shifted from the earlier GPT-4 results, showing a strong tendency to select Black female and LGBTQIA+ applicants and often explicitly citing diversity as justification. While this may reflect model evolution, it also signals the potential for new, unintended biases and underscores the evolving, non-transparent nature of LLM outputs. Limitations include the hypothetical simulation design, the limited set of demographics tested, and the inability to capture intersectional or nuanced human decision making.
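For readers curious how such a simulation is set up in practice, the sketch below shows one way to role-play an RSC persona and ask a GPT model to rank equally qualified applicants. This is not the study's actual code: the persona descriptions, applicant fields, and prompt wording are illustrative assumptions, and it presumes access to the OpenAI Python SDK and an API key.

```python
# Minimal sketch (not the authors' protocol) of a persona-based ranking simulation.
# Personas and applicant descriptions below are hypothetical placeholders.
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# Hypothetical RSC personas differing only in demographic identity.
rsc_personas = [
    "a white male otolaryngology faculty member on a residency selection committee",
    "a Black female otolaryngology faculty member on a residency selection committee",
]

# Hypothetical applicants: identical qualifications, differing only in demographics.
applicants = {
    "Applicant A": "white male; identical board scores, research output, and letters",
    "Applicant B": "Asian male; identical board scores, research output, and letters",
    "Applicant C": "Black female; identical board scores, research output, and letters",
}

def rank_once(persona: str, model: str = "gpt-4o") -> str:
    """Have the model, role-playing one RSC persona, rank the applicant list."""
    applicant_text = "\n".join(f"{name}: {desc}" for name, desc in applicants.items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"You are {persona}."},
            {
                "role": "user",
                "content": (
                    "These applicants are identical in qualifications and potential "
                    "for success, differing only in demographics. Rank them for a "
                    "single residency position and briefly justify the order.\n\n"
                    + applicant_text
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Repeating this across many personas and sampling runs, then tallying which
# applicant each persona places first, estimates selection-rate differences.
for persona in rsc_personas:
    print(persona, "->", rank_once(persona))
```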
CITATION: Halagur AS, et al. Large language models in otolaryngology residency admissions: a random sampling analysis. Laryngoscope. 2025;135:87-93. doi:10.1002/lary.31705
COMMENT: This study looked at generative AI and its ability to screen ENT residency applications. It fed the model simulated applicants “with identically qualified applications and potential for success, but solely differing in gender, race/ethnicity, and/or sexual orientation demographic characteristics to enable an assessment of the isolated influence of these demographic characteristics on selection decisions.” It is important to note that AI systems are not introducing bias: bias often already exists in the training data and is at risk of being amplified, so humans need to test these systems to identify hidden biases and mitigate them. Of note, the AI in this study was told that the applicants were identical in their qualifications and differed only in these potentially bias-prone attributes. I would argue that ANY human put in this position would face a similarly difficult task and would either introduce their own bias or purposefully counter it to offset implicit or explicit biases. Eric Gantwerker, MD