Shomir Wilson
Verified · Pennsylvania State University · Social Data Analytics
Active 2004–2026
About
Shomir Wilson is an Assistant Professor in the College of Information Sciences and Technology at Penn State, where he leads the Human Language Technologies Lab. He is also a Faculty Affiliate of Penn State's Institute for CyberScience and a member of the faculty of the J. Jeffrey and Ann Marie Fox Graduate School. His research spans natural language processing, privacy, and artificial intelligence, and he participates in the Usable Privacy Policy Project. Prior to his current position, he was an Assistant Professor in the EECS Department at the University of Cincinnati from 2016 to 2018. He has also held positions as a postdoctoral researcher and lecturer at Carnegie Mellon University's School of Computer Science and was an NSF International Research Fellow at the University of Edinburgh's School of Informatics. He earned his PhD in Computer Science from the University of Maryland in 2011.
Research topics
- Computer science
- Internet privacy
- Artificial intelligence
- Natural language processing
- Data science
Selected publications
Rethinking How We Discuss the Guidance of Student Researchers in Computing
2026-02-13
article · Open access · 1st author · Corresponding
Computing faculty at research universities are often expected to guide the work of undergraduate and graduate student researchers. This guidance is typically called advising or mentoring, but these terms belie the complexity of the relationship, which includes several related but distinct roles. I examine the guidance of student researchers in computing (abbreviated to research guidance or guidance throughout) within a facet framework, creating an inventory of roles that faculty members can hold. By expanding and disambiguating the language of guidance, this approach reveals the full breadth of faculty responsibilities toward student researchers, and it facilitates discussing conflicts between those responsibilities. Additionally, the facet framework permits greater flexibility for students seeking guidance, allowing them a robust support network without implying inadequacy in an individual faculty member's skills. I further argue that an over-reliance on singular terms like advising or mentoring for the guidance of student researchers obscures the full scope of faculty responsibilities and interferes with improvement of those as skills. Finally, I provide suggestions for how the facet framework can be utilized by faculty and institutions, and how it can be shared with students for their benefit.
A Tale of Two Identities: An Ethical Audit of AI-Crafted Synthetic Personas
Proceedings of the AAAI Conference on Artificial Intelligence · 2026-03-14
article · Open access · Senior author
As LLMs (large language models) are increasingly used to generate synthetic personas, particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek v2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed-methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1,512 LLM-generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic.
The Hidden Curriculum of Faculty Careers in Computing
2026-02-13
article · Open access · 1st author · Corresponding
A subset of students in computing Ph.D. programs aspire to become research-active faculty at universities. However, options are typically limited for students to learn about the work of faculty beyond the activities students personally witness. Additionally, while orientations and workshops are available to new faculty, these resources are ephemeral compared to the extended, years-long needs of junior faculty to assimilate professorial knowledge. The norms of independence and self-management distinguish university faculty from other career paths, and they further motivate creating supportive resources. I introduce a set of guides to help aspiring and new computing faculty learn the unwritten norms of the profession. I focus on creating materials to support pre-tenure faculty in positions with significant expectations for research, teaching, and service, motivated by the complexity of balancing those obligations. These materials target faculty in the United States, matching my positionality, although I write them to be as broadly applicable as possible. While similar materials exist, these guides contrast with others through their foci on day-to-day activities, sustainable effort, and understanding the significance of one's work. I describe three guides from the full set, which is available to the public online at https://shomir.net/advice.html. These three selections consist of a glossary of faculty terminology, a case study of a faculty member's experiences with grant proposals, and an experiential guide to tenure. Finally, I provide recommendations for others who are interested in creating similar materials to expand the available support for pre-tenure faculty.
SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification
2025-08-27
article · Open access · Senior author
One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
Rethinking How We Discuss the Guidance of Student Researchers in Computing
ArXiv.org · 2025-10-10
preprint · Open access · 1st author · Corresponding
Computing faculty at research universities are often expected to guide the work of undergraduate and graduate student researchers. This guidance is typically called advising or mentoring, but these terms belie the complexity of the relationship, which includes several related but distinct roles. I examine the guidance of student researchers in computing (abbreviated to research guidance or guidance throughout) within a facet framework, creating an inventory of roles that faculty members can hold. By expanding and disambiguating the language of guidance, this approach reveals the full breadth of faculty responsibilities toward student researchers, and it facilitates discussing conflicts between those responsibilities. Additionally, the facet framework permits greater flexibility for students seeking guidance, allowing them a robust support network without implying inadequacy in an individual faculty member's skills. I further argue that an over-reliance on singular terms like advising or mentoring for the guidance of student researchers obscures the full scope of faculty responsibilities and interferes with improvement of those as skills. Finally, I provide suggestions for how the facet framework can be utilized by faculty and institutions, and how parts of it can be discussed with students for their benefit.
Can Third-parties Read Our Emotions?
ArXiv.org · 2025-04-25
preprint · Open access
Natural Language Processing tasks that aim to infer an author's private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors' private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations, whether provided by human annotators or large language models (LLMs), in faithfully representing authors' private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance, while incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs' performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors' private states.
ArXiv.org · 2025-07-02
preprint · Open access · 1st author · Corresponding
Recent developments in large language models (LLMs) have been accompanied by rapidly growing public interest in natural language processing (NLP). This attention is reflected by major news venues, which sometimes invite NLP researchers to share their knowledge and views with a wide audience. Recognizing the opportunities of the present, for both the research field and for individual researchers, this paper shares recommendations for communicating with a general audience about the capabilities and limitations of NLP. These recommendations cover three themes: vague terminology as an obstacle to public understanding, unreasonable expectations as obstacles to sustainable growth, and ethical failures as obstacles to continued support. Published NLP research and popular news coverage are cited to illustrate these themes with examples. The recommendations promote effective, transparent communication with the general public about NLP, in order to strengthen public understanding and encourage support for research.
Can Third Parties Read Our Emotions?
2025-01-01 · 2 citations
article
Sociodemographic Bias in Language Models: A Survey and Forward Path
2024-01-01 · 7 citations
article · Open access
Sociodemographic bias in language models (LMs) has the potential for harm when deployed in real-world settings. This paper presents a comprehensive survey of the past decade of research on sociodemographic bias in LMs, organized into a typology that facilitates examining the different aims: types of bias, quantifying bias, and debiasing techniques. We track the evolution of the latter two questions, then identify current trends and their limitations, as well as emerging techniques. To guide future research towards more effective and reliable solutions, and to help authors situate their work within this broad landscape, we conclude with a checklist of open questions.
Race and Privacy in Broadcast Police Communications
arXiv (Cornell University) · 2024-07-01
preprint · Open access · Senior author
Radios are essential for the operations of modern police departments, and they function as both a collaborative communication technology and a sociotechnical system. However, little prior research has examined their usage or their connections to individual privacy and the role of race in policing, two growing topics of concern in the US. As a case study, we examine the Chicago Police Department's (CPD's) use of broadcast police communications (BPC) to coordinate the activity of law enforcement officers (LEOs) in the city. From a recently assembled archive of 80,775 hours of BPC associated with CPD operations, we analyze text transcripts of radio transmissions broadcast 9:00 AM to 5:00 PM on August 10th, 2018 in one majority Black, one majority white, and one majority Hispanic area of the city (24 hours of audio) to explore four research questions: (1) Do BPC reflect reported racial disparities in policing? (2) How and when are gender, race/ethnicity, and age mentioned in BPC? (3) To what extent do BPC include sensitive information, and who is put at most risk by this practice? (4) To what extent can large language models (LLMs) heighten this risk? We explore the vocabulary and speech acts used by police in BPC, comparing mentions of personal characteristics to local demographics, the personal information shared over BPC, and the privacy concerns that it poses. Analysis indicates (a) policing professionals in the city of Chicago exhibit disproportionate attention to Black members of the public regardless of context, (b) sociodemographic characteristics like gender, race/ethnicity, and age are primarily mentioned in BPC about event information, and (c) disproportionate attention introduces disproportionate privacy risks for Black members of the public.
Frequent coauthors
- Norman Sadeh, Carnegie Mellon University (26 shared)
- Pranav Narayanan Venkit (25 shared)
- Florian Schaub, University of Michigan–Ann Arbor (17 shared)
- Mukund Srinath (16 shared)
- C. Lee Giles (9 shared)
- Sanjana Gautam, Pennsylvania State University (8 shared)
- Frank Stein (8 shared)
- Thomas Norton, Fordham University (7 shared)
Labs
Social Data Analytics · PI
Education
- 2011: Ph.D., Computer Science, University of Maryland
- 2008: M.S., Computer Science, University of Maryland
- 2005: B.S., Mathematics, Virginia Tech
- 2005: B.A., Philosophy, Virginia Tech
- 2005: B.S., Computer Science, Virginia Tech