
Santiago Olivella
Associate Professor, Political Science · University of North Carolina at Chapel Hill · Political Science
Active 1974–2025
About
Santiago Olivella is an associate professor of Political Science and Data Science and Society at the University of North Carolina, Chapel Hill. His research focuses on the use and development of novel computational methods for quantitative political research, particularly Bayesian and Machine Learning probabilistic models. He studies the inferential analysis of networks, the measurement of latent traits, and the political consequences of electoral and legislative institutions. Olivella has held various academic positions, including assistant and associate professorships at UNC, visiting roles at Harvard and Princeton, and an assistant professorship at the University of Miami. His educational background includes a Ph.D. in Political Science from Washington University in St. Louis, along with master's and bachelor's degrees in Political Science from the same university and from Universidad de los Andes in Bogotá, Colombia.
Research topics
- Computer Science
- Sociology
- Artificial Intelligence
- Geography
- Gender studies
- Statistics
- Machine Learning
- Mathematics
- Demography
- Cartography
- Psychology
- Anthropology
- Mathematics education
Selected publications
A Statistical Model of Bipartite Networks: Application to Cosponsorship in the United States Senate
Political Analysis · 2025-09-17 · 1 citations
article · Open access
Many networks in political and social research are bipartite, connecting two distinct node types. A common example is cosponsorship networks, where legislators are linked through the bills they support. However, most bipartite network analyses in political science rely on statistical models fitted to a “projected” unipartite network. This approach can lead to aggregation bias and an artificially high degree of clustering, invalidating the study of group roles in network formation. To address these issues, we develop a statistical model of bipartite networks theorized to arise from group interactions, extending the mixed-membership stochastic blockmodel. Our model identifies groups within each node type that exhibit common edge formation patterns and incorporates node and dyad-level covariates as predictors of group membership and observed dyadic relations. We derive an efficient computational algorithm to fit the model and apply it to cosponsorship data from the United States Senate. We show that senators who were perfectly split along party lines remained productive and passed major legislation by forming non-partisan, power-brokering coalitions that found common ground through low-stakes bills. We also find evidence of reciprocity norms and policy expertise impacting cosponsorships. An open-source software package is available for researchers to replicate these insights.
A Statistical Model of Bipartite Networks: Application to Cosponsorship in the United States Senate
arXiv (Cornell University) · 2023-05-10
preprint · Open access
Many networks in political and social research are bipartite, with edges connecting exclusively across two distinct types of nodes. A common example is cosponsorship networks, in which legislators are connected indirectly through the bills they support. Yet most existing network models are designed for unipartite networks, where edges can arise between any pair of nodes. Using a unipartite network model to analyze bipartite networks, as is often done in practice, can result in aggregation bias and artificially high clustering -- a particularly insidious problem when studying the role groups play in network formation. To address these methodological problems, we develop a statistical model of bipartite networks theorized to be generated through group interactions by extending the popular mixed-membership stochastic blockmodel. Our model allows researchers to identify the groups of nodes, within each node type in the bipartite structure, that share common patterns of edge formation. The model also incorporates both node and dyad-level covariates as predictors of group membership and of observed dyadic relations. We develop an efficient computational algorithm for fitting the model, and apply it to cosponsorship data from the United States Senate. We show that legislators in a Senate that was perfectly split along party lines were able to remain productive and pass major legislation by forming non-partisan, power-brokering coalitions that found common ground through their collaboration on low-stakes bills. We also find evidence for norms of reciprocity, and uncover the substantial role played by policy expertise in the formation of cosponsorships between senators and legislation. We make an open-source software package available that makes it possible for other researchers to uncover similar insights from bipartite networks.
PAN volume 31 issue 4 Cover and Back matter
Political Analysis · 2023-09-12
paratext · Open access
Race and ethnicity data for first, middle, and surnames
Scientific Data · 2023 · 21 citations
- Sociology
- Geography
- Demography
We provide the largest compiled publicly available dictionaries of first, middle, and surnames for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six U.S. Southern States that collect self-reported racial data upon voter registration. Our data cover the racial make-up of a larger set of names than any comparable dataset, containing 136 thousand first names, 125 thousand middle names, and 338 thousand surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups - White, Black, Hispanic, Asian, and Other - and racial/ethnic probabilities by name are provided for every name in each dictionary. We provide both probabilities of the form ℙ(race|name) and ℙ(name|race), and conditions under which they can be assumed to be representative of a given target population. These conditional probabilities can then be deployed for imputation in a data analytic task for which self-reported racial and ethnic data is not available.
Name Dictionaries for "wru" R Package
Harvard Dataverse · 2022-08-01 · 1 citations
dataset · Open access
We provide datasets that estimate the racial distributions associated with first, middle, and last names in the United States. The datasets cover five racial categories: White, Black, Hispanic, Asian, and Other. The provided data are computed from the voter files of six Southern states -- Alabama, Florida, Georgia, Louisiana, North Carolina, and South Carolina -- that collect race and ethnicity data upon registration. We include seven voter files per state, sourced between 2018 and 2021 from L2, Inc. Together, these states have approximately 36 million individuals who provide self-reported race and ethnicity.
The last name dataset includes 338K surnames, the middle name dataset contains 126K middle names, and the first name dataset includes 136K first names. For each type of name, we provide a dataset of P(race | name) probabilities and P(name | race) probabilities. We include only names that appear at least 25 times across the 42 (= 7 voter files * 6 states) voter files in our dataset.
These data are closely related to the dataset "Name Dictionaries for "wru" R Package", https://doi.org/10.7910/DVN/7TRYAC. These are the probabilities used in the latest iteration of the "WRU" package (Khanna et al., 2022) to make probabilistic predictions about the race of individuals, given their names and geolocations.
Figshare · 2022-01-01
dataset · Open access · 1st author · Corresponding
The decision to engage in military conflict is shaped by many factors, including state- and dyad-level characteristics as well as the state’s membership in geopolitical coalitions. Supporters of the democratic peace theory, for example, hypothesize that the community of democratic states is less likely to wage war with each other. Such theories explain the ways in which nodal and dyadic characteristics affect the evolution of conflict patterns over time via their effects on group memberships. To test these arguments, we develop a dynamic model of network data by combining a hidden Markov model with a mixed-membership stochastic blockmodel that identifies latent groups underlying the network structure. Unlike existing models, we incorporate covariates that predict dynamic node memberships in latent groups as well as the direct formation of edges between dyads. While prior substantive research often assumes the decision to engage in international militarized conflict is independent across states and static over time, we demonstrate that conflict is driven by states’ evolving membership in geopolitical blocs. Our analysis of militarized disputes from 1816 to 2010 identifies two distinct blocs of democratic states, only one of which exhibits unusually low rates of conflict. Changes in monadic covariates like democracy shift states between coalitions, making some states more pacific but others more belligerent. Supplementary materials for this article are available online.
arXiv (Cornell University) · 2022-05-12 · 8 citations
preprint · Open access
Predicting individuals' race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparity in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files only include frequent names, many surnames -- especially those of minorities -- are missing from the list. To address the zero counts problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naive Bayesian inference of the BISG methodology to full posterior inference. To address the missing surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with additional name data, is available via the open-source software WRU.
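The Bayes' rule combination that the abstract attributes to BISG can be illustrated with a minimal Python sketch. It assumes the usual simplification that surname and location are independent given race, so the posterior is proportional to P(race | surname) times P(location | race). The function name `bisg_posterior` and all numbers are hypothetical, chosen for illustration; this is not the WRU package's interface.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography information with Bayes' rule.

    Assumption (not stated in the abstract): surname and location are
    independent given race, so
    P(race | surname, geo) is proportional to P(race | surname) * P(geo | race).
    """
    unnormalized = [s * g for s, g in zip(p_race_given_surname, p_geo_given_race)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("all groups have zero probability; cannot normalize")
    return [u / total for u in unnormalized]


# Hypothetical inputs for two groups, invented for illustration only.
surname_probs = [0.5, 0.5]    # P(race | surname)
geo_likelihood = [0.2, 0.0]   # P(geo | race); the second group has a zero count
posterior = bisg_posterior(surname_probs, geo_likelihood)
print(posterior)
```

In this toy case the second group receives zero posterior probability no matter what the surname suggests, which is the zero-count problem the abstract describes.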
Harvard Dataverse · 2022-11-14
dataset · Open access · Senior author
A common approach when studying the quality of representation involves comparing the latent preferences of voters and legislators, commonly obtained by fitting an item-response theory (IRT) model to a common set of stimuli. Despite being exposed to the same stimuli, voters and legislators may not share a common understanding of how these stimuli map onto their latent preferences, leading to differential item-functioning (DIF) and incomparability of estimates. We explore the presence of DIF and incomparability of latent preferences obtained through IRT models by re-analyzing an influential survey data set, where survey respondents expressed their preferences on roll call votes that U.S. legislators had previously voted on. To do so, we propose defining a Dirichlet Process prior over item-response functions in standard IRT models. In contrast to typical multi-step approaches to detecting DIF, our strategy allows researchers to fit a single model, automatically identifying incomparable sub-groups with different mappings from latent traits onto observed responses. We find that although there is a group of voters whose estimated positions can be safely compared to those of legislators, a sizeable share of surveyed voters understand stimuli in fundamentally different ways. Ignoring these issues can lead to incorrect conclusions about the quality of representation.
Race and ethnicity data for first, middle, and last names
arXiv (Cornell University) · 2022 · 1 citations
- Computer Science
- Artificial Intelligence
- Sociology
We provide the largest compiled publicly available dictionaries of first, middle, and last names for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration. Our data cover a much larger scope of names than any comparable dataset, containing roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups -- White, Black, Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for every name in each dictionary. Counts can then be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race. These conditional probabilities can then be deployed for imputation in a data analytic task for which ground truth racial and ethnic data is not available.
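The row-wise and column-wise normalization mentioned in the abstract can be sketched in a few lines of Python. The names and counts below are toy values invented for illustration, not entries from the dictionaries, and the helper names `race_given_name` and `name_given_race` are hypothetical.

```python
RACES = ["White", "Black", "Hispanic", "Asian", "Other"]

# Toy counts by name and race (columns follow RACES); invented for illustration.
counts = {
    "GARCIA": [10, 5, 80, 3, 2],
    "SMITH": [70, 25, 2, 1, 2],
}


def race_given_name(counts):
    """Row-wise normalization: each name's counts become P(race | name)."""
    result = {}
    for name, row in counts.items():
        total = sum(row)
        result[name] = [c / total for c in row]
    return result


def name_given_race(counts):
    """Column-wise normalization: each race's counts become P(name | race)."""
    col_totals = [sum(row[j] for row in counts.values()) for j in range(len(RACES))]
    return {
        name: [c / t if t else 0.0 for c, t in zip(row, col_totals)]
        for name, row in counts.items()
    }


p_race = race_given_name(counts)   # rows sum to 1
p_name = name_given_race(counts)   # columns sum to 1
print(p_race["GARCIA"][RACES.index("Hispanic")])
print(p_name["GARCIA"][RACES.index("White")])
```

Which direction to normalize depends on the downstream task: P(race | name) can be used directly as a name-based prediction, while P(name | race) is the form that enters a Bayes' rule update alongside other evidence.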
arXiv (Cornell University) · 2022-05-12
preprint · Open access · Senior author
A common approach when studying the quality of representation involves comparing the latent preferences of voters and legislators, commonly obtained by fitting an item-response theory (IRT) model to a common set of stimuli. Despite being exposed to the same stimuli, voters and legislators may not share a common understanding of how these stimuli map onto their latent preferences, leading to differential item-functioning (DIF) and incomparability of estimates. We explore the presence of DIF and incomparability of latent preferences obtained through IRT models by re-analyzing an influential survey data set, where survey respondents expressed their preferences on roll call votes that U.S. legislators had previously voted on. To do so, we propose defining a Dirichlet Process prior over item-response functions in standard IRT models. In contrast to typical multi-step approaches to detecting DIF, our strategy allows researchers to fit a single model, automatically identifying incomparable sub-groups with different mappings from latent traits onto observed responses. We find that although there is a group of voters whose estimated positions can be safely compared to those of legislators, a sizeable share of surveyed voters understand stimuli in fundamentally different ways. Ignoring these issues can lead to incorrect conclusions about the quality of representation.
Frequent coauthors
- Albert Solé · 118 shared
- Josep M. Anglada · Institute of Advanced Chemistry of Catalonia · 29 shared
- Antoni Riéra · Universitat de Barcelona · 27 shared
- Josep María Bofill · Universitat de Barcelona · 22 shared
- Kosuke Imai · Harvard University · 17 shared
- Xavier Verdaguer · Institute for Research in Biomedicine · 15 shared
- Agustí Lledó · 15 shared
- R. A. Abramovitch · 14 shared