SUMMER 2025 PROJECTS | Institute for Social Research and Data Innovation

Documenting Geography Variables in the IPUMS USA Full Count datasets

IPUMS USA disseminates full count microdata from the 1850-1950 decennial censuses. This massive dataset of 800 million individual records constitutes our richest source of quantitative information about residents of the United States. The geography variables in this dataset are especially important because they allow scientists to precisely locate individuals and spatially associate them with other geographic features, enabling fine-grained spatial analysis of the full population across a long period of U.S. history. For the 2025 Summer Diversity Fellowship, we seek motivated students to help us (1) create documentation for the IPUMS USA website that describes the different geography variables in the full count datasets and (2) develop a “Use It for Good” blog post that describes census enumeration districts (the smallest geographic unit identified in the publicly available full count data) and the various GIS-based resources that have been developed to visualize and map enumeration district data.

Required Qualifications:

Excellent communication skills

Preferred Qualifications*:

GIS or mapping skills
Data processing and/or statistical analysis

Scheduling Requirements: Fellows will be expected to work fully in-office for the first few weeks of the fellowship, with the potential to work remotely from the midpoint until the end. Fellows are required to attend weekly cohort meetings in-person at 50 Willey Hall.

Developing a bioinformatic Data Lineage and Meta-data Tracking system

Dr. Steve Johnson & Dr. Tim Meyer

Fellows will work with Dr. Johnson and Mr. Meyer to identify, assess, and implement software to track the data lineage and meta-data produced during complex data transformation pipelines. Many bioinformatic processes require multiple steps and produce many intermediate files and datasets whose meta-data depend on how the data was produced or transformed. Tracking this information manually is error-prone and time-intensive. Software packages exist for tracking data lineage (e.g., DataHub, Amundsen, OpenMetadata), but these packages need to be adapted for bioinformatics pipelines. For this project the team will develop requirements, identify software, document required changes, and implement and test the software for use in the SenNet Tissue Mapping Center (https://med.umn.edu/news/u-m-help-lead-national-network-map-rare-cells-implicated-human-health-and-disease). The goal will be to generalize this to other projects across the University. Accurate tracking of data lineage is critical for analysis and replicability of studies.

Preferred Qualifications*:

Experience with databases and SQL
Knowledge of shell scripting languages, e.g., Linux Bash
Some programming experience in Python or Java
Knowledge or interest in bioinformatics may also be helpful

Scheduling Requirements: This project is eligible for hybrid work arrangements. Fellows are required to attend weekly cohort meetings in-person at 50 Willey Hall.

Racial Socialization Messages in Ethnically Diverse Black Families

Dr. J’Mag Karbeah and Dr. Alexandra VanBergen

Fellows will work with Dr. J’Mag Karbeah and Dr. Alexandra VanBergen to explore and examine the messages that ethnically diverse Black youth receive from their parents about their cultural group identity, values, and beliefs. During this project, fellows will help conduct focus group interviews with parents and youth from various cultural backgrounds to better understand how ethnic-racial socialization messages shape how young people prepare for potentially discriminatory experiences and the potential mental health impact of these experiences. In addition to conducting interviews, fellows will learn how to analyze qualitative research.

Preferred Qualifications*:

Experience working with culturally, ethnically, and socioeconomically diverse populations
Previous experience working with ethnically diverse Black youth
An interest and/or experience in qualitative research

Scheduling Requirements: Fellows will be required to be on campus for a weekly in-person project meeting. Fellows are welcome to work in-person or remotely for the rest of their time. Due to the nature of data collection, this project will require fellows to attend focus group sessions in person. Fellows are also required to attend weekly cohort meetings in-person at 50 Willey Hall.

Using Machine Learning to Identify Key Social Determinants in Adolescence of Educational Attainment

Dr. Xiaoran Sun and Dr. Juan Del Toro

Fellows will work with Dr. Xiaoran Sun and Dr. Juan Del Toro to identify key social determinants of educational attainment across different contexts (e.g., family, school, peers, neighborhood) using the National Longitudinal Study of Adolescent to Adult Health (Add Health) dataset. Specifically, we will use a machine learning approach to train predictive models of educational attainment based on a comprehensive set of predictors identified across 200 prior studies. The main aims are (1) to identify which social determinants in adolescence are most predictive of educational attainment among White, Black, Latinx, and Asian adults, (2) to interpret prediction patterns of the key determinants within each racial/ethnic group, and (3) to reveal racial/ethnic differences in the prediction models. Fellows will take part in the literature review, data preparation for analysis, and the training and testing of machine learning models, as well as in drafting the manuscript for publication.

Required Qualifications:

Familiarity with R or Python

Preferred Qualifications*:

Interest in learning to conduct machine learning analysis is required.
Prior experiences with analysis of large longitudinal datasets are preferred.

Scheduling Requirements: This project is eligible for hybrid work arrangements. Fellows are required to attend weekly cohort meetings in-person at 50 Willey Hall.

Evaluating the Performance of Causal Discovery Algorithms

Dr. Erich Kummerfeld, Dr. Sisi Ma, and Dr. Bryan Andrews

Fellows will work with Dr. Kummerfeld, Dr. Ma, and Dr. Andrews to design, develop, and run simulation studies to evaluate the performance of multiple causal discovery algorithms under different conditions. Causal discovery is a new and growing research area that develops and studies algorithms capable of inferring the causal structure of real world phenomena from measurements of the relevant variables (i.e., data). Causal discovery has the potential to revolutionize how scientists learn about causal mechanisms in domains that range from gene regulatory networks to climate change. However, the performance and properties of these algorithms are not well understood. For this project the team will identify 2-3 important questions about the performance of causal discovery algorithms and implement simulation studies to answer those questions. The goal is to inform best practices for researchers who wish to use causal discovery algorithms to investigate real world research topics, as well as to point method development researchers toward new directions for developing causal discovery algorithms with improved performance and reliability.

Preferred Qualifications*:

Experience with Python, R, or Java
Knowledge of introductory statistics concepts like probability, correlation, distribution, and likelihood
Knowledge or interest in machine learning or artificial intelligence may also be helpful

Scheduling Requirements: This project is eligible for hybrid work arrangements. Fellows are required to attend weekly cohort meetings in-person at 50 Willey Hall.

*We encourage people to apply even if they don’t meet any of the preferred qualifications. If you meet any of the preferred qualifications, please clearly indicate this in your application materials.