Zhanna Kaufman
PhD Student
I am a fifth-year PhD student at UMass Amherst, working with Dr. Yuriy Brun in the Laboratory for Advanced Software Engineering Research (LASER). My research focuses on trust in software systems and spans three areas. The first is how people perceive and respond to machine learning (ML) models depending on model characteristics, particularly model bias and interaction behavior. As part of this work, I study how best to give people an accurate understanding of model behavior, whether by presenting explanations of a model's underlying operation or by facilitating structured interactions.
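
As a purely illustrative sketch of what such an explanation can look like, the snippet below uses the SHAP library to attribute a single model prediction to its input features. The dataset, model, and code are placeholders chosen for the example, not artifacts of my studies.

```python
# Minimal sketch of a feature-attribution explanation for one prediction.
# The dataset and model are illustrative placeholders, not from any study here.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
import shap

data = load_diabetes()
model = RandomForestRegressor(random_state=0).fit(data.data, data.target)

# TreeExplainer computes Shapley-value attributions for tree ensembles.
explainer = shap.TreeExplainer(model)
attributions = explainer.shap_values(data.data[:1])  # shape: (1, n_features)

# Show how much each feature pushed this one prediction up or down.
for name, value in zip(data.feature_names, attributions[0]):
    print(f"{name}: {value:+.2f}")
```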

The second area supports the creation of trustworthy software with ML assistance. Specifically, we combine machine learning with theorem provers such as Coq to automate proof synthesis. The aim is to make proving code correctness practical for large projects and accessible to developers without formal verification expertise.

More recently, I have begun studying how robust people are to incorrect natural language descriptions of formally specified postconditions. The goal of this work is to help developers calibrate their reliance on ML assistance when writing tests and use code assistants more effectively to create trustworthy software.
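
To make the setting concrete, the following hypothetical example (not taken from the study) pairs a formally specified postcondition with a subtly incorrect natural-language description of it.

```python
# Hypothetical illustration (not from the study): a formally specified
# postcondition paired with a subtly incorrect natural-language description.
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    result = min(max(x, lo), hi)
    # Formal postcondition: the result always lies within [lo, hi], inclusive.
    assert lo <= result <= hi
    return result

# An incorrect description an ML assistant might offer:
#   "Returns x if it lies strictly between lo and hi, otherwise raises an error."
# A developer who accepts this uncritically would write tests for exceptions the
# specification never promises, which is why calibrated reliance matters.
```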

Before coming to UMass, I worked for six years in industry as a Cyber Research and Innovation Engineer. I received my Bachelor's in Electrical Engineering from Boston University in 2015 and my MS in Computer Science from Worcester Polytechnic Institute in 2018.
My latest completed projects
Your Model Is Unfair, Are You Even Aware? Inverse Relationship Between Comprehension and Trust in Explainability Visualizations of Biased ML Models
Accepted to IEEE Vis 2025

Systems relying on ML have become ubiquitous, but so has biased behavior within them. Research shows that bias significantly affects stakeholders’ trust in systems and how they use them. Further, stakeholders of different backgrounds view and trust the same systems differently. Thus, how ML models’ behavior is explained plays a key role in comprehension and trust. We survey explainability visualizations, creating a taxonomy of design characteristics. We conduct user studies to evaluate five state-of-the-art visualization tools (LIME, SHAP, CP, Anchors, and ELI5) for model explainability, measuring how taxonomy characteristics affect comprehension, bias perception, and trust for non-expert ML users. Surprisingly, we find an inverse relationship between comprehension and trust: the better users understand the models, the less they trust them. We investigate the cause and find that this relationship is strongly mediated by bias perception: more comprehensible visualizations increase people’s perception of bias, and increased bias perception reduces trust. We confirm this relationship is causal: Manipulating explainability visualizations to control comprehension, bias perception, and trust, we show that visualization design can significantly (p < 0.001) increase comprehension, increase perceived bias, and reduce trust. Conversely, reducing perceived model bias, either by improving model fairness or by adjusting visualization design, significantly increases trust even when comprehension remains high. Our work advances understanding of how comprehension affects trust and systematically investigates visualization’s role in facilitating responsible ML applications.
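
For context, mediation of the kind described above is commonly checked with a small set of regressions. The sketch below illustrates that general recipe on simulated data; it is only an illustration of the technique, not the paper's analysis or data.

```python
# Illustrative mediation check on simulated data (not the paper's analysis):
# does bias perception mediate the effect of comprehension on trust?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
comprehension = rng.normal(size=n)
bias_perception = 0.6 * comprehension + rng.normal(size=n)                  # path a
trust = -0.7 * bias_perception + 0.05 * comprehension + rng.normal(size=n)  # paths b, c'
df = pd.DataFrame({"comprehension": comprehension,
                   "bias_perception": bias_perception,
                   "trust": trust})

total = smf.ols("trust ~ comprehension", df).fit()                     # total effect
path_a = smf.ols("bias_perception ~ comprehension", df).fit()          # path a
direct = smf.ols("trust ~ comprehension + bias_perception", df).fit()  # paths b and c'

print("total effect:   ", total.params["comprehension"])
print("direct effect:  ", direct.params["comprehension"])
print("indirect effect:", path_a.params["comprehension"] * direct.params["bias_perception"])
```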

QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning
ICSE 2025

Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs’ branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Compared with the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average, over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools’ search mechanisms.
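
At a high level, learning-guided proof synthesis of this kind searches the space of tactic sequences, using a learned model to decide which proof states to expand next. The sketch below shows a generic best-first search loop under that framing; it is not QEDCartographer's algorithm, and the prover interface and score function are hypothetical stand-ins for a Coq interaction layer and a trained network.

```python
# Generic best-first proof search guided by a learned scorer. This is only a
# sketch of the general approach, not QEDCartographer's algorithm; `prover`
# and `score` are hypothetical stand-ins for a Coq interface and a trained model.
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(order=True)
class Node:
    priority: float                              # lower = expanded sooner
    tactics: List[str] = field(compare=False)    # tactic script so far
    state: object = field(compare=False)         # prover-reported proof state

def synthesize_proof(prover, score, initial_state, budget=1000) -> Optional[List[str]]:
    """Expand the most promising proof states first, within a step budget."""
    frontier = [Node(0.0, [], initial_state)]
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)
        for tactic in prover.candidate_tactics(node.state):
            ok, next_state = prover.apply(node.state, tactic)
            if not ok:
                continue                         # tactic failed in this state; prune
            script = node.tactics + [tactic]
            if prover.is_complete(next_state):
                return script                    # all goals closed: QED
            # The learned model estimates how promising the new state is.
            heapq.heappush(frontier, Node(-score(next_state), script, next_state))
    return None                                  # budget exhausted without a proof
```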

My Model Is Unfair, Do People Even Care? Visual Design Affects Trust and Perceived Bias in Machine Learning
IEEE Vis 2023

Machine learning technology has become ubiquitous, but, unfortunately, often exhibits bias. As a consequence, disparate stakeholders need to interact with and make informed decisions about using machine learning models in everyday systems. Visualization technology can support stakeholders in understanding and evaluating trade-offs between, for example, accuracy and fairness of models. This paper aims to empirically answer “Can visualization design choices affect a stakeholder's perception of model bias, trust in a model, and willingness to adopt a model?” Through a series of controlled, crowd-sourced experiments with more than 1,500 participants, we identify a set of strategies people follow in deciding which models to trust. Our results show that men and women prioritize fairness and performance differently and that visual design choices significantly affect that prioritization. For example, women trust fairer models more often than men do, participants value fairness more when it is explained using text than as a bar chart, and being explicitly told a model is biased has a bigger impact than showing past biased performance. We test the generalizability of our results by comparing the effect of multiple textual and visual design choices and offer potential explanations of the cognitive mechanisms behind the difference in fairness perception and trust. Our research guides design considerations to support future work developing visualization systems for machine learning.

Passport: Improving Automated Formal Verification Using Identifiers
TOPLAS 2023

Passport enriches machine-learning-guided proof synthesis for the Coq theorem prover with information about identifiers, such as the names of theorems, definitions, and variables that appear in proof states. By teaching proof-synthesis models to encode and use identifiers, Passport enables them to prove more theorems fully automatically than the same models can without identifier information.