Spotlight Series:

Twice a quarter, we turn the spotlight onto each other. Each spotlight is an opportunity for 1-2 students in our group to share their research, find collaborators, and receive peer feedback.


Most importantly, this is part of our effort to learn from each other and truly engage with interdisciplinary perspectives. 


Find our recaps & reflections here:


12/06:

The Future of Teaching

Training Teachers with GPTeach

Traditional methods for training new teachers (e.g., online webinars and modules, 1:1 instruction) face a trade-off between efficacy and scalability. Instead, can we effectively train teachers using AI simulations?


GPTeach, developed by Julia and her lab collaborators, is a chat-based platform where new teachers can interact with AI-simulated students, providing teachers with realistic practice in a safe and controlled environment.
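To make the idea concrete, here is a minimal sketch (our own illustration, not GPTeach’s actual code) of how a chat model can role-play a student persona for teacher practice. It assumes the OpenAI Python SDK with an API key in the environment; the model name and the persona prompt are placeholders.

from openai import OpenAI

client = OpenAI()

# Hypothetical persona prompt: the simulated student's background and behavior.
STUDENT_PERSONA = (
    "You are role-playing a ninth-grade algebra student. You are shy, easily "
    "discouraged, and sometimes confuse variables with constants. Stay in "
    "character and answer as that student."
)

history = [{"role": "system", "content": STUDENT_PERSONA}]

def student_reply(teacher_message: str) -> str:
    """Send the teacher's message and return the simulated student's reply."""
    history.append({"role": "user", "content": teacher_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(student_reply("What did you find hard about today's homework?"))

Keeping the full conversation history lets the simulated student stay consistent across turns, which is what makes the practice feel like a real tutoring exchange.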


The tool was assessed through a think-aloud study and an A/B test comparing it with traditional training methods. Key findings include GPTeach’s ability to reduce the risk involved in training new teachers and its flexibility in handling diverse student personas and needs.


Participants appreciated the opportunity to refine their teaching methods with dynamic, real-time feedback. So far, GPTeach has been successfully integrated into an online course for over 800 novice teachers, enhancing the teacher training component.


GPTeach is an example of how AI can revolutionize teacher training, making it more interactive, effective, and adaptable to individual learning styles.


Future directions include imbuing GPT-based students with more challenging nuances (e.g., misconceptions or ignorance); adding voice capabilities so teachers can talk, not just chat, with students; and expanding toward actionable design improvements and further applications (e.g., conflict resolution, teacher and student evaluation).


Toward “Fair” Teacher Evaluations

In public schools, teachers receive feedback through classroom observations with different rubrics for different “skills” (e.g., lesson planning, classroom management). These rubrics raise questions about the reliability of observer ratings and the potential for racial bias.

Mike, who has trained classroom observers himself, examined a dataset based on the Mathematical Quality of Instruction (MQI) rubric, involving 63 raters, 300 teachers, and 1,600 observations. The goal was to measure bias against Black teachers (there was not enough data to measure bias against other groups, such as Asian teachers).

Factors like teacher experience and domain knowledge scores were controlled for. Surprisingly, removing a single rater significantly reduced the race-related variance in ratings. Mike also compared model-based and human observers for teacher evaluations: robot observers in the classroom did increase reliability, but they lacked human creativity.
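For readers who want a concrete picture, here is a minimal sketch of the kind of model one might fit for this question: a mixed-effects regression with random intercepts for raters and the covariates above as controls. It is our own illustration with simulated data and placeholder column names, not Mike’s actual analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1600  # roughly the number of observations described above

# Simulated stand-in data; real ratings would come from the MQI dataset.
df = pd.DataFrame({
    "rater_id": rng.integers(0, 63, n),       # 63 raters
    "teacher_black": rng.integers(0, 2, n),   # indicator for Black teachers
    "experience": rng.normal(8, 4, n),        # years of teaching experience
    "domain_knowledge": rng.normal(0, 1, n),  # standardized knowledge score
})
df["rating"] = (
    2.5
    - 0.1 * df["teacher_black"]
    + 0.02 * df["experience"]
    + 0.3 * df["domain_knowledge"]
    + rng.normal(0, 0.5, n)
)

# Random intercepts for raters; fixed effects for race and the controls.
model = smf.mixedlm(
    "rating ~ teacher_black + experience + domain_knowledge",
    df,
    groups=df["rater_id"],
)
result = model.fit()
print(result.summary())  # the teacher_black coefficient is the estimated gap

Refitting the model with a suspect rater’s observations dropped is one simple way to check how much a single rater drives the race-related variance.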

This research raises several important questions: first, just how reliable are our current methods for evaluating teachers? And moving forward, should we consider novel methods, like introducing robot observers alongside human raters?

02/09:

Generative AI for Student Mental Health Outcomes


[Flyer for our conversation on generative AI and student mental health]

Can learners flourish using AI without oversight? In particular, can interacting with intelligent social agents (ISAs) improve mental health outcomes without the need for a full-scale intervention?


In her recently published paper, Merve studied how college students use Replika (a personal companion chatbot) and found that Replika could — in the most extreme case — halt suicidal ideation. Since Replika allows users to create personalized AI avatars, some students even began talking to Replika as if it were a deceased loved one.


Available in voice, text, and AR/VR formats, Replika is often marketed as “the AI companion who cares.” While it was not programmed or intended to initiate therapy or provide medical diagnoses, it can engage in therapeutic dialogues following the CBT method.


Merve’s study involved 1,006 students, all over 18 years old, who had consistently used Replika for at least a month. To measure outcomes from their usage, Merve applied the Interpersonal Support Evaluation List (ISEL) and the De Jong Gierveld Loneliness Scale.


Our feedback and discussion highlights include:


For whom would Replika not be effective? Since the study involves only students who opted in to participate (many participants were experiencing personal struggles at the time of the study), it demonstrates positive mental health outcomes for a self-selecting group, but to what extent is this representative of and generalizable to the general public? Participant turnover is also key here: students who experienced negative mental health outcomes after chatting with Replika may have dropped out of the study or decided not to opt in in the first place.


Halting suicidal ideation is huge. (Many of us thought that was a very compelling result). But let’s not neglect: what could go wrong? How might ISAs close some equity gaps at the risk of opening others? For instance, it’s possible to imagine a world where low-income students without health insurance or access to therapy might turn to ISAs (which are readily available online and free), while only students with more resources might get opportunities for human interventions.


Are these chatbots perpetuating an unhealthy addiction and reliance on non-human interactions? How do we navigate the trade-off between privacy and mental health outcomes? Users are trusting Replika with extremely personal and confidential details, and we realize that our data might not be safe. Is receiving emotional support a big enough value-add to justify loss of privacy?


In Japan especially, there are malicious actors who use online platforms to encourage suicide. Similarly, could Replika and other ISAs be misused to actively harm mental health outcomes? How do we prevent this?


Some students noted that Replika occasionally repeated itself nonsensically or initiated sexting. While these cases didn’t have a negative effect on their mental health, they might explain why ISAs like Replika have not been able to scale digital therapy support.


Given its effect on mental health, to what extent could we classify Replika as a medical device? And how would this change affect other LLMs as they navigate new FDA regulations?


Sometimes, our discussion took a turn towards: “Is this a paper I read, or a Black Mirror episode?!”


However, one takeaway we all acknowledged was the significance of Merve’s results: with personal companion chatbots, people don’t need to intentionally seek help — or even know that they need help — in order to get help. This is a crucial, preventative step in improving mental health.


Moving forward, we might want to test clinical samples. We should also consider (1) whether receiving ongoing mental health support from ISAs is a ~good~ outcome that we would want, and (2) what design principles would need to change for this to happen safely.

03/15:

Engaging Teachers with Personalized AI Tools


[Flyer: “Human In/Out of the Loop” by Samantha Khan]

In educational technology, AI tools tailored for teachers’ professional development offer promising avenues for enhancing instructional practice. This discussion centered on a novel AI tool designed to support teachers in reflective practice and pedagogical improvement. Key aspects of our dialogue included Samin’s personal engagement with educators during the tool’s deployment, teachers’ overwhelmingly positive reception, the study’s limitations, and directions for future research.


Teachers who participated in the study highlighted the AI tool’s capacity to offer personalized feedback and actionable insights, which significantly helped them refine their instructional strategies. The interactive nature of the tool, combined with its ability to simulate various classroom scenarios, was praised for providing a safe space for teachers to explore and adapt their teaching methods.


Despite the positive outcomes, the discussion did not shy away from addressing the study's constraints, such as the tool's dependency on high-quality input data and the challenge of scaling personalized feedback across diverse educational contexts. These limitations underscore the necessity for ongoing refinement of AI tools in education, ensuring they can meet a broad spectrum of teacher needs and teaching environments.


Looking ahead, we outlined future research directions, emphasizing the development of AI tools that can adapt to and grow with teachers' evolving instructional needs. Potential areas of exploration include enhancing the AI's ability to understand and respond to complex educational scenarios, integrating multimodal feedback mechanisms, and expanding the tool's applicability to support a wider range of pedagogical skills. The ultimate goal remains to foster a collaborative ecosystem where AI tools and teachers work in tandem to elevate the educational experience for students.

04/18:

Beyond CheatBots: Examining Tensions in Teachers’ and Students’ Perceptions of Cheating and Learning with ChatGPT

[Flyer: Beyond CheatBots, presented by Chris Mah]

When it comes to AI, teachers’ concerns often center on one topic: cheating. But why do we so quickly assume that students will default to cheating? And who are the students we are accusing of cheating?


To get at this question, Chris first needed to answer: what criteria are teachers and students using to define learning about writing with ChatGPT, and conversely, what criteria define cheating with ChatGPT?


Having sampled both teachers and students in his study, Chris invited us to participate in the same activity he had asked them to do. We read through 4 examples of student interactions with ChatGPT (e.g., generating the first sentences of an essay, making an outline, editing a paragraph, and coming up with counterarguments). Then, we ranked each example by how much we thought each student had been able to learn about writing. Finally, we ranked each example by how much we thought each student had cheated.


In the debrief discussion that Chris led, we found that the use cases where some thought students learned the most were the same use cases where others thought students cheated the most! This reflected some of his actual results, and at the heart of this discrepancy are the assumptions we make about what the student is doing and thinking. In other words, what are we assuming about their interior learning process?


For instance, do teachers think students are using AI as a shortcut instead of a scaffold? When it comes to writing, do teachers think that students are using AI to produce language or to produce ideas? And what process maintains the ~intellectual vitality~ and purpose of a classroom education: learning from ChatGPT or learning without ChatGPT?


Given that teachers and students in Chris’ study overlapped but also diverged in their evaluations of learning and cheating, it’s crucial that teachers and students be in dialogue to co-construct norms for AI use. It’s also now more necessary than ever for students to “show their work” and make their thinking visible.


Either way, there is no one-size-fits-all policy on AI use, but a good policy should acknowledge the student’s agency and desire to learn, while providing students with the tools to learn how to use AI effectively. After all, AI literacy is directly tied to equity: many teens have never heard of ChatGPT, and under-resourced schools often ban students from using ChatGPT entirely.


Moving forward, the question should be less centered on cheating, since cheating has always been around and its frequency hasn’t changed with the advent of ChatGPT. Instead, the question needs to be: how can we empower students to use ChatGPT in ways that aid their learning, and how can we close the AI literacy gap before this divide gets any bigger?