The co-founder of Koko (a non-profit that offers peer mental health support) has a Twitter thread (https://twitter.com/RobertRMorris/status/1611450197707464706) about an experiment where they fed requests for help to GPT-3 and help providers could then send those AI-generated support messages rather than their own. They found that AI responses were rated higher, but also that "once people learned the messages were co-created by a machine, it didn’t work." But there have been some interesting questions about the ethics... 🧵 #gpt3
Rob Morris on Twitter: “We provided mental health support to about 4,000 people — using GPT-3. Here’s what happened 👇”
I'm a little confused by this response about informed consent (https://twitter.com/RobertRMorris/status/1611582827224797185), but I think it illustrates a significant problem among some researchers: conflating "research ethics" with "would an IRB allow me to do it," which is potentially really harmful. I would hope that the reason to seek informed consent isn't because a regulatory body forces you to, but because it is the right and ethical thing to do. (2/n)
Rob Morris on Twitter: “@royperlis This would be exempt. The model was used to suggest responses for help providers, who could opt in to use it or not. We didn’t use any PII, all anonymous data, no plan to publish. But MGH's IRB is formidable... Couldn't even use red ink in our study flyers if i recall...”
But regardless, based on the thread it seems that though the help providers were aware of the AI (since they were choosing to use it), the people seeking help were not. Though based on the "once people learned" finding, at least some of them must have been debriefed? Were they essentially following typical protocol for a deception experiment? (Though if that were the case, I would have expected that as an answer re: consent rather than "we didn't have to.") (3/n)
The Twitter thread emphasizes that they weren't using PII, but prompts from people seeking mental health support are still potentially quite sensitive, and some folks on Twitter were concerned about data going back to OpenAI - I assume that GPT-3 can run internally though? In which case I suppose the privacy risks would be the same as when people choose to use the system at all. (4/n)
But I think that even outside of privacy concerns, a lot of people just don't like the idea of such sensitive content potentially being used to train AI without their consent, which is something that we should know from the backlash against Crisis Text Line. (5/n)
In fact, a lot of people are upset about being "experimented on" without their consent regardless of the context. Even though this is sometimes framed as "it's just A/B testing!" when it happens on a platform/product, sensitive contexts (e.g. mental health, emotion) are a special case. (We actually found this when studying reactions to the Facebook emotional contagion study: https://cmci.colorado.edu/~cafi5706/UnexpectedExpectationsNMSPreprint.pdf ) (6/n)
I do think that maybe they found something interesting with this experiment! (Hard to tell from a quick Twitter thread.) But this is such a sensitive context that I would have expected clarifications about ethics and care to be front and center, and the number of responses from people who are upset definitely illustrates why. (7/7)
@cfiesler what do you think was (might be) interesting?
@danmcquillan I mean I do think whether we like it or not we'll get to a point where AI is strongly involved in just these kinds of things - hopefully with consent - so understanding how that could impact people's experiences could be important. There are also definitely ways of doing that kind of research ethically.
@cfiesler ok, got you. i hope it's not inevitable, and i think it would be a very bad idea indeed.
@cfiesler i suppose it does at least offer a sharp focus for differentiating machinic solutionism and actual care
@cfiesler In my experience it's pretty notorious that ppl who do online behavioural research don't have the right approvals in place, and when asked often resort to the A/B testing argument. You wouldn't use it when talking to a participant in a controlled trial though, so why here? There is a participant, group assignment and intervention: pretty much how WHO defines controlled trials.

@cfiesler The flip side of deceptively using AI for mental health treatment is deceptively using chat logs of mental health treatments to train enterprise customer service AI.

"Suicide hotline shares data with for-profit spinoff, raising ethical questions"

"The Crisis Text Line’s AI-driven chat service has gathered troves of data from its conversations with people suffering life’s toughest situations."

https://www.politico.com/news/2022/01/28/suicide-hotline-silicon-valley-privacy-debates-00002617

@cfiesler wow, that's just awful what they did, and I see they got ripped apart for it. Skipping regulatory channels. That dude is shifting his story. Nothing worth learning when it does damage like that. So it begins......
@cfiesler let me know if you want an intro to Rob, in case he's able to share more about the protocol
@natematias I would encourage him to share it in the Twitter thread! There are a lot of very upset people.
@cfiesler @natematias honest question, does doing this on twitter ever actually work? i'd personally advice not to do anything on twitter, write a long-form description of the protocol, and post it on their website eventually
@jbigham @natematias Well, he chose to share the findings on Twitter without sharing anything about the ethical considerations. If he wants to address the accusations of unethical research, those accusations are on Twitter, so I think it makes sense to address them on Twitter.
@cfiesler @natematias idk, i think the dynamics of twitter makes that basically impossible. glad i'm not there anymore!
@jbigham @natematias I mean if for some reason it's impossible to address the ethical considerations for research on Twitter then I would suggest not posting about research there at all. 🤷‍♀️ (Which one might well argue lol.)

@cfiesler @natematias specifically, my argument is it's not possible to do this on Twitter **now**… even a great explanation wouldn't spread, nobody's looking to RT it. much more likely is people would be super primed to argue with it even if it's kind of reasonable. and all the while the old stuff keeps spreading without knowledge of the new stuff.

but, yeah, twitter is double-edged like that. great way to get your message out, but watch out!

@cfiesler
Forgive me for a third comment in quick succession :) but I do also think that "it's just A/B testing" is something that has a tricky relationship with consent and trust. Have been watching the recent Duolingo UX debacle and am impressed by the level of anger and confusion expressed by users who seemingly randomly received radically different UIs and views on their learning. Perhaps some companies are too ready to experiment without explicit consent for participation.

@emmatonkin @cfiesler Duolingo has always been an A/B testing hellscape. I don’t think it would be unfair to call it the service’s core principle: to try to discover the most effective way to teach languages by treating the user base as a large test group. It has always been contentious. For one thing, there’s a built-in disregard for disability, which didn’t seem to conflict with the values of the organization enough to do more about it than acknowledge that it’s unfortunate.

I don’t think there’s any malice in this. I think it’s a result of tunnel vision driven by idealism, to make a free learning service that works better than the often very expensive existing ones, using the kinds of hooks a video game might have to help people stay engaged. It’s a noble goal, and I do think it probably has made a positive contribution to the state of the specific category of language learning products to which it belongs.

But their simplistic “data speaks loudest” approach to deciding how best to teach human communication of all things is entirely absurd. They cannot measure learning. They hope that it’s a necessary byproduct of what they do measure, but their absolute focus on numbers makes learning that doesn’t create better numbers undesirable. They actively discourage learners from taking their time with lessons in favor of advancing further at a faster pace. I strongly suspect that they have managed to transform the pop up tips from a helpful option that allows some needed flexibility into a way to ensure that people will continue to advance well past the limits of the vocabulary and grammar they actually know. It’s not that it can’t or doesn’t teach anything at all, but it does profoundly undermine itself.

Ethical questions aside, the anger and frustration has been there the whole time. I think everyone wants it to work and to be good. My guess is that the amount and depth of the dismay is probably not just about this change, or even just the test group thing, so much as it’s about a long history of disregarding user feedback in favor of data while not actually being able to deliver on the promise of more effective language learning. The belief that you can disregard disabled people in order to serve the average person better is just one symptom of a much larger misunderstanding of people and learning, and I think it really shows in how it feels to use Duolingo. It’s not surprising that users would be mad about another disruptive change that doesn’t address the most serious problem at all.

@robotrecall @cfiesler
I think perhaps the core principle of that particular service has changed over time. Many people probs recall the brand's history as a (formerly) community-led attempt at building a) the most effective way to study languages and b) a large dataset through which to do so, but, well, things change, and so does consent. Participating in a community project is very different to purchasing a service that *still treats you as an uninformed test subject*.
@robotrecall @cfiesler
If I am donating £ for participation in a community project i believe in and it also wants me to try some new features to help others, I might - could be fun. If I am paying for an IPO'd commercial service and they then decide to get all GLaDOS on me w/o consent or withhold parts of the training material *for which I pay*, they cannot expect the same tolerance or approval of their new shareholder-pleasing mission.
@robotrecall @cfiesler
Same goes wrt disability and accessibility. Community led project that is still learning and trying to do better? Perhaps it's unfair, but we would probably cut them a lot more slack than a commercial service that fails in these ways, because they are perceived as a work in progress. A commercial service with no community mission gets no such tolerance - we (probably fairly) assume they are motivated by the almighty $.

@cfiesler I find that we’ve normalized “just A/B testing” too quickly.

For example, when the App Store was introduced, you deliberately bought/installed something, in part based on its description. Then you deliberately installed updates, based on their release notes.

This agency has since been taken away: release notes are just an endless repetition of “we update xyz in order to make the app better”, and the actual app may or may not change without your consent or control.

@cfiesler there’s a difference between A/B testing and clinical trials. Toying with people’s mental health issues using AI is both clinically unethical and technologically unethical.

Also, Facebook is a low bar.

@cfiesler another slightly less extreme example: it is absolutely infuriating how Duolingo experiments on users. With the recent change in the pathways/unit system - a MAJOR change that affects how you use the app completely - it wasn't just A/B tested by user (bad enough for trying to compare progress with friends), but also by device, so even though both of my devices (one desktop, one mobile Android) were on the same account, I had two different interfaces/systems to work with...
@cfiesler
Afaik GPT3 is available only via API unless something significant has changed?
@cfiesler gpt3 can only be run on OpenAI’s servers, which are Azure servers
@Riedl so to clarify: in this case the prompts used would be accessible to OpenAI?
@cfiesler @Riedl and probably all the text generated by the model as well
@cfiesler the prompts pass through OpenAI’s systems running on Azure servers that they control. I don’t know what OpenAI retains or logs, but there is no technical reason they can't see everything going in or out.
@Riedl @cfiesler even if they didn't look at the data, a failure to build a hard firewall is already a problem.
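
For anyone unfamiliar with the mechanics: GPT-3 couldn't be self-hosted at the time, so every call meant sending the prompt text to OpenAI's hosted API. A minimal sketch of what such a call looked like, assuming the pre-1.0 `openai` Python client; the model name, parameters, and placeholder text here are purely illustrative, not Koko's actual setup:

```python
import openai

openai.api_key = "sk-..."  # hypothetical key; real deployments keep this server-side

# Placeholder only -- in the real system this would be, or would include,
# text written by the person seeking help.
prompt_text = "Example help-seeker message plus instructions for a supportive reply."

# The full prompt text is transmitted to OpenAI's hosted API, so generation
# happens on servers OpenAI controls rather than locally.
response = openai.Completion.create(
    model="text-davinci-003",  # assumed GPT-3 model; the actual model/config isn't public
    prompt=prompt_text,
    max_tokens=150,
    temperature=0.7,
)

suggested_reply = response["choices"][0]["text"]
print(suggested_reply)
```

Whatever the front end promises, the raw text in `prompt_text` leaves the organization's own infrastructure at this step.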

@cfiesler I believe the author later clarified that the people were not directly chatting with the model. It was used more as a tool to help peers craft their responses.

While having a human in the loop does mitigate some of the PII issues, the lack of informed consent still stands.

@rajatsahay @cfiesler
The question I'd have is: did the model see input from the vulnerable person? Or was it preprocessed/summarised in some way, and if so, how and for what purposes? For example, if the model saw said input, an absolute minimum might be to remove identifiers/pseudonymise, though sharing even pseudonymised personal info with a third party service without explicit informed consent is still a serious issue.
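
To make "remove identifiers/pseudonymise" concrete: even a naive pass looks something like the sketch below. The patterns and placeholders are purely illustrative assumptions, and regex scrubbing like this is nowhere near sufficient for real de-identification of free text.

```python
import re

# Illustrative-only patterns: real de-identification of free text needs
# dedicated tooling plus human review, not a handful of regexes.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),   # email addresses
    (re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"), "[PHONE]"),      # phone-number-like digit runs
    (re.compile(r"\bmy name is [A-Z][a-z]+", re.IGNORECASE), "my name is [NAME]"),
]

def pseudonymise(text: str) -> str:
    """Replace obvious identifiers with placeholders before any external API call."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(pseudonymise("My name is Alex, reach me at alex@example.com or +1 303 555 0100."))
```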

@emmatonkin @cfiesler I think the demo video showed that the operators had the option of directly forwarding the responses to the model. I'm assuming (hoping) the humans acted as filters for personal stuff.

What's worse is that stuff like this just sets precedent for even more outrageous applications of LLMs.

@rajatsahay @cfiesler
Ack, though. A) I wonder what guidance, training, eval they were given because that's quite some task to carry out in a hurry and B) hang on, is the LLM responding with no context other than the last message received, then? More usual to give it context for a (seemingly) relevant answer.

Totally agreed re precedent. Not only does it need careful regulation, but I suspect this is already in breach of existing regs.

@rajatsahay @cfiesler tbh I don't think "i gave the thread a quick scan, decided it contained no personal data and sent it w/o explicit consent to a cloud service that famously retains data" is a fair responsibility to give to J Random Employee, at least not without significant work to ensure they were adequately trained, that risks were handled and that the task was realistic. Setting them up for trouble otherwise.

@emmatonkin @cfiesler your response perfectly highlights a huge problem with AI hype. Most companies cite human moderation to deploy borderline illegal services, claiming their "AI model" gives unreal performance - and are within the letter of the law.

When the model inevitably fails, any blame for the misaligned decisions is put directly on the same moderators - who usually receive little to no training on how to handle these situations.

@rajatsahay Yep, I watched the video that showed exactly how it worked. And there was an option for the peer to just send the AI's response without editing it. 🤷‍♀️
@cfiesler Sadly, too many researchers have no moral compass other than "the law"

@cfiesler
"We didn’t use any PII, all anonymous data"

I would love to see the method they used to ensure that the data are anonymised. Unless, shockingly, it turns out to be "Assert on the basis of total convenience and no analysis whatsoever that the data are anonymised and do as you wish from that point".

@cfiesler I totally agree!! I also think all papers should discuss their consent process. I’ve seen a lot of papers not discussing this and I think it’s important. Great thread!!
@cfiesler Reminds me of work Sharif Mowlobocus did about gay men’s health bots being more effective chat partners than humans, and humans more effective when thought to be bots. Only in reverse.
@cfiesler I’m not at all surprised that once people found out that the “empathetic sounding” responses they were receiving were from AI & not a real person, the responses didn’t work. Because empathy can only come from someone who can feel empathy, which AI cannot do. It’s the right brain to right brain connection of creating a relationship that is the greatest factor in successful therapy. 1/2
@cfiesler Of course it feels empty when one of the participants has an electronic brain and no ❤️. We’ve seen this struggle many times in literature/film. [Insert picture of Data from #StarTrek here] 2/2
@cfiesler Humans are easily fooled. Could AI parents say better things - give more support - than human parents? Is it easier to fake humans than to try to distribute actual human kindness? Can AI only be as good for us as the best of us, or are we envisioning building something superhuman - like a lucky stab at the mind of a god? Is it more cost effective to maintain big decision-making algorithms than to pay human leaders to consult, decide, and take responsibility for the outcomes? #WLinAI