
Beyond Artificial Sociopaths

To teach AI what our shared values are, we need to first figure them out ourselves. The work of the Meaning Alignment Institute harnesses AI to help us in the process.

Isabel Bortagaray

In premodern times, the coordinates of a meaningful life were handed down to us by despots and deities. In our late liberal present, each of us is tasked with mapping them anew—a task we are largely left to face alone. The story by now is well told: once-infallible institutions face a crisis of legitimacy, algorithmic media has turned the common ground sometimes called “consensus reality” into a hall of mirrors, and the complexity of planetary society has outpaced the ability of our cognitive wetware to keep up. Many would count the rise of large language models as one more chapter in this story of semantic collapse, but the team at the Meaning Alignment Institute (MAI) thinks these models could be part of the long-awaited plot twist that moves us towards something new.

Founded by Joe Edelman, Ellie Hain, and Oliver Klingefjord, MAI is building out both concrete tools and broad coalitions of practitioners in service of ensuring “human flourishing post-AGI.” You may have seen Hain’s Adam Curtis-esque video manifesto “EXIT THE VOID,” while Klingefjord previously worked on AI-assisted deliberation at the AI Objectives Institute. Edelman has been in the game a little longer: he developed the organizational metrics at Couchsurfing.com in the mid-2000s before co-founding the Center for Humane Technology with Tristan Harris in 2013, where he coined the term “Time Well Spent” for a family of metrics adopted by teams at Facebook, Google, and Apple.

The trio’s most recent experiment is a case study in how we might align AI to shared human values rather than individual preferences or lump-sum averages of humanity. Typically, alignment research focuses on ensuring a given AI abides by “operator intent”—meaning it does what the user wants it to do. This framework presents a clear collective action problem: obedience to a single user’s goals can cause societal harm—ultimately incentivizing the creation of what MAI terms “artificial sociopaths.” Yet attempts to incorporate broader notions of the good into AI training have run into a thorny translation problem: “the good” is plural and often poorly defined, and so difficult to turn into a measurable benchmark. MAI’s prototype aims to tackle both these issues head on, offering a precise way to define both values themselves and their broader relational context.

There are two parts to the case study. First is a method of collecting people’s values, or “sources of meaning.” “We use LLMs trained to ask people about their decision-making processes,” Edelman tells me. “These models listen to stories of real-life choices from people’s past and focus on what individuals pay attention to during these decisions. We then use this ‘path of attention’ as a way of specifying a person’s values.”

This approach borrows from philosopher Charles Taylor’s notion of values as a means of evaluating our choices and actions, a practical guide to experience rather than an abstract, fuzzy concept used to describe what we did after the fact. Instead of “freedom,” this approach is more likely to yield an underlying value like “Can I encourage the person to make their own choice without imposing any other agenda?”—something that’s immediate, actionable, and context-sensitive.
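
To make this concrete, a value elicited this way might be stored as a small record pairing the choice context with the criteria the person attends to. The sketch below is illustrative only; the field names are assumptions, not MAI’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ElicitedValue:
    """One person's "source of meaning," captured as a path of attention.

    Hypothetical schema: a short label, the real-life choice the story was
    about, and the criteria the person reports attending to while choosing.
    """
    title: str
    choice_context: str
    attention_path: list[str] = field(default_factory=list)


# Loosely based on the "freedom" example above.
supporting_autonomy = ElicitedValue(
    title="Supporting autonomous choice",
    choice_context="Advising a friend facing a difficult decision",
    attention_path=[
        "Can I encourage the person to make their own choice?",
        "Am I imposing an agenda of my own?",
        "Does the person seem clearer afterwards about what matters to them?",
    ],
)
```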

Second, these values are pooled into a data structure called a “moral graph,” which provides a basis for reconciling individual values into collective wisdom. Because the values collected are lived and concrete, their relative wisdom can be graded in contextual rather than absolute terms.

In addition to providing values, participants also weigh in on ChatGPT-generated stories in which a fictional person transitions from one value to another, judging whether that person has become relatively wiser. Did the new value clarify what was important about the former one? Did it balance it with another important value?
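
One way to picture the resulting structure, sketched here as a toy rather than MAI’s actual implementation: each value becomes a node, and each judged transition becomes a directed edge, tagged with the context it applies to and a tally of participants who endorsed it as a gain in wisdom.

```python
from collections import defaultdict


class MoralGraph:
    """Toy moral graph: values as nodes, "wiser-than" transitions as edges.

    Each edge (from_value -> to_value) is keyed by a context string and
    accumulates endorsements from participants who judged the transition
    to be a gain in wisdom. Illustrative only.
    """

    def __init__(self) -> None:
        self.values: set[str] = set()
        # (from_value, to_value, context) -> endorsement count
        self.edges: dict[tuple[str, str, str], int] = defaultdict(int)

    def endorse_transition(self, from_value: str, to_value: str, context: str) -> None:
        """Record one participant judging to_value as wiser than from_value
        within the given context."""
        self.values.update({from_value, to_value})
        self.edges[(from_value, to_value, context)] += 1


graph = MoralGraph()
graph.endorse_transition(
    from_value="Discipline above all",
    to_value="Nurture curiosity and understand the child's interests",
    context="Advice on parenting challenges",
)
```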

“Our findings contradict the common intuition that people always defend their own values,” says Edelman. “Because we’re used to seeing people argue over values, one might expect that no one would admit someone else’s value is wiser than their own. However, we found exactly the opposite: when people are shown a value that’s wiser than theirs, they will tend to agree that it is indeed wiser.”

This input feeds into the graph structure, revealing patterns of shared meaning across diverse value systems. The resulting network is remarkable: first, it illuminates areas of consensus even among seemingly irreconcilable values; second, it identifies instances where perceived value clashes are context-dependent rather than fundamental. One case, for example, explores how ChatGPT should provide advice on parenting challenges. The consensus favors nurturing a child’s curiosity and understanding their interests over discipline alone, but not when the child’s daily environment and routine are completely unstructured. Instead of deferring to individual preferences or blanket notions of the good, the moral graph presents a democratic means of surfacing the “wisest” notions of what matters within a given population.
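
Surfacing the “wisest” values for a given context can then be as simple as reading the graph: for instance, picking values that receive well-endorsed incoming transitions but have no well-endorsed outgoing ones, meaning nobody in the sample found something wiser still. The heuristic below is an assumption for illustration, not MAI’s published aggregation method.

```python
from collections import defaultdict


def wisest_values(edge_endorsements, context, min_endorsements=2):
    """Naive heuristic for surfacing "wisest" values in one context.

    edge_endorsements maps (from_value, to_value, context) -> count.
    A value is surfaced if it has well-endorsed incoming "wiser-than"
    edges and no well-endorsed outgoing ones within that context.
    """
    incoming, outgoing = defaultdict(int), defaultdict(int)
    for (src, dst, ctx), count in edge_endorsements.items():
        if ctx != context or count < min_endorsements:
            continue
        incoming[dst] += count
        outgoing[src] += count
    return sorted(value for value in incoming if value not in outgoing)


edges = {
    ("Discipline above all",
     "Nurture curiosity and understand the child's interests",
     "Advice on parenting challenges"): 5,
}
print(wisest_values(edges, "Advice on parenting challenges"))
# ['Nurture curiosity and understand the child's interests']
```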

What remains is to put this promising proof of concept to the test. MAI is currently planning a scaled-up version of the initial study that would form the basis for actually fine-tuning a model on the resulting moral graph. Challenges include tweaking the UX so it elicits good responses across different cultural contexts (the original study was limited to Americans) and ensuring the process doesn’t inadvertently privilege more articulate participants (a classic problem in direct democracy). Another question is whether participants’ enthusiasm will carry over to the resulting chatbot: a post-participation survey found that they considered the value-elicitation process personally clarifying, trusted it, and came away with greater respect for people whose values differed from their own. An AI that adheres to collective wisdom rather than user-focused sycophancy will challenge its users. The potential payoff, however, is an assistant that serves as a conduit to a deeper understanding of ourselves and others rather than a mere productivity aid.

This broader orientation towards collective meaning makes MAI optimistic that users will find value in such a tool. If case study participants enjoyed articulating their shared wisdom, perhaps this is an indication of what we’re all longing for right now. “Proving these systems can work is important, but it's not enough for large-scale adoption. That will take a broader political shift. The current political order—liberalism—puts individual freedom in the highest place, and fairness somewhere below it,” says Edelman. “The systems we’re building go beyond freedom and fairness: they try to find collective wisdom; they try to locate what makes lives and communities meaningful. That’s why we’re called the Meaning Alignment Institute. What can join freedom and fairness as an organizing principle for society? We think it’s likely to be meaning.”