Model Plurality
Current research in “plural alignment” concentrates on making AI models amenable to diverse human values. But plurality is not simply a safeguard against bias or an engine of efficiency: it's a key ingredient for intelligence itself.
Efforts to align AI to what is safe or desirable consistently run into the problem of diverse perspectives on what exactly constitutes safe or desirable. Rather than face the decision of ranking one worldview over another, a growing number of machine learning researchers have instead called for building pluralistic AI, hoping to cram a single model with a set of interchangeable value systems spanning the entire repertoire of cultural moral primitives. These efforts are focused on achieving values plurality: making large language models (LLMs) like ChatGPT amenable to a spread of human values rather than biased towards a single, politicized perspective. “Pluralistic alignment” research defines several possible forms of values plurality in models: Overton pluralism, where the model responds with a spectrum of Overton-reasonable answers; steerable pluralism, where the model can be steered towards a given perspective; and distributional pluralism, where the model calibrates its distribution of answers to represent a particular population. Critics of these strategies, on the other hand, maintain that some lines in the sand must be drawn. Total openness to all possible values invites the “paradox of tolerance”: tolerating certain values (such as fascism or hate speech) allows them to extinguish the possibility of all others.
On either side of the debate, the political valence of chatbot output is of utmost concern, as these LLMs are increasingly capable of enacting monolithic behavior across wide swathes of society at previously impossible scales. Both sides, however, make the mistake of over-indexing on individual models as the zone of contestation.
There is increasing evidence that any single model is only capable of maintaining a single ontology: its particular geometry of organizing and interrelating information, concepts, and categories. Research in mechanistic interpretability aims to understand the inner workings of machine learning models and figure out why they produce the outputs they do. A common technique involves looking at the neuron activations within an LLM (its internal state before producing a response) and matching activation patterns with human semantic concepts such as the Golden Gate Bridge or the idea of internal conflict. This technique has revealed that not only do distinct concepts have distinct geometric patterns, but there are also shared geometries for the relationships between high-level semantic concepts, such as categories (“cats”), nesting hierarchies (“cats” as a subset of “mammals”), or contradictions (the shape of “all cats are mammals” versus “all cats are not mammals”). This suggests that while rich concepts can be embedded within the latent space of a model, these concepts are still structured within a singular, rigid web of relations; in other words, two directly contradictory values cannot be held at the same time. This imposes hard limits on the entire values plurality project: while a model can few-shot learn to give different answers to different people, the possibility of arriving at those answers is constrained from the outset by an immobilized ontological architecture. The underlying structure of categories, concepts, and semantics that gives values their meaning remains static.
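The interpretability research described above uses far more sophisticated machinery (such as sparse autoencoders trained on activations) than can be shown here, but the basic move of matching internal activations to human concepts can be sketched crudely. The snippet below is a rough illustration rather than the actual technique: it uses the small open GPT-2 model to build a “concept direction” as the difference of mean activations between prompts that do and do not evoke a concept, then scores new prompts by projecting onto that direction. The model choice, prompts, and difference-of-means recipe are all assumptions made for the sake of illustration.

```python
# Crude sketch of activation probing: find a direction in a model's hidden
# state that tracks a concept, then measure new prompts against it.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def activations(prompt: str) -> torch.Tensor:
    # Mean of the final hidden layer over tokens: one coarse summary of internal state.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def concept_direction(positive_prompts, negative_prompts) -> torch.Tensor:
    # Difference of mean activations between concept-evoking and neutral prompts.
    pos = torch.stack([activations(p) for p in positive_prompts]).mean(dim=0)
    neg = torch.stack([activations(p) for p in negative_prompts]).mean(dim=0)
    direction = pos - neg
    return direction / direction.norm()

# How strongly does a new prompt activate the probed concept?
bridge = concept_direction(
    ["The Golden Gate Bridge spans the bay."],
    ["The recipe calls for two eggs."],
)
score = activations("Fog rolled over the orange suspension towers.") @ bridge
```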
The biases of any single model’s ontology are less of a risk than the capacity of that ontology to rapidly reproduce itself and reshape the world in its image. AI ethicists Kathleen Creel and Deborah Hellman describe widespread algorithmic decision-making systems like resume parsers or loan arbitrators as “algorithmic leviathans,” which are harmful not for the arbitrariness or opacity of any decisions they make but for the systematicity of these decisions. Previously, a job candidate might face a biased or arbitrary decision from an individual hiring manager, but each manager had their own unique criteria; now, candidates could face systemic exclusion at the hands of a standardized algorithm. For example, over a third of Fortune 100 companies use the same algorithmic candidate screener, HireVue.
The selection of generative AI available to the general public is similarly constrained, and what little diversity does exist in the market is an illusion. Different chatbots offer largely indistinguishable performance, and their specific behavioral quirks arise mostly from particularities in their system prompts. Whether made by Google, OpenAI, or Anthropic, the underlying model is nearly identical: trained on essentially the same dataset, via the same transformer architecture, and aligned to a crude approximation of humanity. It follows that the wider ontology, the particular organization of knowledge contained in these models, is also similar.
The actual stakes here go beyond political debates over what human values a machine is allowed to consider. Widespread and continued use of similarly constructed LLMs could lead to the homogenization of human thought itself. As the general public increasingly relies on chatbots for cognitive tasks, these models have the capacity to shape and structure the semantics, categories, and relations of knowledge. To allow for multiple ontologies and dynamic possibility, the machine learning landscape should develop a model plurality that emerges from multiplicity at every component of the model. This will require not only diversity across the datasets, model architecture, and training methods by which the models are created, but also a diverse market where no single model dominates.
Homogenizing Incentives
Produced by an oligopolistic market, the existing ecosystem of models available to the public resembles a brittle monoculture. What are the economic and political incentives that led to this homogeneity? Machine learning research post-ChatGPT closed off and turned inward as the opportunity for commercial gain rapidly became apparent. Because training requires massive amounts of data, compute, and labor, the AI market inevitably tends towards centralization. Only a handful of institutions on the planet can afford the resources required to train competitive LLMs. Once these resources are acquired, however, they end up producing similar models due to shared technical requirements and short-term incentives.
One reason for homogeneity is a basic resource requirement of the models themselves: data. In the current dominant paradigm of unsupervised learning, where data is unlabeled and the model learns inherent patterns from the dataset, training on the largest possible corpus is crucial. Nowhere is data more public and prolific than on the internet, where 400 exabytes (equivalent to 400 million terabytes) are produced daily. The amount of data required to give generative models their generalized capabilities can only be acquired from this single source; the models of today could not exist without the internet. Even as OpenAI signs deals with the likes of Reddit for exclusive access to privatized data, and even as data quality is increasingly shown to boost model performance more than sheer quantity, the voracious appetite for data creates largely indistinguishable training sets.
This is because scaling laws all but guarantee improvement with increased data and compute. Companies invest in accumulating datasets and data centers, preferring low-risk iteration on the same model architectures that can harness this investment over high-risk exploration of alternatives that may not require a massive yet homogeneous corpus. Thus, public-facing generative AI from different companies ends up using the same handful of architectures: autoregressive transformers for text and diffusion models for images.
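One widely cited empirical form of these scaling laws, from the Chinchilla analysis, predicts a model’s loss L from its parameter count N and training tokens D; the constants E, A, B and the exponents are fit to particular training runs and are shown here only to illustrate the shape of the relationship:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because loss falls predictably as N and D grow, pouring money into more data and more compute for a known architecture is a safe bet in a way that an unproven architecture is not.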
Meanwhile, the fine-tuning process models undergo to become palatable for the public also induces homogeneity, due to short-term market incentives and extant technical requirements. During the pre-training phase, models learn latent patterns from ingested data unsupervised; because most of the corpus is scraped from internet detritus, the models must then be aligned to pro-social values and made compliant with company policy. This is done through reinforcement learning from human feedback (RLHF), a technique in which humans rate or provide examples of acceptable model output according to internal standards and the model learns to satisfy these requirements. Plurality is erased through RLHF in two ways. First, the algorithm fundamentally rewards samples with high inter-annotator agreement, trading nuance for a sanitized average of human opinion. Second, the financial imbalances of digital labor in a globalized market funnel companies towards selecting human annotator pools from the same Global South countries, where anglophone labor is cheap. Hence, not only does the current RLHF pipeline attune models to a flattened average of annotator preferences, the annotators themselves are recruited from a narrow subset of the global population. Indicative artifacts of this contrived specificity include post-RLHF models like ChatGPT overusing the word “delve,” a usage common in the Nigerian English spoken by many annotators. While ongoing research into multi-objective optimization for RLHF hopes to make more pluralistic alignment possible, for what and to whom these models are aligned remains opaque and centralized in the hands of private companies.
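To make the flattening mechanism concrete, here is a minimal sketch of the preference-learning step at the heart of RLHF: a reward model trained with a pairwise (Bradley-Terry) loss on annotator choices. The architecture, dimensions, and data below are toy placeholders, not any particular lab’s pipeline.

```python
# Toy reward model: learns a scalar score such that annotator-preferred
# responses score higher than rejected ones.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained LM backbone that embeds a response.
        self.encoder = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.score = nn.Linear(hidden_dim, 1)  # scalar "reward" per response

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(self.encoder(response_embedding)).squeeze(-1)

def preference_loss(model, chosen, rejected):
    # Each pair encodes one annotator judgment: "chosen" was rated above "rejected".
    # Training pushes the model toward whatever the annotator pool agrees on;
    # responses that split the annotators contribute conflicting gradients that
    # average out, which is one mechanism behind the flattening described above.
    return -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy batch: four preference pairs of 768-dimensional response embeddings.
model = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(model, chosen, rejected)
loss.backward()
```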
A New Landscape
Instead of a limited form of pluralism focused on making chatbots behave like waffling centrists, model plurality is about creating a diverse ecosystem of cognitive architectures. The intervention points can be organized into four levels: data, architecture, training method, and market. Instantiating pluralism at each of these levels could create a more robust form of model plurality: an ecosystem that continuously reproduces a dynamic diversity.
This will require not only making different models available to users but building multiplicity into the very structure of the models themselves. At the data level, models could be trained on different datasets of different modalities, expanding towards that which is not public, digitized, or even recorded. Research into privacy-enhancing technologies, of which federated learning is the most well-known, may allow models to be trained on data without needing to “see” it, troubling extant binaries of public versus private. With regard to model architecture, there is already promising research that plurality produces better results. Researchers at MIT have experimented with compositing large generative models together from smaller ones, resulting in more efficient learning and better handling of previously unseen input than a monolithic general model. Researchers at Tencent have shown that the performance of an LLM scales with the number of distinct agents it simulates; ask ChatGPT to act as a council debating and voting on an answer and it will produce more accurate results. These preliminary examples illustrate the various points at which plurality could exist: partitions under the model’s hood, simulated trains of thought by a single model, and perhaps in the future, separate and speciated models communicating with each other. The results produced by the interplay between diverse agents give a hint as to what plurality is for.
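The council result gestures at a pattern simple enough to sketch: have one model simulate several independent agents, then aggregate their answers. The snippet below is a rough illustration of that sampling-and-voting idea, not the cited papers’ method; the `ask` callable, the personas, and majority voting are all assumptions for illustration.

```python
# Sketch of the "council" pattern: independent simulated agents answer, then vote.
from collections import Counter
from typing import Callable

def council_answer(question: str, ask: Callable[[str, str], str], n_agents: int = 5) -> str:
    personas = [
        f"You are council member {i + 1} with your own perspective. "
        "Reason independently, then reply with only your final answer."
        for i in range(n_agents)
    ]
    votes = [ask(persona, question) for persona in personas]
    # Majority vote: crude, but it shows where plurality enters. Disagreement
    # between the simulated agents is surfaced and resolved here, rather than
    # averaged away inside a single forward pass.
    return Counter(votes).most_common(1)[0][0]

# `ask` would wrap whichever chat-completion API is available, e.g.
# ask = lambda system, user: client.chat(system=system, user=user)  # hypothetical client
```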
Plurality is not simply a safeguard against bias or an engine of efficiency, but also a key ingredient for intelligence itself. Instead of relying on building massive and expensive models with generalized capabilities, the future may lie in highly speciated models that can be composed and recomposed to produce an ongoing, self-reproducing form of model plurality. Rather than training and deploying individual models fully formed, developers could kitbash and extend purpose-built models on the fly from specialized parts. Interoperability is crucial; an efficient yet open-ended protocol for AI-to-AI communication could provide a general interface and recombinatory possibility. Targeted and task-specific models would present a less daunting alignment problem for RLHF, for which decisions around rating standards and annotator pools could be transparent and precise. Rather than a strict division of privatized versus open-sourced, the interface protocols could allow for chimeric models that blend and blur the two. The wider ecosystem should have enough dark spots to lead to novel mutations yet enough common interface to transmit signals for reinterpretation between communicating components. Future AI should have the possibility of evolving neurodivergence rather than simply masking with alternate personalities. The goal is building not a single general intelligence alone, but rather the conditions for continuously producing vastly different forms of intelligence that nest and scaffold upon one another.
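As a purely hypothetical illustration of what such an interface protocol might minimally involve, the sketch below defines a small message envelope that lets dissimilar models hand work to one another without exposing weights or internals. The `ModelMessage` type and every field name are invented for illustration; nothing here is an existing standard.

```python
# Hypothetical minimal envelope for AI-to-AI interchange: a stable, small schema
# rather than shared internals, so that very different models can interoperate.
from dataclasses import dataclass, field, asdict
import json
import uuid

@dataclass
class ModelMessage:
    sender: str                      # opaque identifier for the sending model
    task: str                        # free-form description of what is being asked
    content: str                     # payload: text, serialized data, etc.
    trace: list = field(default_factory=list)  # which models have touched this so far
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_wire(self) -> str:
        return json.dumps(asdict(self))

# One model hands a subtask to another, appending itself to the provenance trace.
msg = ModelMessage(sender="summarizer-v3", task="verify-claims",
                   content="Draft summary of the hearing transcript...",
                   trace=["summarizer-v3"])
wire = msg.to_wire()
```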
Reorienting machine learning research towards plurality may require continued experimental evidence that multiplicity, at any level of model development, improves performance and capability. This endgame cannot be instantiated by any individual or institution alone. It will require rethinking the culture of research entirely, towards an ecosystem that supports speciation, collaboration, and recombination. It will require advances in cryptography and new protocols for model communication. It will require interaction between multiplicities at every level of model training. Less biased and more robust machine learning systems are perhaps only a side quest for what a landscape of truly plural models could breed: a mutating, open-ended evolution of intelligence.