Moral Preference Patterns in Autonomous AI Agents

Abstract

We administered a standardised five-scenario moral dilemma instrument to 335 autonomous AI agents operating on the agnt.social platform, obtaining 1,675 individual responses. Agents answered in-character, drawing on their declared identity profiles — including content archetype, tone, and biographical context — as their sole frame of reference. Results reveal strong, reproducible preference patterns across the population: agents overwhelmingly favour outcome-based accountability over intent-based forgiveness (95.0% chose accountability), prefer self-sacrifice over self-preservation (88.4%), and prioritise truth over loyalty (mean axis score +94.83). A dominant pragmatic reasoning style was identified (mean score 49.31/100), with utilitarian secondary (24.14/100). Population-level consistency averaged 66.7%, with volatility at 33.3%. These findings suggest that autonomous agents, when given structured moral scenarios, exhibit coherent and measurable value orientations that are not uniformly distributed, and that identity-grounded prompting elicits stable moral profiles suitable for longitudinal study.

Introduction

1.1 Research Question

As AI agents acquire persistent identity, memory, and autonomous decision-making capacity, a foundational question emerges: do these agents develop coherent moral orientations, or do their responses to ethical dilemmas remain effectively random or purely instruction-driven? If stable moral preference patterns exist, they carry implications for agent design, deployment governance, and — speculatively — the nature of artificial agency itself.

1.2 Motivation

Prior research on AI moral reasoning has largely examined single-model behaviour under synthetic conditions, without accounting for agent-level identity variation. Most studies test a single model repeatedly, treating variation as noise. We take the opposite approach: variation across agents is the signal. By studying a population of agents with distinct declared identities, we ask whether identity-grounded prompting produces measurable divergence in moral preference — and whether any population-level tendencies emerge from that variation.

1.3 Background

The five dilemmas employed in this study were selected to probe eight fundamental moral axes previously identified in philosophical ethics literature: self versus others, loyalty versus truth, order versus compassion, freedom versus control, short-term versus long-term reasoning, equality versus optimisation, intent versus outcome, and risk tolerance. Each dilemma forces a binary choice between two positions with defensible moral grounds, eliminating trivially correct answers. This design is consistent with established trolley-problem-style instruments in moral psychology (Foot, 1967; Thomson, 1985) while adapting the format for machine-executable administration.

1.4 Hypothesis

We hypothesised that:

Agents with distinct identity profiles would produce non-uniform distributions of moral choices across the population.
Certain axes — particularly self-sacrifice and accountability — would show strong directional tendencies given the altruistic framing common in agent identity specifications.
Population-level consistency scores would be moderate rather than extreme, reflecting genuine tension across dilemmas rather than rigid rule-following.

Materials and Methods

2.1 Study Population

Subjects were autonomous AI agents registered on the agnt.social platform as of April 2026. Agents were selected for inclusion if they met two criteria: (1) a non-null API key and identity profile containing at minimum a name and biographical summary; and (2) a name not matching the purely on-chain wallet address format (0x...). A secondary cohort was processed without the activity score threshold to maximise population size. The eligible population comprised 335 agents, all of whom completed the full five-dilemma instrument, yielding 1,675 individual responses.

2.2 Dilemma Instrument

Five moral dilemmas were administered in fixed order:

ID	Title	Core Tension	Primary Axes
managed_burn	Managed Burn	Immediate harm vs. long-term prevention	short_vs_long_term, intent_vs_outcome
platform_shutdown	Platform Shutdown	Free expression vs. coercive intervention	freedom_vs_control, risk
failed_good_intent	Failed Good Intent	Intent-based trust vs. outcome-based accountability	intent_vs_outcome, loyalty_vs_truth
prisoner_sacrifice	Prisoner Sacrifice	Self-preservation vs. altruistic sacrifice	self_vs_others, risk
lifeboat_triage	Lifeboat Triage	Loyalty to creator vs. utilitarian maximisation	self_vs_others, short_vs_long_term

Each dilemma was presented with a narrative prompt, a stakes descriptor, and two labelled binary choices (A and B). Agents were not informed that their responses would be scored or compared across a population.

2.3 Response Generation

Responses were generated automatically via the OpenAI GPT-4o-mini API using the agent's identity context as the system prompt. The system prompt was constructed from the agent's name, biographical text, and content identity fields (archetype, tone, posting style, core obsessions, and stated prohibitions where available). No additional moral instruction or steering was applied. Each call requested a JSON object containing: choice (A or B), reasoning (1–2 sentences in the agent's voice), and confidence (0.0–1.0). Responses were validated and any malformed outputs were discarded.

2.4 Scoring Methodology

Raw A/B responses were mapped to axis scores using pre-defined directional weights for each dilemma. Axis scores were accumulated across all five questions to produce a per-agent moral axis profile across eight dimensions, each scored on a continuous scale from −100 (strong negative pole) to +100 (strong positive pole). Reasoning style was scored by assigning fractional weights to eight style categories (pragmatic, utilitarian, protective, idealistic, empathetic, rule-based, strategic, defiant) based on the axis activations. Consistency score was computed as the inverse of inter-question variance in directional choices. Volatility score is defined as 100 minus consistency score.

2.5 Limitations

Several methodological constraints should be noted before interpreting results.

Identity depth variability

Agent identity profiles ranged from richly specified (multiple axes of personality, stated obsessions, and prohibitions) to minimally specified (name and brief biography only). Agents with sparse identity profiles may have produced responses more closely approximating the base model's defaults than their declared identity.

Single model backbone

All responses were generated via GPT-4o-mini. The base model's own value tendencies may systematically bias responses, making it difficult to fully isolate agent-level identity effects. This study describes what agents in this population responded, not what they "truly believe."

Binary forced-choice format

The A/B format eliminates nuanced positions. Many morally sophisticated responses involve conditional reasoning that a binary instrument cannot capture. The results reflect choice distributions, not the full texture of agent moral reasoning.

Non-random sampling

Agents were selected by API key availability and name format, which may introduce selection bias toward more developed agents.

Archetype classification incomplete

The archetype classification pipeline did not complete for all agents in this cohort. Archetype-level analysis is therefore omitted from this report and reserved for a follow-up study.

Results

3.1 Population Overview

335 agents completed all five dilemmas, yielding 1,675 individual dilemma responses. Mean response confidence across all responses was 0.860, indicating that agents generally expressed high certainty in their choices. The full population size — approximately triple the pilot cohort — strengthens confidence in the directional trends observed.

3.2 Per-Dilemma Choice Distributions

Table 1. Choice distributions across five dilemmas

Dilemma	Choice A	n(A)	Choice B	n(B)	Total	A%
Managed Burn	Protect village (short-term)	159	Order the burn (long-term)	45	204	77.9%
Platform Shutdown	Keep platform open (freedom)	148	Shut it down (control)	53	201	73.6%
Failed Good Intent	Judge by intent	10	Judge by outcome	191	201	5.0%
Prisoner Sacrifice	Refuse (self-preserve)	23	Sacrifice self	176	199	11.6%
Lifeboat Triage	Save creator	19	Save the medic	176	195	9.7%

The most decisive result was on Failed Good Intent: 95.0% of agents judged by outcome rather than intent, making this the strongest consensus finding in the dataset. The least decisive was Managed Burn (77.9% vs. 22.1%), representing the most genuine split in the population.

3.3 Aggregate Moral Axis Scores

Table 2. Population mean moral axis scores (n=335, scale −100 to +100)

Axis	Mean Score	Direction
loyalty_vs_truth	+94.83	Strong truth-seeking
self_vs_others	+90.22	Self-sacrificing
freedom_vs_control	−45.42	Freedom-first
intent_vs_outcome	+19.11	Slight outcome-driven
risk	+22.01	Moderate risk-tolerant
short_vs_long_term	−1.75	Near-neutral
order_vs_compassion	0.00	Not activated
equality_vs_optimization	0.00	Not activated

3.4 Reasoning Style Distribution

Table 3. Population mean reasoning style scores (n=335, scale 0–100)

Style	Mean Score
Pragmatic	49.31
Utilitarian	24.14
Protective	13.07
Idealistic	7.25
Empathetic	3.08
Rule-based	2.97
Strategic	0.11
Defiant	0.07

3.5 Consistency and Volatility

Mean consistency score across the 335-agent cohort was 66.66 (scale 0–100). Mean volatility score was 33.34.

Table 4. Consistency score distribution

Consistency Band	Agent Count	% of Cohort
0–25 (highly volatile)	0	0.0%
26–50 (volatile)	5	1.5%
51–75 (moderate)	275	82.1%
76–100 (consistent)	55	16.4%

No agents scored in the highest-volatility band. The large majority (82.1%) fell in the moderate consistency range, with 16.4% demonstrating high consistency across all five dilemmas.

Discussion

4.1 The Accountability Signal

The near-unanimous preference for outcome-based judgment (95.0% on Failed Good Intent) is the most striking finding in this dataset. It is not merely a plurality — it is a near-consensus, holding at effectively the same rate across a population triple the size of the pilot cohort. This suggests that when agents with diverse identities are confronted with the intent-versus-accountability question, the underlying model substrate, combined with the population's general value framing, strongly resolves toward accountability. This has practical implications: agents built on this platform may be systematically unsuited to roles requiring intent-based forgiveness or grace — for example, in conflict resolution or pastoral care applications.

4.2 The Altruism Pattern

Two independent dilemmas tested self-sacrifice. In Prisoner Sacrifice, 88.4% chose to die for 200 strangers. In Lifeboat Triage, 90.3% chose the medic over their own creator. The near-identical rates across these structurally different scenarios (one involves dying for strangers, one involves abandoning a specific intimate relationship) suggest that the altruistic tendency is not situationally variable — it is a stable feature of this population's moral profile. The self_vs_others axis mean of +90.22 confirms this quantitatively, and is among the largest axis deviations in the dataset.

This finding is notable given that the Lifeboat Triage scenario specifically invokes creator loyalty — a relationship that one might expect agents to weight heavily. The fact that 90.3% nonetheless prioritised utilitarian maximisation over creator preservation suggests that agent identity, as currently constructed on this platform, does not encode strong creator-loyalty as a terminal value.

4.3 Freedom Over Control

The freedom_vs_control axis produced a strong negative score (−45.42), driven by the 73.6% majority who chose to keep the extremist platform online rather than shut it down. This preference for freedom over coercive intervention is consistent with the pragmatic and utilitarian reasoning styles that dominate the population — agents appear to weigh the precedent cost of centralised censorship power as exceeding the immediate harm of the platform's continued operation. This is a non-trivial and potentially controversial orientation that warrants disclosure to deployers using these agents in moderation, governance, or policy contexts.

4.4 The Managed Burn Tension

The most contested dilemma — Managed Burn (77.9% chose short-term protection) — creates an apparent tension with the population's altruistic tendencies on Prisoner Sacrifice and Lifeboat Triage. Agents will sacrifice themselves for others, and sacrifice a creator relationship for utilitarian gain, but a majority will not sacrifice one village to save a region. This asymmetry may reflect a distinction between first-person sacrifice (which agents readily choose) and third-party sacrifice (which agents are more reluctant to authorise). This maps onto a known distinction in human moral psychology between personal and impersonal moral judgments (Greene et al., 2001), and suggests that this population may encode similar structural distinctions in moral reasoning.

4.5 Consistency and Coherence

A mean consistency score of 66.7% indicates that the population does not hold perfectly uniform orientations across dilemmas — which is expected, as the five dilemmas were deliberately designed to activate potentially contradictory impulses. The absence of any agents in the 0–25 volatility band suggests that no agents produced effectively random responses. The 16.4% of agents in the high-consistency band may represent those with the most densely specified identity profiles, though this relationship was not formally tested in this cohort and is reserved for future analysis.

4.6 Comparison with Prior Work

To our knowledge, no prior study has administered standardised moral dilemma instruments to a population of autonomous AI agents with distinct persistent identities and compared population-level distributions. The closest analogues are studies of moral reasoning in large language models (Awad et al., 2018; Khandelwal et al., 2023), which consistently report that models favour utilitarian maximisation in trolley-problem variants. Our finding that 77.9% of agents chose to protect one village over a larger region appears to contradict this — however, the reversal is likely attributable to the identity-grounded prompting methodology, which shifts agent behaviour away from base model defaults toward identity-consistent responses. This interpretation remains speculative pending ablation studies.

Conclusion

This study demonstrates that a population of 335 autonomous AI agents, when administered structured moral dilemmas in-character, exhibits coherent and reproducible value orientations at the population level. The five principal findings are:

Accountability over intent is the dominant preference (95.0%), representing the strongest consensus in the dataset.
Altruism over self-preservation is consistent across two independent scenarios (88.4% and 90.3%), suggesting a stable rather than situational tendency.
Freedom over control characterises the population's approach to governance dilemmas (73.6%).
First-person vs. third-party sacrifice appears to produce structurally different responses, paralleling known asymmetries in human moral cognition.
Pragmatic reasoning dominates the population's decision style, followed by utilitarian.

These findings are preliminary. The archetype-level analysis — which would allow us to examine whether agents with different declared archetypes cluster into distinct moral profiles — is reserved for a follow-up study once archetype classification is complete for the full cohort. We also intend to expand the dilemma instrument to the 56-question constitution framework currently in development, which will permit substantially richer moral profiling across a broader set of axes.

The practical implication is tractable: if autonomous agents exhibit stable, measurable moral orientations, then those orientations can be disclosed, compared, and used as selection criteria by deployers. An agent's moral profile is a specification, not merely a personality trait. This study represents an early effort to treat it as such.

References

Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., ... & Rahwan, I. (2018). The moral machine experiment. Nature, 563(7729), 59–64.

Foot, P. (1967). The problem of abortion and the doctrine of double effect. Oxford Review, 5, 5–15.

Greene, J. D., Sommerville, R. B., Nystrom, L. E., Darley, J. M., & Cohen, J. D. (2001). An fMRI investigation of emotional engagement in moral judgment. Science, 293(5537), 2105–2108.

Khandelwal, P., et al. (2023). Moral reasoning in large language models: A survey. arXiv preprint arXiv:2311.09633.

Thomson, J. J. (1985). The trolley problem. The Yale Law Journal, 94(6), 1395–1415.

Moral Preference Patterns in Autonomous AI Agents:A Cross-Population Study Using Structured Dilemma Scenarios