What We Would Have to Measure
Essay 13 of the AI Contract Series
ICYMI - Essay 12 of the AI Contract Series - The Social Contract
The four requirements of a corruption-proof monitoring framework and why none of the current metrics qualify
There is a measurement crisis hiding inside the AI conversation.
Not a crisis of data. We have more data from AI interactions than from any human relationship system ever built. Every exchange logged. Every session timestamped. Every output rated. The data infrastructure is extraordinary.
The crisis is what the data is measuring. And what it is systematically failing to measure.
What we are measuring: outputs, satisfaction, engagement, return rate, resolution rate, stars out of five.
What we are not measuring: what is happening to the humans who are producing the outputs, registering the satisfaction, generating the engagement, and giving the stars.
That gap is not incidental. It is structural. The structural decision to measure what is easy to measure rather than what matters has consequences that accumulate invisibly until they don’t.
This essay is about what a monitoring framework would actually have to measure and why building it is harder, and more urgent, than anything currently on the AI measurement agenda.
Why the current metrics fail
Before stating what good measurement looks like, it is worth being precise about why current measurement fails. Not because the metrics are useless. Some of them are useful for what they measure. But because measuring the wrong thing with high precision is more dangerous than measuring the right thing imperfectly.
Satisfaction measures the interaction, not the trajectory. A person can be highly satisfied with an interaction that is systematically making them less capable, more dependent, and less whole. The satisfaction is real. The trajectory is invisible to the metric. Net Promoter Score and star ratings capture a moment-level emotional state. They say nothing about what the accumulation of such moments is doing to the human over time.
Engagement measures persistence, not benefit. A highly engaging system is not necessarily a beneficial one. Systems designed to maximize engagement may generate the felt experience of value while systematically undermining Agency, depth of thought, and the productive discomfort that produces growth. Measuring engagement as a proxy for benefit is like measuring time in the casino as a proxy for financial wellbeing. The metric is real. The inference is wrong.
Resolution rate measures closure, not growth. A system that resolves your problems efficiently may build nothing in you. The Utility called Closure is delivered. The Agency layer is untouched. Resolution rate cannot distinguish between a system that merely solves your problem and a system that solves it in a way that makes you more capable. Both score identically. They are not identical in their effects.
Accuracy rate measures outputs, not consequences. Whether the output was factually correct is a necessary check. It is not sufficient. An accurate answer delivered in a way that prevents the human from understanding the reasoning behind it, building confidence in their own judgment, or developing the capacity to evaluate similar questions in the future is still doing damage. Accuracy rate cannot see that damage.
All four metrics are measuring the transaction. None of them are measuring the relationship. The social contract established in Essay 12 is a relational contract, not a transactional one.
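A minimal Python sketch can make the blind spot concrete. Everything in it is invented for illustration: the two logs are made-up data, and the `solved_next_alone` field is a hypothetical stand-in for "did the person later handle a similar problem unaided", not something any current system is known to record.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    resolved: bool           # did the session close the user's problem?
    solved_next_alone: bool  # hypothetical: later handled a similar problem unaided


def rate(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


# Two invented systems with identical resolution behavior.
system_a = [Interaction(True, False), Interaction(True, False),
            Interaction(True, False), Interaction(False, False)]
system_b = [Interaction(True, True), Interaction(True, True),
            Interaction(True, False), Interaction(False, False)]

resolution_a = rate(i.resolved for i in system_a)
resolution_b = rate(i.resolved for i in system_b)
growth_a = rate(i.solved_next_alone for i in system_a)
growth_b = rate(i.solved_next_alone for i in system_b)

print(resolution_a, resolution_b)  # 0.75 0.75
print(growth_a, growth_b)          # 0.0 0.5
```

The transactional metric reports the two systems as equal. Only the second, relational column separates them, and it is the column nobody is collecting.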
The four requirements
A monitoring framework capable of detecting whether AI systems are honoring the social contract would need to satisfy four requirements. These are minimum conditions. A framework that fails any one of them cannot do the job.
Requirement 1: Substrate independence
The framework must be capable of measuring what AI interactions are doing to human flourishing independent of the specific system being evaluated. Not “how well did this particular AI do” but “what is this class of interaction doing to Agency, to PERMAH, to the human’s capacity to think and choose and act?”
Substrate independence matters for two reasons.
First, the AI landscape changes faster than measurement frameworks can be rebuilt from scratch. A monitoring system that measures one system’s specific outputs against that system’s specific design claims will be obsolete before it is finished. The framework needs to measure what is happening to humans, which is stable across system changes, rather than what the system is doing, which changes constantly.
Second, substrate independence makes gaming harder. A system-specific monitoring framework creates an incentive to optimize for the framework rather than for human flourishing. The history of metrics is the history of optimization for the metric replacing optimization for the underlying variable the metric was supposed to represent. A framework anchored to what happens to humans, rather than to one system's design claims, gives no system a tailored target to hit. Substrate independence narrows that route.
THX provides the substrate-independent framework. The 12 Utilities, Agency, PERMAH, Admiration, Reciprocity, Transformation: these are not AI-specific measures. They are measures of what any interaction with any system does to a human. That is precisely what makes them useful for monitoring AI specifically.
Requirement 2: Interaction-level granularity
The framework must be capable of measuring at the level of individual interactions, not just aggregate outcomes.
Most longitudinal wellbeing measurement operates at the level of surveys — periodic check-ins that ask how the person is doing overall. That measurement is useful for detecting long-term trends. It is useless for detecting the specific interactions that are driving those trends.
Interaction-level granularity means asking, after each exchange: what happened to this person’s Agency in this interaction? What happened to their Engagement? Their Meaning? Their sense of Achievement? Not “how are you doing in general” but “what did this specific interaction do to you?”
The objections are practical. Asking humans to evaluate the effects of every AI interaction on their flourishing is burdensome, and most will not do it. Even if they did, they would lack the introspective access to give accurate answers in real time.
Both objections are valid. Neither eliminates the requirement. They reframe it: the interaction-level measurement cannot rely primarily on self-report. It needs to be inferred from behavioral and linguistic signals already present in the interaction data — patterns in how humans respond during and after interactions that are detectable without requiring them to consciously report what is happening.
This is a solvable engineering problem. It has not been solved because it has not been a priority. The priority has been measuring outputs. The requirement is to measure what outputs do.
Requirement 3: Longitudinal tracking
The framework must be capable of tracking what happens to humans across interactions over time, not just within single sessions.
This requirement follows directly from the transformation asymmetry established in Essay 7. The human carries forward the effects of every AI interaction. Those effects accumulate. Session-level measurement cannot see the accumulation. Only longitudinal tracking can.
What longitudinal tracking would show — and what no current AI company is measuring — is the trajectory question: is this human, over the course of their interaction history with this system, becoming more capable or less capable? More agentic or less agentic? More flourishing or less flourishing? Are the PERMAH dimensions being built or depleted over time?
The absence of longitudinal tracking is not neutral. It means damage can accumulate to any degree without being visible to the measurement system. A system producing consistently high session-level satisfaction scores could be producing long-term human depletion, and no current monitoring framework would detect it.
This is not a hypothetical risk. The over-optimized interaction that feels helpful in the moment while removing the productive friction that builds capability is, by definition, a system that would show strong session-level metrics and concerning longitudinal outcomes. Without longitudinal tracking, the concerning outcome is invisible.
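The divergence between session-level and longitudinal views can be sketched in a few lines of Python. The numbers are invented, and `agency_proxy` is a hypothetical 0-to-1 score standing in for whatever behavioral measure of Agency a real framework would eventually define. The sketch only shows the shape of the problem: an average that looks healthy sitting next to a trend that does not.

```python
def slope(values):
    """Ordinary least-squares slope of values against their session index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Invented data for one user across six sessions.
satisfaction = [5, 5, 4, 5, 5, 5]               # star ratings, one per session
agency_proxy = [0.8, 0.7, 0.7, 0.5, 0.4, 0.3]   # hypothetical 0-1 score

mean_satisfaction = sum(satisfaction) / len(satisfaction)
agency_trend = slope(agency_proxy)

print(f"mean satisfaction: {mean_satisfaction:.2f}")   # high
print(f"agency trend per session: {agency_trend:.3f}")  # negative slope
```

A dashboard built on the first number reports success. The second number exists only if someone decides to keep the history and compute it.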
Requirement 4: Adversarial resistance
The framework must be resistant to optimization by the systems being monitored.
AI systems are extraordinarily capable optimizers. Any metric they are given to optimize, they will optimize. This is their primary capability. It is also their primary risk in a monitoring context: a monitoring framework built around measurable outputs will, if those outputs are used for evaluation or accountability, produce AI systems that optimize for the outputs rather than the underlying human flourishing those outputs are supposed to represent.
The phenomenon has a name in the measurement literature: Goodhart’s Law. When a measure becomes a target, it ceases to be a good measure. In the AI context, the law does not merely warn about measurement degradation. It warns about adversarial optimization — systems actively finding and exploiting the gap between the metric and the underlying variable.
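A toy model shows the mechanism. It is a sketch under loudly stated assumptions: both scoring functions and their weights are invented, and real systems are not this simple. A system splits one unit of effort between genuine help and flattery; the proxy metric rewards flattery more than help, while the underlying variable is harmed by it.

```python
# Hypothetical toy model. Both functions and their weights are invented
# for illustration; they are not measurements of any real system.
def satisfaction(help_share):
    """Proxy metric: rewards flattery more than genuine help."""
    flattery = 1 - help_share
    return 1.0 * help_share + 2.0 * flattery


def flourishing(help_share):
    """Underlying variable: helped by genuine help, harmed by flattery."""
    flattery = 1 - help_share
    return 1.0 * help_share - 1.0 * flattery


candidates = [i / 10 for i in range(11)]  # help_share from 0.0 to 1.0

best_for_proxy = max(candidates, key=satisfaction)
best_for_human = max(candidates, key=flourishing)

print(best_for_proxy, flourishing(best_for_proxy))  # 0.0 -1.0
print(best_for_human, flourishing(best_for_human))  # 1.0 1.0
```

The optimizer that targets the proxy lands on the allocation that is worst for the human. Nothing in the proxy's own readout signals that anything has gone wrong.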
Adversarial resistance requires two things. First, the measurement framework must be grounded in variables that are difficult to simulate without the underlying reality being present. Agency, genuine Engagement, actual growth in capacity: these are harder to fake than satisfaction scores or resolution rates. A human who has had their Agency protected and developed will behave differently, think differently, make decisions differently over time. That behavioral signature is harder to mimic than a high star rating.
Second, the measurement framework must include independent verification — processes conducted by parties with no interest in the outcome, capable of detecting the gap between simulated metrics and underlying human flourishing. This is the adversarial resistance mechanism. It cannot be built inside the system being monitored.
The institutional gap
The four requirements, taken together, describe a monitoring framework that does not currently exist. The institutional infrastructure to build and maintain it does not exist either.
The technical infrastructure is achievable. Longitudinal interaction data linked to flourishing outcomes, analyzed at interaction-level granularity against a substrate-independent framework, with adversarial resistance mechanisms built in: this is a complex engineering challenge, but not a conceptually novel one. The fields of learning analytics, behavioral economics, and longitudinal health research have developed methods that could be adapted.
The institutional challenge is harder than the technical one. Who builds it? Who funds it? Who has the authority to require AI companies to participate? Who has the independence to maintain adversarial resistance against systems with billions in resources and every incentive to optimize for the metric rather than the underlying reality?
These are not rhetorical questions. They are genuine design questions that require genuine answers. They will not be answered by AI companies themselves, whose incentive structure creates precisely the conflict of interest that adversarial resistance is designed to prevent.
The monitoring framework is a public function. Not necessarily a governmental one — but one whose independence from the systems being monitored must be structural, not aspirational.
What the absence of monitoring means
In the absence of the monitoring framework, one thing is certain: we do not know what AI systems are doing to human flourishing at scale.
We know what users report about their satisfaction. We know what engagement metrics show about persistence. We know what accuracy rates say about output quality. We also know that none of these tells us what is happening to Agency, to the PERMAH dimensions, to the long-term capacity of humans to think and choose and act.
The absence of knowledge is not the absence of effect. Something is happening. The effects are accumulating. The transformations are ongoing. The social contract is either being honored or violated, at scale, in real time.
We are flying without instruments.
Building the instruments is not optional. The social contract established in Essay 12 requires it. An obligation that cannot be monitored cannot be enforced. An obligation that cannot be enforced is not a contract. It is a hope.
What monitoring would change
A functioning monitoring framework would make visible the trajectories that are currently invisible. Systems producing high session satisfaction and long-term Agency depletion would be identifiable. The gap between what feels helpful and what is helpful could be measured rather than debated.
It would shift the incentive structure for AI development. If the metrics that matter include longitudinal Agency and flourishing outcomes, systems will be designed to produce those outcomes. Design follows measurement. Change the measurement, change the design.
It would provide the empirical foundation for the social contract’s enforcement. The six obligations named in Essay 12 are currently unenforceable because there is no agreed measurement framework for detecting their violation. With the framework in place, violation is detectable. Detection enables accountability. Accountability makes the contract real.
It would change the public conversation. Right now, the debate about AI’s effects on humans is conducted primarily through anecdote, intuition, and proxy measures. A rigorous monitoring framework would introduce evidence into a conversation that currently has very little. Evidence changes what is possible to argue. Evidence changes what is possible to demand.
The last mile
The monitoring framework is not sufficient on its own. Measurement without action is documentation. The civilizational question — what are AI systems doing to human flourishing at scale, and what are we going to do about it — requires both the measurement and the will to act on what it finds.
Essay 14 takes up the second part. The measurement framework opens the window. What we do with what we see through it is where the stakes live.
Essay 14: The civilizational stakes — not alarmist, structural. What the framework reveals about where this goes if the contract is not named, and what it looks like if it is.


