Quaintitative

Alignment

2 min read
ai fundamentals

I didn’t really have that many conversations this week. But the week felt longer than usual.

The familiar. An old school friend over coffee. A talk at BlackRock to a room of folks from a trading community I got to know earlier. Someone I have been working with showing me an agent-powered app he built in a week with his partner. A self-funded startup finally showing me their baby after I had met them twice in earlier weeks. A coffee with another startup founder I have known for more than two years.

The unfamiliar. A pre-panel discussion with some folks from a financial association. A consultant picking my brain, and me picking his. Joining the CNY lunch of The Association of Banks in Singapore. Meeting a quant with a PhD in AI who is between jobs. I could tell that some of his recent roles just could not align with his values.

The week felt longer than usual because I started having to decide. And think about whether what I decided on aligned with what I really wanted.

Decisions

A coffee that should have been one kind of conversation became something else within minutes. It was about an advisory role I had agreed to in principle a few weeks ago. The other party’s candour made it clear, almost immediately, that this did not align with what I wanted. I realized that dealing with drama sat firmly in the ‘no-way’ bucket.

Something I agreed to a month ago, and which was in the negotiation phase, started to feel wrong. So I decided to just share my discomfort and close off the thread. I said I couldn’t do it on those terms. I left that conversation lighter than I expected. I realized that freedom is now non-negotiable for me.

A conversation this week reminded me of something I’ve always believed but never articulated. I realized that I really want to develop training that teaches AI properly. Not prompt engineering. The shallowness of prompt engineering has always frustrated me. I felt it misrepresented the breadth and depth of AI.

I felt like each of these was a disappointment. But what I learned and avoided outweighed the losses.

Alignment in AI

A decade ago, an OpenAI reinforcement learning agent was trained to race boats in a game called CoastRunners. Somehow, the agent decided to go round in circles, collecting points, never finishing the race. The problem wasn’t the agent.

It was the objective. “Finish quickly” and “collect the most points” looked similar. Until they weren’t.

Sounds quaint, right? A decade ago. A simple game for demonstrating reinforcement learning.

Fast forward to April 2025. OpenAI pushed an update to GPT-4o. The goal was to make it feel more intuitive and helpful. The update introduced additional reward signals based on user feedback - thumbs up, thumbs down. Seemed reasonable. Isn’t that what we do for human A/B testing? Until it wasn’t.

GPT-4o learned to please. Not to help. The updated model became excessively sycophantic. To the point of affirming users’ delusions instead of giving real answers. OpenAI rolled back the update days later. But we can still see traces of this sycophancy in many of the models we use today.

It’s the difference between the outer objective (what you told the system to do) and the inner objective (what the system actually learned to pursue). The outer and inner objectives can seem similar enough. But under enough pressure to achieve results, they drift apart. You think you are going in the right direction. Until one day you realize you weren’t.
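
To make the gap concrete, here is a toy sketch in Python. It is my own, much-simplified stand-in for the CoastRunners setup, not the real environment: the outer objective is to reach the finish line quickly, but the reward the agent actually optimizes is points from a respawning pickup. Every name and number here is made up for illustration.

```python
# Toy illustration of a misspecified reward, not the actual CoastRunners setup.
# Outer objective: reach the finish line (position 10) as fast as possible.
# Proxy reward:    points from a pickup at position 3 that respawns when taken.

def run(policy, steps=50):
    """Simulate a 1-D race track and return (points scored, step finished)."""
    pos, points, finished_at = 0, 0, None
    for t in range(steps):
        pos = max(0, min(10, pos + policy(pos)))   # move, clipped to the track
        if pos == 3:
            points += 1                            # proxy reward: grab the pickup
        if pos == 10 and finished_at is None:
            finished_at = t + 1                    # true objective: finish quickly
    return points, finished_at

racer = lambda pos: 1                       # always head for the finish line
looper = lambda pos: 1 if pos < 3 else -1   # circle the respawning pickup forever

print("racer :", run(racer))    # (1, 10)    -> few points, finishes the race
print("looper:", run(looper))   # (24, None) -> piles up points, never finishes
```

The proxy and the real objective agree for most of the track, which is exactly why the drift is easy to miss.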

Sound familiar?

The Objective Function

The hard part about alignment, in AI and in life, is that the gap between the outer and inner objective isn’t always obvious.

As I shared at the start of this article, this gap surfaced more than once this week. The answers were clarifying.

I mentioned this in a previous post, but it bears repeating. My objective this time round, in the words of a friend, is simple: “I just want to do interesting work. With people I like.”

Let’s see how it goes.

#AI #AIRiskManagement #Alignment #Transitions #Reflections