The Reward Signal You Live By

I first heard of the contextual bandit algorithm a couple of years back as an undergrad, and never gave it much thought. Recently, I started working on reinforcement learning for the thrill of picking up my HRI research again and shaking off the dusty-old-project feeling. I became enthralled, but not for the reason you would think (not to sound unserious): it wasn’t the maths that grabbed me first, but the irony of how our day-to-day lives have started to look like a living demo of what this algorithm quietly assumes. A virtual world, not the sci-fi kind, just the one we have built with screens, pings, feeds, and tiny choices that somehow add up to a whole personality.
For those who don’t know what a contextual bandit is, here is the layman’s version: you are in a situation, you have a few options, and the best option depends on the current context: the time, the mood, and perhaps, in this case, the setting. You pick one option, and you only get feedback on the one you picked. No replay, and not even a counterfactual like “what if I had chosen differently?” At every single turn, you get exactly one chance: just context, action, reward. Over time, though, the algorithm learns what tends to work in what kind of moment. And that’s where it starts to feel familiar, because life is basically that. We make choices with partial feedback, and we learn patterns from whatever results shout the loudest; call those your model weights (wink wink, ML folks).
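The context-action-reward loop above can be sketched in a few lines. This is a minimal epsilon-greedy contextual bandit, not any particular library's implementation; the contexts, arms, and reward probabilities are all invented for illustration.

```python
import random

CONTEXTS = ["morning", "evening"]
ARMS = ["coffee", "tea", "water"]

# Hidden reward probabilities per (context, arm). Invented numbers:
# the learner never sees this table, only one sampled reward per pull.
TRUE_REWARD = {
    ("morning", "coffee"): 0.9, ("morning", "tea"): 0.4, ("morning", "water"): 0.2,
    ("evening", "coffee"): 0.1, ("evening", "tea"): 0.7, ("evening", "water"): 0.5,
}

def run_bandit(steps=20000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {(c, a): 0 for c in CONTEXTS for a in ARMS}
    values = {(c, a): 0.0 for c in CONTEXTS for a in ARMS}  # running mean reward

    for _ in range(steps):
        context = rng.choice(CONTEXTS)          # the situation you find yourself in
        if rng.random() < epsilon:              # explore: try a random option
            arm = rng.choice(ARMS)
        else:                                   # exploit: best option seen so far
            arm = max(ARMS, key=lambda a: values[(context, a)])
        # Feedback arrives only for the arm actually pulled; no counterfactuals.
        reward = 1.0 if rng.random() < TRUE_REWARD[(context, arm)] else 0.0
        counts[(context, arm)] += 1
        values[(context, arm)] += (reward - values[(context, arm)]) / counts[(context, arm)]

    # What the learner now believes is best in each context.
    return {c: max(ARMS, key=lambda a: values[(c, a)]) for c in CONTEXTS}

print(run_bandit())
```

Despite only ever seeing the outcome of the single option it picked, the learner ends up recommending different arms in different contexts, which is the whole trick.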
The part I can’t shake off is this: a bandit will optimise whatever reward signal it is given. If the signal is shallow, it can become extremely good at shallow outcomes, and it won’t even know it is missing anything. So if my reward, which is to say my gratification, is quick relief, being noticed, staying busy, or avoiding silence, then my habits train toward those things with frightening efficiency. No villain required, just cold, hard repetition. Which is why I have started to think about reward as something I would rather choose on purpose, not something I inherit by default. Because some rewards are noisy and immediate, quick and dirty, and some are quiet and slow, and the quiet ones tend to be the ones that make a life feel like it’s actually there.
I hope that you make good choices every day. Truly.
SAO
