This IMO is the pathway to AGI, as it combines all sense-plan-do data into a time coordinated stream and mimics how humans transfer learning to children via demonstration recording and behavior authoring.
If we can create robotics with locomotion and dexterous manipulation, egocentric exploration, and a behavior authoring loop that uses human behavior demonstration and trajectory reinforcement - well, we’ll have the AI we’ve all been talking about.
Probably the most exciting area of research that most people don’t know or care about.
That’s why head-mounted, all-day egocentric AR is so important - it gives eyes, ears, and sense perception to our learning systems, with human-directed egocentric behaviors guiding the whole thing. Just like pushing your kid down the street in the stroller.
Just to make sure I understand your excitement: we need guinea pigs, ahem, people to wear 'head-mounted all-day egocentric AR' with who knows how many integrated sensors for long stretches on end, so we can finally get to our fabled A.G.I.?
That is some B.F. Skinner level future we're aiming for--only this time around, humans become the fully surveilled 'teaching machine'.
Well no...not guinea pigs. But correct conceptually - if it's opt-in only and perfectly transparent to everyone what is happening, which in this specific case of Aria it absolutely is.
If we want to make machines with equivalent or better capacity than humans, we have to transfer the process for scientific discovery, including the sum of our cognitive capacity and knowledge, to them.
If you quantify human adult-infant interactions, it boils down to human adults introducing learning trajectories, labeling input data, and biasing weights with reinforcing behaviors for new reinforcement agents. If we can re-build the infrastructure to do precisely that, where the agent is in the place of the infant and society is in the place of the "human adult", then we will have re-built at scale the process for human development.
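As a concrete (and deliberately over-simplified) sketch of that loop - the task, the action names, and the teacher signal below are all hypothetical, not anyone's actual system - here is a bandit-style agent whose reward is shaped by an "adult" teacher reinforcing a desired behavior:

```python
import random

ACTIONS = ["babble", "point", "grasp"]
TRUE_REWARD = {"babble": 0.1, "point": 0.3, "grasp": 0.9}  # environment payoff

def teacher_feedback(action):
    """The 'human adult': labels behavior, biasing the agent's updates."""
    return 0.5 if action == "grasp" else -0.2  # reinforces the desired trajectory

def train(steps=5000, lr=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}  # the agent's "weights"
    for _ in range(steps):
        # epsilon-greedy exploration (the infant trying things out)
        a = rng.choice(ACTIONS) if rng.random() < eps else max(q, key=q.get)
        # environment reward plus the adult's shaping signal
        r = TRUE_REWARD[a] + teacher_feedback(a)
        q[a] += lr * (r - q[a])
    return q

q = train()
print(max(q, key=q.get))
```

The point of the toy is only the structure: the environment alone would eventually teach the agent, but the teacher's shaping term biases which trajectory gets reinforced, which is the role the "human adult" plays in the analogy above.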
The best way we know how to do this today is implementing transfer learning approaches from basic human developmental research. I started down this road back in 2010, trying to follow the work of Frank Guerin out of the University of Aberdeen [1] [2].
But what about observer effects? People act differently when recorded, and rarely do we catch humans acting natural when knowingly observed (some of the early 24h/day Twitch streamers come to mind). And what happens once trials are done? How would people feel about their actions becoming part of a technology potentially able to replace them?
Even when this barrier can be overcome (i.e. people become accustomed to wearing these devices), I worry about the opt-in nature of it. We've yet to see a disruptive technology adhering to this principle through-and-through, and if current learning efforts are anything to go by, training data is not something companies want to willingly let go or lose out on.
Taken together, this path has the potential to be quite coercive if no strong guarantees or safeties can be upheld, especially if early exciting trials generate an interest boom similar to the one we're seeing right now in the LM space.
This is a great point, and why I advocate so vociferously that all of these systems and future organizations going in this direction should be cooperatively owned, based on mutual, voluntary, democratic principles, rather than owned by a small subset of wealthy individuals in your standard business construct.
That would be a welcome future, indeed. And hopefully, not just upheld in some regions of the world, but everywhere where AR-backed AGI gets off the ground. And this governing structure would need to work for some decades at least. Which would be quite a feat.
That still leaves my first question regarding observer effects and how people would respond to such a technology on an individual level. It would have the capacity to reshape behaviour towards preferential and/or optimal interactions, would it not? Seeing as we do not want to reinforce models with 'erroneous' interactions?
TBH I don't know, and I think there's a real chance that there are going to be actual changes in how people behave as a result - which, if integrated like many other social changes, will become another layer in the fabric of society, displacing another. For better or worse I think it's just an exposure thing.
You are persistently surveilled in London and Shanghai and New York City - yet people act just as unhinged as they did before cameras were installed.
I'm not sure what other data acquisition/technology arc is possible though, and open to ideas.
> You are persistently surveilled in London and Shanghai and New York City - yet people act just as unhinged as they did before cameras were installed.
Unhinged people do, but ordinary people? I'd be willing to bet that normal people who are in areas where they are aware they're on camera don't behave as their normal selves. It's hard to see how it could be otherwise.
Is that model (parents giving labeled input and affecting some weights in the child’s head with reinforcement) really a good fit for the reality of how people learn to do things?
It’s my understanding (though I haven’t looked at the primary sources myself) that one of the facts that inspired Chomsky’s language theories and work for instance, was that when you quantify the information communicated by parents to language learning children, there’s actually not very much of it. Not nearly enough to support that what’s going on is anything like the kind of learning embodied by machine learning models.
If that’s true, and there is something of how to act intelligently / humanly already encoded in children (maybe genetically?) and not communicated by this sort of training, wouldn’t ignoring that and trying to get to it purely in this machine learning way be.. at least not at all informed by evidence / examples of it working in nature?
So this is extremely complicated and nuanced with respect to intelligence acquisition, and I don’t think there’s a definitive right or wrong answer.
I certainly acknowledge my own bias here. However, with respect to what Chomsky discusses, I make the distinction that most of the “code/data/information” that you need in order for the language capacity to develop is actually embedded in our biological mechanical systems. That is to say, if you were to take a human infant and never expose it to another human generating sounds for language, the infant would still develop some sort of sound-based communication system. We see this with feral children, mute children, and deaf children. They still have a verbal function, even if it’s not connected to any semblance of coherency.
So in that sense it’s like you’re given all of the building blocks for language out of the gate biologically and then the people who are around you tell you how to assemble them into some thing that is functional. This is why different languages have different rules yet language acquisition is consistent across cultures.
This is why I am insistent on holistically understanding the computing infrastructure and systems, because the sensors, processors, etc. are the equivalent of our cells, genes, muscles, bones, etc. Most people don’t think about computing systems and generally intelligent systems this way.
If you go back and look at the work of Wiener and early cybernetics, it does discuss a lot of this; however, after cybernetics was absorbed into artificial intelligence, which was in turn absorbed into computer science, the field stopped looking holistically at systems of systems, unfortunately, in the general case.
And I would argue that all of machine learning is currently moving in the direction I am describing, where it is exposure to the frequency of correlated data that gives you your effective understanding of the world and the ability to predict future states. That’s what I mean when I say multi-modal is “sequential and consistent in time” with respect to causal action.
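One plain reading of “sequential and consistent in time” is that independently sampled sensor streams get merged into a single, causally ordered timeline that a learner can consume. A toy sketch of that merge - the stream names, rates, and sample labels here are all hypothetical:

```python
def merge_streams(**streams):
    """Merge {name: [(timestamp, sample), ...]} streams into one
    time-ordered list of (timestamp, name, sample) events."""
    events = [
        (t, name, sample)
        for name, series in streams.items()
        for t, sample in series
    ]
    # Tuples sort by timestamp first, so the result is one coordinated stream.
    return sorted(events)

# Example: three sensors sampling at different, unaligned rates.
timeline = merge_streams(
    camera=[(0.00, "frame0"), (0.33, "frame1")],
    imu=[(0.00, "accel0"), (0.10, "accel1"), (0.20, "accel2")],
    audio=[(0.05, "chunk0"), (0.25, "chunk1")],
)
for t, name, sample in timeline:
    print(f"{t:.2f} {name}: {sample}")
```

Real devices have to deal with clock drift and per-sensor latency on top of this, but the output shape is the same: every modality interleaved on one clock, so the learner sees cause before effect.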
As with most technology, there are plusses and minuses.
If used correctly ("if" is doing lots of heavy lifting here), this type of system - eye gaze, IMU & microphones - would provide much, much better hearing aids than the current state of the art, at a much cheaper price (go look up the price of hearing aids, it's _extortion_).
Using gait analysis, it would be possible to predict when someone is prone to falls, allowing much longer independence for older people.
Assuming that it's possible to understand who you are talking to and what they said, you could mitigate and support dementia much more than we can now.
However.
You also have a vast network of headsets with highly accurate, always-on location, able to see what you are looking at, who you talk to, what you say, and in some cases what you feel about things.
Add in some basic object/facial recognition and you have an authoritarian's wet dream.
Now is the time to regulate, but alas, that won't happen.
Applications of embodied AI are very interesting. Additionally, a lot of hard problems are increasingly being solved in simulation - see Wayve's GAIA world model.
I’ll “yes and” here… beyond AR/VR, a more powerful use case is multi-modal learning (with RL), which is what Meta is probably the leader in, IMO.
Example paper here: “Towards Continual Egocentric Activity Recognition: A Multi-modal Egocentric Activity Dataset for Continual Learning”
https://arxiv.org/abs/2301.10931