Yann LeCun: AI Doesn’t Need Our Supervision
When Yann LeCun gives talks, he's apt to include a slide showing a famous painting of a scene from the French Revolution. Superimposed over a battle scene are these words: "THE REVOLUTION WILL NOT BE SUPERVISED."
LeCun, VP and chief AI scientist of Meta (formerly Facebook), believes that the next AI revolution will come about when AI systems no longer require supervised learning. No longer will they rely on carefully labeled data sets to provide the ground truth they need to understand the world and perform their assigned tasks. AI systems need to be able to learn from the world with minimal help from humans, LeCun says. In an email Q&A with IEEE Spectrum, he talked about how self-supervised learning can create more robust AI systems imbued with common sense.
He'll be exploring this theme tomorrow at a virtual Meta AI event titled Inside the Lab: Building for the Metaverse With AI. That event will feature talks by Mark Zuckerberg, a handful of Meta's AI scientists, and a discussion between LeCun and Yoshua Bengio about the path to human-level AI.
You've said that the limitations of supervised learning are sometimes mistakenly seen as intrinsic limitations of deep learning. Which of these limitations can be overcome with self-supervised learning?
Yann LeCun: Supervised learning works well on relatively well-circumscribed domains for which you can collect large amounts of labeled data, and for which the types of inputs seen during deployment are not too different from the ones used during training. It's difficult to collect large amounts of labeled data that are not biased in some way. I'm not necessarily talking about societal bias, but about correlations in the data that the system should not be using. A famous example of that is when you train a system to recognize cows and all the examples are cows on grassy fields. The system will use the grass as a contextual cue for the presence of a cow. But if you now show a cow on a beach, it may have trouble recognizing it as a cow.
Self-supervised learning (SSL) allows us to train a system to learn good representations of the inputs in a task-independent way. Because SSL training uses unlabeled data, we can use very large training sets and get the system to learn more robust and more complete representations of the inputs. It then takes a small amount of labeled data to get good performance on any supervised task. This greatly reduces the amount of labeled data required by pure supervised learning, and it makes the system more robust and better able to handle inputs that differ from the labeled training samples. It also sometimes reduces the sensitivity of the system to bias in the data, an improvement about which we'll share more of our insights in research to be made public in the coming weeks.
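To make that pretrain-then-fine-tune pattern concrete, here is a minimal sketch in PyTorch (not from the interview; the encoder, the denoising pretext task, and the data are hypothetical stand-ins): an encoder is first trained on a large unlabeled set with a self-supervised objective, and a small labeled set is then enough to fit a lightweight task head on top of the frozen representation.

```python
import torch
import torch.nn as nn

# Hypothetical encoder: maps raw inputs to a representation.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))

# --- Stage 1: self-supervised pretraining on unlabeled data ---
# The pretext task here is simple denoising: reconstruct the clean input
# from a corrupted copy. Real systems use much richer objectives.
decoder = nn.Linear(64, 128)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
unlabeled = torch.randn(10_000, 128)           # stand-in for a large unlabeled corpus

for _ in range(100):
    x = unlabeled[torch.randint(0, len(unlabeled), (256,))]
    corrupted = x + 0.1 * torch.randn_like(x)  # corruption supplies "labels" for free
    loss = nn.functional.mse_loss(decoder(encoder(corrupted)), x)
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: supervised fine-tuning on a small labeled set ---
head = nn.Linear(64, 10)                       # task head for, say, 10 classes
opt = torch.optim.Adam(head.parameters(), lr=1e-3)  # encoder kept frozen here
labeled_x, labeled_y = torch.randn(500, 128), torch.randint(0, 10, (500,))

for _ in range(100):
    with torch.no_grad():
        feats = encoder(labeled_x)             # reuse the pretrained representation
    loss = nn.functional.cross_entropy(head(feats), labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the pattern is that the expensive representation learning consumes no labels at all; only the small final head needs the 500 labeled examples.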
What's happening now in practical AI systems is that we are moving toward larger architectures that are pretrained with SSL on large amounts of unlabeled data. These can be used for a wide variety of tasks. For example, Meta AI now has language-translation systems that can handle a couple hundred languages. It's a single neural net! We also have multilingual speech-recognition systems. These systems can deal with languages for which we have very little data, let alone annotated data.
Other leading figures have said that the way forward for AI is improving supervised learning with better data labeling. Andrew Ng recently talked to me about data-centric AI, and Nvidia's Rev Lebaredian talked to me about synthetic data that comes with all the labels. Is there division in the field about the path forward?
LeCun: I don't think there is a philosophical division. SSL pretraining is very much standard practice in NLP [natural language processing]. It has shown excellent performance improvements in speech recognition, and it's starting to become increasingly useful in vision. Yet there are still many unexplored applications of "classical" supervised learning, so one should certainly use synthetic data with supervised learning whenever possible. That said, Nvidia is actively working on SSL.
Back in the mid-2000s, Geoff Hinton, Yoshua Bengio, and I were convinced that the only way we would be able to train very large and very deep neural nets was through self-supervised (or unsupervised) learning. This is when Andrew Ng started being interested in deep learning. His work at the time also focused on methods that we would now call self-supervised.
How could self-supervised learning lead to AI systems with common sense? How far can common sense take us toward human-level intelligence?
LeCun: I think significant progress in AI will come once we figure out how to get machines to learn how the world works like humans and animals do: mostly by watching it, and a bit by acting in it. We understand how the world works because we have learned an internal model of the world that allows us to fill in missing information, predict what's going to happen, and predict the effects of our actions. Our world model enables us to perceive, interpret, reason, plan ahead, and act. How can machines learn world models?
This comes down to two questions: What learning paradigm should we use to train world models? And what architecture should world models use? To the first question, my answer is SSL. An instance of that would be to have a machine watch a video, stop the video, and have the machine learn a representation of what's going to happen next. In doing so, the machine may learn enormous amounts of background knowledge about how the world works, perhaps similarly to how baby humans and animals learn in the first weeks and months of life.
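One common way to instantiate this predict-the-future-in-representation-space idea, sketched below in PyTorch under assumed shapes and a toy encoder (this is an illustration, not LeCun's exact recipe), is to encode the clip before the stopping point and the clip after it, then train a predictor to map the first embedding to the second, with gradients blocked through the target branch:

```python
import torch
import torch.nn as nn

# Hypothetical video encoder: a clip of 16 frames of 512-dim features -> one embedding.
class ClipEncoder(nn.Module):
    def __init__(self, dim=512, emb=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(16 * dim, emb))

    def forward(self, clip):                   # clip: (batch, 16, dim)
        return self.net(clip)

encoder = ClipEncoder()
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

video = torch.randn(32, 32, 512)               # stand-in batch: 32 videos, 32 frames each

past, future = video[:, :16], video[:, 16:]    # "stop the video" at frame 16
z_past = encoder(past)
with torch.no_grad():                          # target representation: no gradient
    z_future = encoder(future)
loss = nn.functional.mse_loss(predictor(z_past), z_future)
opt.zero_grad(); loss.backward(); opt.step()
```

A stop-gradient alone is a weak guard against representational collapse, where the encoder maps everything to the same vector; preventing that collapse without negative pairs is precisely what the non-contrastive methods LeCun describes next are designed to do.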
To the second question, my answer is a new type of deep macro-architecture that I call Hierarchical Joint Embedding Predictive Architecture (H-JEPA). It would be a bit too long to explain here in detail, but let's just say that instead of predicting future frames of a video clip, a JEPA learns abstract representations of the video clip and the future of the clip so that the latter is easily predictable based on its understanding of the former. This can be made to work using some of the latest developments in non-contrastive SSL methods, particularly a method that my colleagues and I recently proposed called VICReg (Variance-Invariance-Covariance Regularization).
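For reference, the three VICReg terms can be written compactly. The sketch below follows the formulation in the VICReg paper (mean-squared invariance between two views' embeddings, a hinge keeping each embedding dimension's standard deviation above a threshold, and a penalty on off-diagonal covariance); the default weights are the paper's, and the two inputs are assumed to be same-shaped batches of embeddings.

```python
import torch

def vicreg_loss(z_a, z_b, lambda_=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """VICReg loss on two batches of embeddings, each of shape (N, D)."""
    n, d = z_a.shape

    # Invariance: the two views should embed to the same point.
    inv = torch.nn.functional.mse_loss(z_a, z_b)

    # Variance: hinge keeping each dimension's std above gamma,
    # which prevents collapse to a constant vector.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.relu(gamma - std_a).mean() + torch.relu(gamma - std_b).mean()

    # Covariance: decorrelate embedding dimensions by penalizing
    # off-diagonal entries of each branch's covariance matrix.
    def off_diagonal(m):
        return m.pow(2).sum() - m.pow(2).diagonal().sum()

    z_a_c = z_a - z_a.mean(dim=0)
    z_b_c = z_b - z_b.mean(dim=0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    cov = (off_diagonal(cov_a) + off_diagonal(cov_b)) / d

    return lambda_ * inv + mu * var + nu * cov
```

In a joint embedding architecture, a loss of this family gives the predictor a non-degenerate target: the variance and covariance terms rule out the trivial solution in which both branches embed everything to the same constant.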
A few weeks ago, you responded to a tweet from OpenAI's Ilya Sutskever in which he speculated that today's large neural networks may be slightly conscious. Your response was a resounding "Nope." In your opinion, what would it take to build a neural network that qualifies as conscious? What would that system look like?
LeCun: First of all, consciousness is a very ill-defined concept. Some philosophers, neuroscientists, and cognitive scientists think it's a mere illusion, and I'm pretty close to that opinion.
But I have a speculation about what causes the illusion of consciousness. My hypothesis is that we have a single "world model engine" in our prefrontal cortex. That world model is configurable to the situation at hand. We are at the helm of a sailboat; our world model simulates the flow of air and water around our boat. We build a wooden table; our world model imagines the result of cutting pieces of wood and assembling them, etc. There needs to be a module in our brains, which I call the configurator, that sets goals and subgoals for us, configures our world model to simulate the situation at hand, and primes our perceptual system to extract the relevant information and discard the rest. The existence of an overseeing configurator might be what gives us the illusion of consciousness. But here is the funny thing: We need this configurator because we only have a single world model engine. If our brains were large enough to contain many world models, we wouldn't need consciousness. So, in that sense, consciousness is an effect of the limitation of our brain!
What role will self-supervised learning play in building the metaverse?
LeCun: There are many specific applications of deep learning for the metaverse, such as motion tracking for VR goggles and AR glasses, and capturing and resynthesizing body motion and facial expressions.
There are large opportunities for new AI-powered creative tools that will allow everyone to create new things in the metaverse, and in the real world too. But there is also an "AI-complete" application for the metaverse: virtual AI assistants. We should have virtual AI assistants that can help us in our daily lives, answer any question we have, and help us deal with the deluge of information that bombards us every day. For that, we need our AI systems to possess some understanding of how the world works (physical or virtual), some ability to reason and plan, and some level of common sense. In short, we need to figure out how to build autonomous AI systems that can learn like humans do. This will take time. But Meta is playing a long game here.