The following is a compilation of, and reflection on, my work during my doctoral application period in October 2020.
Computer Vision is an essential component of Artificial Intelligence, and Convolutional Neural Networks (CNNs) have proven to be among its most effective methods. I believe the field has developed through the following stages:
In 1959, Hubel and Wiesel's biological experiments on the visual centers of cat brains demonstrated that vision is based on simple structures.
David Marr's book "Vision", published in 1982 and based on his work in the 1970s, argued that vision is hierarchical.
Kunihiko Fukushima's 1980 paper on the Neocognitron marked the application of neural networks to the study of vision.
In the 1990s, Yann LeCun, then at AT&T Bell Labs (he later joined New York University), used backpropagation and convolutional neural networks to classify handwritten digits such as those in the MNIST dataset, signaling the gradual maturation of the machine vision field.
The explosion of Artificial Intelligence began in 2012, when Alex Krizhevsky used AlexNet, a model structurally similar to Yann LeCun's, together with strong computational power. Built on CNNs and backpropagation, it significantly improved accuracy on the ImageNet image classification problem.
From the 8-layer AlexNet in 2012, to the roughly 20-layer GoogLeNet and VGG in 2014, to the 152-layer ResNet in 2015, neural network models became increasingly "deep," and their expressive ability and accuracy increased significantly along the way, popularizing the term Deep Learning.
In practice, neural network models have shown extremely powerful expressive ability, and their success in machine vision is sufficient to demonstrate this. Backpropagation makes it possible to train models end to end. In the past, the quality of a model depended largely on carefully hand-selected features (e.g., Histogram of Oriented Gradients), whereas deep learning models can now learn features themselves by training on massive data. This expressive ability has injected tremendous power into the Artificial Intelligence and Machine Learning fields. DeepMind's DQN model, which combines Q-Learning with Deep Learning, and AlphaGo are typical examples of this.
Generative Adversarial Networks (GANs), proposed by Ian Goodfellow and colleagues in 2014, have garnered widespread attention for their strong ability to learn and sample from data distributions and for the high quality of their generated results. The core idea is to optimize an adversarial loss function that trains a generator and a discriminator simultaneously, so that the generator learns the data distribution and can sample from it. In the ideal scenario, the discriminator is unable to distinguish between real and generated images.
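To make the adversarial loss concrete, here is a minimal sketch in plain Python. It assumes the discriminator outputs a probability that its input is real; the function names and the small `eps` safeguard are my own illustrative choices, and the generator loss shown is the common non-saturating variant rather than the only possible formulation.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative of mean log D(x) + mean log(1 - D(G(z))); the discriminator minimizes it."""
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: negative of mean log D(G(z))."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

# Ideal scenario from the text: the discriminator cannot tell real from
# generated samples, so it outputs 0.5 for everything.
confused = [0.5] * 4
loss_at_equilibrium = discriminator_loss(confused, confused)
print(loss_at_equilibrium)  # close to log(4), about 1.386
```

In this ideal case the discriminator loss settles at log 4, while the generator loss falls as the discriminator assigns higher probabilities to generated samples.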
As more researchers delve into the field, GANs have undergone significant development in terms of theory, models, and techniques.
From the perspective of generation results, GANs have evolved from an initial resolution of 28x28, to 64x64 in DCGAN, to highly realistic 1024x1024 images in PGGAN in 2017, with the quality and size of generated images continually improving.
From the perspective of network structure design, GANs have progressed from multilayer perceptrons, to convolutional neural networks, and then to Self-attention, coarse-to-fine structures, and other complex models capable of addressing a wider range of problems.
From the perspective of loss function optimization, the introduction of metrics such as Wasserstein Distance has made GAN training more stable.
From the perspective of applications, GANs have been used for data augmentation in semi-supervised learning, and conditional GANs for style transfer, image generation, and super-resolution, among other tasks.
At the same time, the training instability of GANs, mode collapse, and the "black box" problem remain open challenges.
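One common intuition for why the Wasserstein Distance mentioned above helps stabilize training can be sketched with a toy example. The setup is my own assumption, not taken from any particular paper's code: the real and generated distributions are point masses on a line, separated by `shift`. The Jensen-Shannon divergence, which the original GAN loss relates to, stays constant once the two distributions do not overlap, whereas the Wasserstein distance keeps shrinking as they move closer and so still indicates which direction is better.

```python
import math

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples (sort and match)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions given as {value: prob}."""
    support = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a):
        return sum(pa * math.log(pa / mix[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

real = [0.0] * 4
for shift in (4.0, 2.0, 1.0):
    fake = [x + shift for x in real]
    w = wasserstein_1d(real, fake)               # equals the shift
    js = js_divergence({0.0: 1.0}, {shift: 1.0})  # log(2) regardless of the shift
    print(shift, w, round(js, 4))
```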
Interestingly, this adversarial learning approach shows up in various forms in different fields. For example, actor-critic methods in reinforcement learning incorporate some adversarial ideas. Following the principles of GANs, some researchers train networks using rewards from reinforcement learning instead of loss functions. Other researchers have two neural networks communicate with each other to study the generation of language. I have an immature but bold hypothesis: research on intelligent agents (entities) represented by multiple neural networks that interact, adversarially or cooperatively, may be a direction for the future development of Artificial Intelligence, and it can also help us better study collective intelligence problems.
Reinforcement Learning is a major category of machine learning problems, distinguished from supervised and unsupervised learning by the "delayed" nature of its "supervision." Through trial and error, an intelligent agent interacts with an environment and learns a value function from the rewards it receives. The underlying mathematical model is the Markov Decision Process, which comes from optimal control theory. The theory has developed considerably over the past few decades: model-based Dynamic Programming methods such as Policy Iteration and Value Iteration, model-free prediction methods such as Monte-Carlo Learning and Temporal Difference Learning, and model-free control methods such as Q-learning and SARSA. Approximating value functions and policies with parameterized functions matters greatly for scaling and generalization; the corresponding methods are known as Value Function Approximation and Policy Gradient.
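As a minimal sketch of model-free control, the following tabular Q-learning example learns on a toy chain environment. The environment, the constants, and the purely random behaviour policy are all assumptions made for illustration. Because Q-learning is off-policy, it can still learn the values of the greedy policy while exploring at random.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Deterministic chain; the only reward (1.0) comes on reaching the terminal state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # trial and error: explore uniformly at random
            next_state, reward, done = step(state, action)
            # Delayed supervision: bootstrap from the best next action,
            # except at the terminal state where there is nothing to bootstrap from.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

With these settings the greedy policy should move right in every non-terminal state, and the learned value of moving right should decay by a factor of `GAMMA` for each extra step away from the goal.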
Deep Reinforcement Learning combines deep learning and reinforcement learning and has emerged from the development of both. Its core idea is to represent value functions and policies with neural network models, learn their parameters through gradient descent, and search for the optimal strategy. Deep reinforcement learning has seen substantial development in areas such as robot control, game AI, and autonomous driving. Examples include the DQN model trained to play Atari games and AlphaGo learning to play Go, as mentioned earlier.
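A full neural network is beyond a short sketch, but the idea of learning a parameterized policy by following a gradient can be shown with a much smaller stand-in. The example below is an assumption-laden illustration: it replaces the network with a two-parameter softmax policy, uses a hypothetical two-armed bandit as the environment, and applies the basic REINFORCE update without a baseline.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(mean_rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE on a bandit whose arms pay 1.0 with the given probabilities."""
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)  # policy parameters (stand-in for network weights)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = 1.0 if rng.random() < mean_rewards[action] else 0.0
        # Gradient of log pi(action) w.r.t. theta[k] is 1[k == action] - probs[k];
        # ascend the reward-weighted gradient.
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log
    return softmax(theta)

final_probs = reinforce_bandit()
print(final_probs)
```

After training, most of the probability mass should sit on the better-paying arm. In deep reinforcement learning, `theta` would be the weights of a network and the gradient would be computed by backpropagation.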
If we consider the above methods as intelligent agents learning in an environment to find the best strategy, a natural extension is: can multiple intelligent agents learn simultaneously? For example, in StarCraft, instead of having a single intelligent agent controlling all soldiers, can multiple intelligent agents control them, leading to better coordination strategies? This falls under the category of Multi-agent Reinforcement Learning. Further, is there a way to develop a swarm intelligence method, where intelligence emerges from the learning of each microscopically independent, simple intelligent entity?
Multi-agent Reinforcement Learning is a subfield of reinforcement learning. For simplicity, I roughly divide current multi-agent reinforcement learning into three categories: centralized, decentralized, and hierarchical. In centralized methods, all intelligent agents share the same parameters. In hierarchical methods, each lower-level intelligent agent has a higher-level controller; the agents may have different parameters, but their learning data must be fed back to the higher level. In decentralized methods, each intelligent agent learns independently without a higher-level controller. Centralized methods are the mainstream, hierarchical methods are gradually gaining attention, and decentralized research is still in its early stages.
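The difference between the centralized and decentralized categories, as I use the terms here, can be sketched structurally. The classes and names below are hypothetical and the "learning" is a single value update. The point is only that with shared parameters one agent's experience changes what every agent knows, while independent learners keep separate parameters. The hierarchical case is not shown.

```python
class QTable:
    """A tiny table of action values, standing in for model parameters."""

    def __init__(self, n_states, n_actions):
        self.values = [[0.0] * n_actions for _ in range(n_states)]

    def update(self, state, action, target, lr=0.5):
        self.values[state][action] += lr * (target - self.values[state][action])

class Agent:
    def __init__(self, name, table):
        self.name = name
        self.table = table

def make_agents(n_agents, shared, n_states=3, n_actions=2):
    """Centralized: one shared table. Decentralized: one table per agent."""
    if shared:
        table = QTable(n_states, n_actions)
        return [Agent(f"agent{i}", table) for i in range(n_agents)]
    return [Agent(f"agent{i}", QTable(n_states, n_actions)) for i in range(n_agents)]

shared_team = make_agents(2, shared=True)
independent_team = make_agents(2, shared=False)

# Only the first agent in each team learns from one experience.
for team in (shared_team, independent_team):
    team[0].table.update(state=0, action=1, target=1.0)

print(shared_team[1].table.values[0][1])       # 0.5: the second agent sees the update
print(independent_team[1].table.values[0][1])  # 0.0: the second agent is unaffected
```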