Reinforcement Learning (RL) has been developed for decades, achieving remarkable advances in chess (Silver et al., 2018), video games (Ye et al., 2020), smart grids (Chen et al., 2022), robotic control (Ibarz et al., 2021), and even the control of tokamak plasmas (Degrave et al., 2022). Among these achievements, researchers have also developed theories and algorithms for Multi-Agent Reinforcement Learning (MARL), an important branch of RL that has played a notable role in recent years in solving decision-making problems of Multi-Agent Systems (MAS). However, in contrast to single-agent RL, the difficulty of decision-making in MARL increases significantly as the number of agents grows. As a result, it is necessary to study how to accelerate the learning process of MARL (Wang et al., 2021).
Knowledge Transfer (KT) is one of the effective approaches to enhance the learning speed of MARL (Ilhan et al., 2021). As a type of Transfer Learning (TL) method, KT focuses on transferring the knowledge of one agent to others in a MAS. Starting from Episode Sharing, which intuitively shares successful trajectories in episodic tasks, KT has been applied to various multi-agent tasks (Tan, 1993). Action-Advising-based Knowledge Transfer (KT-AA) is one of the most popular KT methods. Instead of transferring whole trajectories, agents with KT-AA only share actions in specific circumstances, which evidently reduces the load on communication channels (Silva et al., 2020). While KT-AA brings benefits in communication, it also makes the when-to-transfer issue critical. To this end, Chernova and Veloso (2009) proposed a confidence-based trigger condition to determine whether an agent is familiar with a certain state, estimating the quality of its decision-making. Once an agent is predicted to make bad or uncertain decisions, it acts as an advisee and transfers its current state observation to the advisor. The advisor then makes a decision for the advisee, selecting and sending back a better action. Although Chernova and Veloso (2009) did pioneering work and designed the basic action-advising framework, their study still leaves room for improvement. To reduce communication, Torrey and Taylor (2013) proposed teaching on a budget, in which the advisors send a limited amount of action advice to the advisees. Benefiting from carefully designed trigger conditions that determine when an advisor should teach, this transfer approach can achieve satisfactory performance with a significantly lower communication load. Later, Amir et al. (2016) extended the teaching-on-a-budget framework, introducing a hybrid trigger condition that considers both the advisors' monitoring and the advisees' requirements.
Another important branch of KT-AA explores its potential in learning-from-scratch scenarios. Most of the above studies assume that there is at least one moderate-level expert in the system, while in learning-from-scratch scenarios, all the agents learn simultaneously from random policies. In these circumstances, one cannot define a fixed advisor before the learning process starts, making it harder to design trigger conditions. Silva et al. (2017) modified and applied the teaching-on-a-budget framework to learning-from-scratch scenarios. Unlike the above algorithms in which the advising agents are fixed, this study requires all agents to perform as both advisors and advisees. The role of an agent is determined by two sets of metrics, which take state familiarity and Q values into consideration. Hou et al. (2021) introduced memetic computation into the MARL system and proposed an Evolutionary Transfer Learning (eTL) method. By modeling the learning agents as memetic automatons, eTL provides two metrics to evaluate whether an agent has learned a better policy than the others during training, forming a dynamic indicator as the trigger condition for providing action advice. Moreover, eTL also employed stochastic mutation, an important operator in memetic computing, in the KT process, further enhancing the performance.
While KT-AA has primarily shown its ability in learning-from-scratch scenarios, it faces two main challenges. The first challenge arises when the number of agents grows, causing an exponential increase in the computation and communication load, which indicates that it is hard to scale (Wang et al., 2021). The other challenge lies in the design of trigger conditions (Omidshafiei et al., 2019). Without any agent possessing an assured decision-making ability, we should reconsider the qualification standard of an action. In previous studies with expert-level agents, the goal of KT was to select actions with better expected outcomes. But in learning-from-scratch scenarios, there is nothing to rely on to predict the potential consequences of an action. Thus, researchers have refocused on the primary goal of KT, which is to boost the learning speed (Wang et al., 2024).
Recently, Experience-Sharing-based Knowledge Transfer (KT-ES) has been proposed to tackle these problems. Wang et al. (2022) extended the concept of memetic knowledge transfer, developing a KT-ES algorithm, namely, MeTL-ES. Inspired by implicit meme transmission, MeTL-ES shares specific experiences, i.e., state transition tuples, rather than specific actions, among the learning agents. This mechanism converts the bi-directional information flow of each transfer into a one-way manner, which solves the scalability issue by nature. Moreover, MeTL-ES employs a new idea in designing the trigger condition. Rather than selecting experiences with higher rewards or Q values, it shares experiences according to stochastic rules. Specifically, in the early learning stage, agents using MeTL-ES share experiences almost at random, which matches the need for exploration in RL. Once performance has been enhanced to a certain level, metrics such as Q values become more accurate and exploration is roughly sufficient; MeTL-ES then evaluates the value of an experience via its Q value, and only experiences with higher potential outcomes get transferred. Using this mechanism, we can observe a rapid rise in performance in the early stage, which benefits from the sufficient exploration brought by MeTL-ES.
However, KT-ES methods such as MeTL-ES lack focus on the later learning stage, which may limit their performance, indicating that KT-AA and KT-ES have complementary features. Out of this consideration, this study proposes a Hybrid Knowledge Transfer method, KT-Hybrid, which combines the advantages of KT-AA and KT-ES, expecting to promote the learning performance of randomly initialized agents throughout the whole learning process. Overall, the contributions of this study are threefold:
1. This study discusses the scopes of application of KT-AA and KT-ES and presents a novel two-phase knowledge transfer framework to enhance the learning speed of MARL accordingly;
2. Based on the unique features of the framework, a novel algorithm, KT-Hybrid, is proposed, along with the corresponding trigger conditions to balance exploration and exploitation;
3. Building on the well-received Minefield Navigation Tasks, empirical studies in several typical scenarios are provided in this study, indicating that the proposed KT-Hybrid outperforms popular KT-AA and KT-ES algorithms.
2 Preliminaries
This section introduces some basic concepts and knowledge relevant to this study.
2.1 RL and MDP
RL is an effective way to solve decision-making problems that can be modeled as Markov Decision Processes (MDPs), and the entity that makes decisions to achieve certain tasks in a given environment is called an agent (Barto et al., 1989; Sutton and Barto, 2018). Typically, an MDP can be described by a 5-tuple 〈S, A, T, R, γ〉, in which S ⊆ ℝ^S denotes the S-dimensional state space of the environment, and A represents the action space of the agent, i.e., the K actions that an agent can take. T(s, a, s′): S × A × S → [0,1] is the state transition function, providing the probability that the current state s will transfer to s′ when the agent takes action a. R(s, a): S × A → ℝ is the reward function. The discount factor γ defines how future rewards are discounted.
Given the definition of an MDP, the experience of an agent can be defined as 〈s, a, r, s′〉, in which s denotes the state of the environment, a represents the action that the agent takes, r is the reward that the agent earns from the environment when taking action a, and s′ is the resultant environment state corresponding to s and a.
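For concreteness, such an experience tuple can be represented directly as a small container type. The sketch below is only an illustration; the field names are ours and are not prescribed by the paper.

```python
from dataclasses import dataclass

# Minimal container for one experience <s, a, r, s'> as defined above.
# Field names are illustrative; any equivalent structure works.
@dataclass(frozen=True)
class Experience:
    state: tuple        # s  - environment state observed by the agent
    action: int         # a  - action taken by the agent
    reward: float       # r  - reward returned by the environment
    next_state: tuple   # s' - resulting environment state

# Example: an agent in state (0, 1) takes action 2, earns reward -0.1,
# and observes the next state (0, 2).
e = Experience(state=(0, 1), action=2, reward=-0.1, next_state=(0, 2))
```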
Generally, the goal of an RL agent is to learn a policy π:S×A→[0,1] to conduct sequential decision-making processes in the MDP. Ideally, the policy will gradually converge to an optimal policy, π*, that can maximize the state value of the initial state. The state-value function can be given as follows:
v_\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right], \quad (1)

in which s_t and r_t denote the state and reward at time t, respectively.
Q-learning is one of the most popular RL algorithms (Watkins and Dayan, 1992), which learns to estimate the Q-value function given as follows:
Q_\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]. \quad (2)

Moreover, the update rule can be written as follows:
Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha\left(y - Q_\pi(s,a)\right), \quad (3)

in which y = r + \gamma \max_{a'} Q(s', a') denotes the target. Now, the TD error can be given as follows:

\delta = y - Q_\pi(s,a). \quad (4)
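A minimal tabular sketch of the update in Equations 3 and 4 is shown below; the environment interface, the ε-greedy policy, and the hyper-parameter values are illustrative assumptions rather than part of the original formulation.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step implementing Equations 3 and 4."""
    y = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)  # target
    td_error = y - Q[(s, a)]                                        # Eq. 4
    Q[(s, a)] += alpha * td_error                                   # Eq. 3
    return td_error

# Illustrative usage with an epsilon-greedy behavior policy over a small action set.
actions = [0, 1, 2, 3]
Q = defaultdict(float)  # Q[(state, action)] -> value, initialized to 0

def epsilon_greedy(Q, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```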
Since Deep Neural Networks (DNNs) have advanced greatly in the past decade, Mnih et al. (2013) and Mnih et al. (2015) proposed the Deep Q-Network (DQN), which incorporates DNNs into Q-learning. With the network parameterized by a vector θ, the loss function used in the learning process can be given as follows:
L(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\left[\left(y^{-} - Q(s,a;\theta)\right)^2\right], \quad (5)

where D is the experience buffer, which is designed to break the correlations of the data. Another component serving the same purpose is the target network. In short, the target network is a copy of the policy network whose parameters are updated intermittently, providing the target value y⁻. Denoting the parameters of the target network as θ⁻, the target value can be calculated as follows:
y^{-} = r + \gamma \max_{a'} Q(s', a'; \theta^{-}). \quad (6)

2.2 MARL and SG
When the number of entities that need to make decisions extends from one to several, the underlying model is extended from the MDP to Stochastic Games (SG) (Shapley, 1953). An SG can be described by the tuple 〈n, S, A, T, R, γ〉, in which n is the number of agents and S represents the state space of the environment. A = A_1 × A_2 × ⋯ × A_n denotes the joint action space, where A_i gives the action space of agent i. In this setting, we can define the action of agent i as a_i ∈ A_i and a ∈ A as the joint action of all the agents. Then, we can write the state transition function T as T(s, a, s′): S × A × S → [0,1].
Typically, in multi-agent tasks, the agents do not have access to the global state of the environment, i.e., s ∈ S. Instead, each agent i has its own observation function, denoted O_i. The tuple of an SG can then be extended to 〈n, S, O, O, A, T, R, γ〉, in which the first added O = O_1 × O_2 × ⋯ × O_n describes the joint observation space of the agents and the second O = {O_1, O_2, ⋯, O_n} gives the observation functions. In this study, we only consider the commonly used homogeneous multi-agent systems in which the agents share the same observation function O.
In this circumstance, conventional single-agent RL is not capable of handling such tasks, and the community has set its sights on Multi-agent Reinforcement Learning (MARL). Tan (1993) proposed applying single-agent Q-learning to multi-agent task scenarios, forming Independent Q-Learning (IQL). The learning process of IQL requires each agent to act and collect experience independently; on this basis, the agents also learn independently from their own rollouts. It is an intuitive but effective way to solve multi-agent tasks using RL. In the era of deep learning, Tampuu et al. (2017) integrated DNNs and advanced techniques such as experience replay into IQL, formalizing the Independent DQN (I-DQN) and raising the learning performance to a higher level. By combining the classic learning algorithm with newly developed techniques, I-DQN has inherent advantages in scalability and flexibility (Foerster et al., 2017). Thus, we adopt I-DQN as the basic learning algorithm in this study.
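To make the independent learning scheme concrete, the sketch below computes the DQN loss of Equation 5 with the target of Equation 6 for a single independent learner; in I-DQN each agent simply holds its own copy of these components. The network architecture, batch format, and hyper-parameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """A small Q-network; the architecture is an illustrative assumption."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    """DQN loss of Equation 5 with the target of Equation 6."""
    obs, actions, rewards, next_obs = batch  # tensors sampled from buffer D
    q = policy_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                    # target network provides y^-
        y = rewards + gamma * target_net(next_obs).max(dim=1).values
    return ((y - q) ** 2).mean()

# In I-DQN, each of the n agents keeps its own policy_net, target_net, and
# replay buffer, and runs this update independently on its own rollouts.
```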
2.3 Knowledge transfer for independent MARL
Knowledge transfer mechanisms aim to leverage insights gained by one agent to accelerate learning or improve performance for another agent. This can be particularly beneficial in MARL since the learning agents are able to share various types of useful information.
For I-DQN-like independent MARL algorithms, there are two popular branches of knowledge transfer: KT-AA and KT-ES. In KT-AA, agents share action advice with each other. Specifically, a learning agent with KT-AA dynamically determines whether to ask for advice during the learning process (Torrey and Taylor, 2013). Once the asking process is triggered, the agent sends the current state it encounters to the other agents for help, and the agents receiving the query provide a potential action. The most distinctive advantage of KT-AA lies in the direct correction of a specific action, but at the same time, it suffers from poor scalability due to its bi-directional information exchange. KT-ES is a recently developed knowledge transfer approach. Compared with KT-AA, which was initially developed for learning systems with experts, KT-ES was proposed for scenarios in which all agents learn simultaneously (Wang et al., 2022). Given this background, KT-ES focuses more on long-term benefits by encouraging exploration, the rationality of which has also been validated recently in KT-AA (Wang et al., 2024). To achieve this, agents using KT-ES methods share personal experiences in a one-way manner. Typically, the experiences are defined by the state transitions together with variables used to calculate the sharing trigger conditions. The stochastic sharing, which does not condition on specific states, exposes agents to unfamiliar situations from which exploration may emerge.
To summarize, it is clear that both KT-AA and KT-ES can enhance learning performance, but with different motivations. In the following section, we will introduce a novel knowledge transfer method that combines the strengths of KT-AA and KT-ES to enhance the learning performance of independent MARL systems further.
3 KT-Hybrid
This section introduces the overall architecture of the proposed KT-Hybrid algorithm, along with the design details answering the questions of what to transfer, when to transfer, and how to use the received knowledge.
3.1 Architecture
Recall the goal that we require knowledge transfer to achieve. Primarily, we need the agents in MARL systems to learn as fast as possible, during which period the agents form basic-level policies that can roughly obtain satisfactory performance. Meanwhile, to ensure the learning speed, i.e., the speed at which a basic level of performance is formed, it is not ideal to put too much extra computational load on the agents in this early learning stage. In addition, to avoid possible obstacles to large-scale MARL, it is better to adopt knowledge transfer methods with guaranteed scalability. These considerations make KT-ES a great solution for MARL problems, especially in the early learning stage.
While KT-ES meets the needs of the early learning stage, it usually delivers weaker performance in the late learning stage. The main reason is that KT-ES methods pay much more attention to data efficiency and scalability than to the decision-making quality of any specific action. Thus, agents with KT-ES tend to reach a roughly qualified policy in a very short time, but the performance may then plateau or even drop slightly in the late stage. This phenomenon matches both the results reported by Wang et al. (2022) and our pilot study.
On the contrary, KT-AA focuses on every single decision an agent makes, at the cost of scalability. Specifically, KT-AA needs bidirectional interactive communication rather than the unidirectional broadcast-like communication of KT-ES. However, this cost brings KT-AA its unique advantage: it can help the agents at specific decision-making steps. By incorporating KT-AA, every possible action of each agent has some probability of being double-checked by other agents, helping to prevent bad decisions in specific states. This feature indicates that although KT-AA may be inefficient in the early learning stage, it has the potential to further enhance policies that are already roughly trained. In summary, KT-ES and KT-AA exhibit distinct advantages across different learning stages, suggesting the potential for improved learning performance through their combination.
Building on the above discussions, the goal of this study is to design a hybridization of KT-ES and KT-AA, named KT-Hybrid, which makes full use of their complementary features. Overall, KT-Hybrid follows a two-phase structure, as shown in Figure 1.
Figure 1. Overall architecture of KT-Hybrid. Robots in different backgrounds represent different learning agents, purple waves show the shared experience in the Igniting Phase, black arrows depict the self-learning process of the agents, and the blue and orange arrows are the transferred state and action in the Boosting Phase, respectively. In the central subfigure, the green background shows the early learning stage, and the orange shows the later learning stage. These two stages are divided with a blurred line, indicating that the shift timing should be selected by balancing the exploration and exploitation of these two stages.
In the MARL process, the first phase is called the Igniting Phase, which aims to rapidly learn a moderate-level policy in the early stage. Thus, we design experience-sharing-based knowledge transfer mechanisms in the Igniting Phase, leveraging the benefits of KT-ES in terms of fast learning speed and high data efficiency to provide the users with workable policies as soon as possible.
For the learning process in the late stage, the Boosting Phase takes over to further boost the policy to a higher level with more communication effort. At this stage, the requirement of the user shifts from obtaining workable policies to tuning the policies for better performance. Therefore, the Boosting Phase follows the principles of action-advising-based knowledge transfer approaches, expecting to obtain better performance even at a higher communication load.
Since the primary goal of knowledge transfer is to promote learning speed, it is unacceptable for KT-Hybrid to adopt a transfer scheme of high computational complexity, which would lead to significant computational costs.
Given this more complex mechanism, the performance of KT-Hybrid is promising, yet its design is challenging. Within this framework, we detail the format of knowledge in the two phases in Section 3.2, introduce a novel trigger condition to balance the two phases in Section 3.3, and design the learning scheme accordingly in Section 3.4.
3.2 What to transfer
This subsection introduces what type of knowledge is transferred in KT-Hybrid. To fit the two-phase architecture of KT-Hybrid, we design the form of knowledge for each phase separately.
For the Igniting Phase, agents are required to share implicit knowledge that serves as an ingredient of the learning process, i.e., experiences (Wang et al., 2022), rather than explicit actions, for fast improvement of the policies. Formally, any agent i at time t observes the state of the environment and obtains its own observation o_t^i. Agent i then makes a decision to take action a_t^i according to o_t^i and its current policy π_t^i. Subsequently, the state of the environment s transitions to a new state s′, and agent i gets an updated state observation, denoted o_{t+1}^i, together with the reward signal r_t^i from the environment. Now, after a complete state (observation) transition, the experience of agent i at time t can be formulated as follows:
E_t^i = \langle o_t^i, a_t^i, r_t^i, o_{t+1}^i \rangle. \quad (7)

Then, agent i will assess the quality of E_t^i; once E_t^i is suitable for sharing, it will be broadcast as a knowledge package K_t^i by agent i to the other agents for further learning. Formally, the knowledge at time t in the Igniting Phase, i.e., K_t^I, can be expressed as follows:
In the knowledge shown in Equation 8, p_t^{I,i} denotes the probability for agent i to share the current experience in the Igniting Phase. A detailed description and definition of p_t^{I,i} will be provided in Section 3.3. In addition, how KT-Hybrid uses the shared knowledge K_t^I for learning will be described in Section 3.4.
The benefits of defining the state (observation) transition as the carrier of knowledge are threefold. First, sharing transitions that an agent has just experienced does not require any extra computation or memory storage, which means that the agents can learn faster with no extra load. Second, since the broadcasting of experiences is one-way communication, the overall communication load increases only linearly as the number of agents grows, resulting in good scalability. Finally, the state transition is commonly used as experience in RL, which makes KT-Hybrid a general knowledge transfer method for a wide range of MARL algorithms.
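A minimal sketch of this one-way broadcast is given below. Since Equation 8 is not reproduced above, the exact contents of the knowledge package are an assumption: here we assume it carries the experience tuple plus the sender's latest Q value, because Equation 12 later relies on Q values received from peers. How the received knowledge is used is detailed in Section 3.4; the consumer below simply stores peer experiences in the local replay buffer as one plausible treatment.

```python
import random

def maybe_broadcast(agent_id, experience, q_value, p_share, channel):
    """Igniting Phase: share the just-collected experience with probability p_share.

    `channel` is a hypothetical one-way broadcast medium (e.g., a shared list);
    the package contents are an assumption: the experience plus the sender's
    latest Q value, which peers use to maintain the mean peer Q (Equation 12).
    """
    if random.random() < p_share:
        channel.append({"sender": agent_id, "experience": experience, "q": q_value})

def consume_shared(agent_id, channel, replay_buffer, peer_q):
    """Receive peers' packages: store experiences for learning, track peers' Q values."""
    for pkg in channel:
        if pkg["sender"] != agent_id:
            replay_buffer.append(pkg["experience"])
            peer_q[pkg["sender"]] = pkg["q"]
```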
As learning proceeds, KT-Hybrid turns to the Boosting Phase. At this phase, the agents already have moderate-level policies for the task in the environment. Thus, it is time for the agents to transfer explicit knowledge, i.e., actions, to each other. For any agent i, to get action advice for observation o_t^i at time t, it queries the other agents with the observation o_t^i. Upon receiving the query, the other agents provide actions according to their own policies. Take agent j as an example: at time t, it generates action a_t^{ij} according to π_t^j and o_t^i, in which a_t^{ij} represents the action generated by agent j for the observation of agent i at time t. Then, agent i receives a collection containing advice from all the others, i.e., a^{-i} = {a_t^{ij}}_{j ≠ i}. At time t in the Boosting Phase, defining the probability of triggering the action advising from agent j for agent i as p_t^{B,ij}, all the advice for agent i at time t can be defined as follows:
Moreover, the knowledge at time t in the Boosting Phase, denoted by K_t^B, can be written as follows:
With the action advice from the other agents, the focal agent can avoid some inappropriate decisions made by its moderate-level policy. At the same time, transferring action-based explicit knowledge can further improve the policy in the Boosting Phase.
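A minimal sketch of the Boosting Phase exchange described above is given below, assuming direct access to peers' policies purely for illustration; in practice the query and the reply travel over a communication channel, and the concrete form of p_t^{B,ij} follows the trigger conditions of Section 3.3.

```python
import random

def request_advice(agent_i, obs_i, peer_policies, trigger_prob):
    """Boosting Phase: agent i queries its peers with observation o_t^i and
    collects an advised action a_t^{ij} from every triggered peer j.

    `trigger_prob(i, j)` stands for the probability p_t^{B,ij}; it is a
    placeholder supplied by the caller.
    """
    advice = {}
    for j, policy_j in peer_policies.items():
        if j != agent_i and random.random() < trigger_prob(agent_i, j):
            advice[j] = policy_j(obs_i)  # peer j acts on agent i's observation
    return advice  # the collection a^{-i} of advice for agent i
```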
3.3 When to transfer
Since the primary goal of KT-Hybrid in learning-from-scratch settings is to promote learning performance over the whole process, it is important to assess whether a piece of knowledge should be transferred, i.e., when to transfer.
Most trigger conditions in the literature aim to transfer knowledge that results in better task performance. However, we have to distinguish the purpose of knowledge transfer during the multi-agent learning process from that of action demonstration during task execution. During learning, the goal of knowledge transfer should be to promote learning performance rather than to enhance the correctness of specific actions, which is especially true when all the agents learn simultaneously. Thus, it is necessary to reconsider what kind of knowledge is more beneficial to the different learning phases of KT-Hybrid.
The Q value is a widely adopted metric for designing trigger conditions (Silva and Costa, 2019). In the Igniting Phase, all of the agents are in the early learning stage. At this stage, conventional metrics such as Q values are not reliable since the networks have not been fully trained. At the same time, according to the explore–exploit balance, RL agents at this stage need to sufficiently explore the environment. As learning proceeds, the Q values calculated by the networks become more accurate and reliable. Out of this consideration, in the Igniting Phase of KT-Hybrid, the probability of triggering knowledge transfer, i.e., sharing the current experience, for any agent i at time t is defined as p_t^{I,i}, which is calculated as follows:
p_t^{I,i}(Q_t^i, \bar{Q}_t^{-i}, \tau) =
\begin{cases}
0, & f(Q_t^i, \tau) \le \bar{Q}_t^{-i} \\
1, & f(Q_t^i, \tau) > \bar{Q}_t^{-i}
\end{cases} \quad (11)

In Equation 11, \bar{Q}_t^{-i} represents the mean of the latest Q values received from the other agents, which can be written as follows:
\bar{Q}_t^{-i} = \frac{1}{n-1} \sum_{j \in N^{-i}} Q_t^j, \quad (12)

in which N^{-i} denotes the set of agents in the MARL system excluding agent i. Meanwhile, f(Q_t^i, τ) in Equation 11 is a scaling function inspired by the Sigmoid function, in which τ represents the number of learning steps that agent i has experienced. Formally, the scaling function can be calculated as follows:
f(Q_t^i, \tau) = \frac{Q_t^i}{1 + \exp(a - b\tau)}, \quad \tau \in \mathbb{N}^+ \quad (13)

where a and b are tuning hyper-parameters. Substituting Equations 12 and 13 into Equation 11, we can see that when τ is small, i.e., when agent i has experienced little training, the sharing of experiences is triggered as often as possible. As τ increases, the agents tend to share experiences with higher Q values. This matches our expectation for the balance between exploration and exploitation.
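The Igniting Phase trigger of Equations 11–13 can be implemented in a few lines, as sketched below; the hyper-parameter values of a and b are illustrative assumptions only.

```python
import math

def scaled_q(q_i, tau, a=5.0, b=0.01):
    """Scaling function f of Equation 13; a and b are tuning hyper-parameters
    (the default values here are illustrative assumptions)."""
    return q_i / (1.0 + math.exp(a - b * tau))

def igniting_trigger(q_i, q_bar_minus_i, tau, a=5.0, b=0.01):
    """Equation 11: share the current experience iff the scaled Q value of
    agent i exceeds the mean Q value received from the other agents."""
    return 1 if scaled_q(q_i, tau, a, b) > q_bar_minus_i else 0

def mean_peer_q(peer_q):
    """Equation 12: mean of the latest Q values received from the other agents."""
    return sum(peer_q.values()) / len(peer_q)
```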
However, we should note that because the trigger condition in the Igniting Phase (Equation 11) does not consider any specific observations, the agents can only obtain moderate-level policies from an overall perspective. Meanwhile, the agents will have followed different learning trajectories during the independent learning of the Igniting Phase, meaning they will be proficient in different states. Thus, building on the policies learned in the Igniting Phase, the agents need to further learn from each other's expertise in the Boosting Phase.
Inspired by Hou et al. (2021), we consider two metrics to evaluate the necessity of taking advice. The first is the success count l^i, which counts the number of successful episodes that agent i has experienced. The other is the self-significance h^i = Q_t^i / max(Q_1^i, ⋯, Q_t^i), which evaluates the significance of an action advice to the advisor. Given the above definitions, the advice from agent j, i.e., a^{ij}, is qualified for agent i to choose only when it satisfies the following condition:
By the condition shown in Equation 14, only advice that is provided by agents with better overall performance, and that may therefore have higher potential returns, will be considered.
However, since an agent is allowed to take only one action at a time, we need to further design a merging module to resolve the conflicts among the qualified actions from different peers. Inspired by multi-objective evolutionary algorithms, a ranking score R^j for each qualified action a^{ij} can be given as follows:
in which \hat{l}^j = l^j / \max(l^1, ⋯, l^n) is the regulated success count.
Given Equation 15, we can finally choose the action with the highest R^j as the final advice among the qualified candidates in the Boosting Phase.
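The sketch below illustrates this merging step using the two stated metrics (the success count and the self-significance). Since Equations 14 and 15 are not reproduced above, the qualification test and the ranking score are passed in as caller-supplied callables rather than hard-coded; this is an assumption of the sketch, not the paper's exact rule.

```python
def regulated_success(counts, j):
    """Regulated success count l_hat^j = l^j / max(l^1, ..., l^n)."""
    return counts[j] / max(counts.values())

def self_significance(q_history):
    """Self-significance h = Q_t / max(Q_1, ..., Q_t) for one agent."""
    return q_history[-1] / max(q_history)

def merge_advice(advice, is_qualified, ranking_score):
    """Keep advice that passes the qualification condition (Equation 14) and
    return the action whose ranking score (Equation 15) is highest.
    `is_qualified(j)` and `ranking_score(j)` are supplied by the caller."""
    qualified = {j: a for j, a in advice.items() if is_qualified(j)}
    if not qualified:
        return None                      # fall back to the agent's own action
    best_j = max(qualified, key=ranking_score)
    return qualified[best_j]
```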
As for the extra computational load of calculating the trigger conditions, it mainly involves three parts: the calculation of l^i, h^i, and R^j. Calculating l^i only requires maintaining a count of successful episodes. Calculating h^i and R^j only requires finding the maximum of a list and one multiplication. All of these calculations need only simple algebraic operators. Note that although the computations are conducted continually, they have the same computing frequency as the policy network, i.e., for each agent, the trigger-condition-related calculations are conducted only once after the agent makes a decision using the corresponding neural network. Compared with the computational load of forward propagation, which consists of thousands of algebraic calculations such as multiplications, the extra computational load brought by the trigger conditions can be neglected. In addition, the computation cost of the trigger conditions in our proposed KT-Hybrid is comparable to that of many previous studies, such as AdHocTD and AdHocVisit by Silva et al. (2017), eTL by Hou et al. (2017), and MeTL-ES by Wang et al. (2022). Thus, the extra computational load brought by the trigger conditions of KT-Hybrid is tolerable.
Another critical issue is how to determine the timing of the shift from the Igniting Phase to the Boosting Phase, which means we need a metric to determine whether the learning process is in the early or the later stage. There are several potential principles for designing this shift scheme, including keeping the agents transferring knowledge throughout the whole learning process, preserving a longer Igniting Phase for lower communication cost, or letting the Boosting Phase intervene as early as possible for more efficient transfer. In this study, we take a straightforward shifting scheme by setting a fixed learning-episode threshold E to divide the early and later learning stages. The learning process is treated as the early learning stage before E episodes have been experienced, during which the Igniting Phase is triggered. Then, in the later learning stage, the Boosting Phase is used for the remaining learning process. The sensitivity of E will be tested later in Section 4. In addition, while the design of shifting schemes holds promise for exploring the potential of KT-Hybrid, delving into this aspect is beyond the scope of this study.
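The fixed-threshold shift adopted here is straightforward to express, as sketched below; E is the episode threshold whose sensitivity is examined in Section 4, and the step functions named in the comments are hypothetical placeholders.

```python
def in_igniting_phase(episode, E):
    """Return True during the early learning stage (Igniting Phase),
    False once the Boosting Phase should take over."""
    return episode < E

# Illustrative use inside the training loop (function names are placeholders):
# if in_igniting_phase(ep, E):
#     run_experience_sharing_step(...)   # Igniting Phase (KT-ES style)
# else:
#     run_action_advising_step(...)      # Boosting Phase (KT-AA style)
```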
Now, we can analyze the time complexity of the proposed KT-Hybrid. In deep RL, the main computational load lies in the decision-making (forward propagation) and learning (backpropagation) processes, both of which are related to the size of the network. In the following analysis, all the agents are assumed to use the same neural networks. Since the computational load of calculating the trigger conditions (Equations 11–15) is much smaller than that of the neural-network-related calculations, we omit this part here and treat the corresponding computational load as a constant C_t.
Assuming there are n agents in the MARL system, each of which conducts one round of both decision-making and training at each time step on average, the total amount of computation of the system can be formulated as follows:
c_{\mathrm{total}} = \beta \cdot c_I + (1 - \beta) \cdot c_B + C