A transformer network incorporated into a DRL policy should improve dexterous object manipulation with an anthropomorphic robot hand. Furthermore, in the policy training phase of DRL, imitation learning and a reward function are used to train the robot's grasping intelligence from demonstrations. Imitation learning helps in early-stage learning, and the reward function drives the learning process to completion. In the design of the reward function, reward shaping is employed by utilizing several categories of information, such as the object location, the robot hand location, and the hand pose.
In this work, we propose a novel methodology for dexterous object manipulation with an anthropomorphic robot hand via transformer-based DRL with a novel reward-shaping function. The proposed transformer network estimates natural hand poses from the object affordance to guide the learning process of the DRL. The transformer network estimates high-dimensional information (i.e., 24-DoF hand poses) from low-dimensional information (i.e., 6-DoF object affordance encoding location and orientation). This mapping provides natural hand poses for the movement of an anthropomorphic robot hand. The DRL algorithm then utilizes the estimated natural hand poses in the reward-shaping function and learns dexterous object manipulation.
2. Methods

The proposed T-DRL intelligent framework for object grasping is composed of the two main components illustrated in Figure 1. First, the transformer network estimates an optimal natural hand pose for a given object's 3D point cloud. Second, the DRL policy learns to control the movements of an anthropomorphic robot hand for object manipulation.

We leverage the capabilities of the transformer network to infer the optimal natural hand pose from the object affordance. The object affordance is represented by the 3-DoF position and 3-DoF orientation (roll, pitch, and yaw) of an anthropomorphic robot hand. The optimal grasping hand pose from the transformer network consists of 21-DoF joint positions, corresponding to a five-fingered anthropomorphic hand. We leverage DRL to train the anthropomorphic robot hand for object manipulation with a natural hand pose. The DRL is composed of a policy that performs robot actions and a reward function that evaluates and updates the policy parameters. The proposed DRL uses the estimated optimal hand pose to achieve natural grasping and relocation of various objects. The reward function is configured such that the estimated hand pose is set as a goal, and the reward increases as the joints of the anthropomorphic robot hand become aligned with the estimated hand pose. The policy updates the hand poses (i.e., actions) of the anthropomorphic robot hand (i.e., agent) so that the value of the reward function increases for natural grasping. The proposed T-DRL is trained to achieve natural grasping and relocation of six objects. In this work, we selected objects whose shape properties represent most of the 24 objects in the ContactPose DB. Therefore, the proposed T-DRL trained in this work can be generalized to objects with similar shapes.
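To make the overall flow concrete, the following minimal sketch illustrates how the two components could interact within a single manipulation episode. All names (env, pose_net, policy, and their methods) are hypothetical assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the two-stage T-DRL pipeline for one episode.
def run_episode(env, pose_net, policy, horizon=1000):
    """Estimate a goal hand pose once from the object affordance, then let the
    DRL policy act while the shaped reward compares the hand to that goal."""
    obs = env.reset()
    affordance = env.object_affordance()      # 6-DoF grasp proposal (GraspNet-style)
    goal_pose = pose_net(affordance)          # 21-DoF natural hand pose (transformer)
    total_reward = 0.0
    for t in range(horizon):
        action = policy(obs)                  # joint commands for the robot hand
        obs, done = env.step(action)
        total_reward += env.shaped_reward(goal_pose, t)   # Equation (6)-style reward
        if done:
            break
    return total_reward
```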
2.1. Databases for T-DRL Training

To derive the optimal hand poses from the transformer network and incorporate them into the proposed DRL, this work utilizes databases of 6-DoF object affordances and 21-DoF hand poses.
The 6-DoF object affordance information is derived from GraspNet [34], which computes the point cloud of an object and produces 6-DoF affordance information. The probability of grasping success is determined according to the object shape and the hand pose of a parallel gripper, as shown in Figure 1. GraspNet [34] is trained with 206 objects from five categories (boxes, cylinders, mugs, bottles, and bowls) based on a variational autoencoder, which estimates a predefined number of grasping poses in Cartesian coordinates. The 6-DoF object affordances are denoted as ϕ^M = {ϕ_i}, i = 1, …, M, where M denotes the number of grasps and each ϕ_i ∈ ℝ^6 denotes the position and orientation of the anthropomorphic robot hand attempting to grasp the object.

The 21-DoF hand pose information, corresponding to the joint angles of a five-fingered anthropomorphic hand, is derived from the ContactPose database [35]. This database consists of grasping demonstrations made by human hands holding various household objects. The database also includes the point clouds of objects with the corresponding 3D coordinates of the 21 finger joints of a human hand pose. The grasping data were collected with two grasping intents, 'use' (i.e., using the object after grasping) and 'handoff' (i.e., handing over the object after grasping), for fifty different objects. We selected objects representing six geometric shapes (spherical, rod, cube, oval, cylindrical, and curved) from the 24 available objects, namely: apple, banana, camera, hammer, bottle, and light bulb. Each object has a total of 25 grasping demonstrations, which were reduced to the 10 performed with the right hand and with the functional intent 'use'. Henceforth, the extracted 21-DoF joint coordinates of the hand poses are denoted as φ^N = {φ_i}, i = 1, …, N, where N denotes the number of demonstrations and each φ_i ∈ ℝ^21 denotes the joint angles in the hand pose reference system.

As input to the transformer network, we need a data tuple containing five 21-DoF hand poses and five 6-DoF object affordances. Of the ten 21-DoF hand poses for each object, two were reserved for testing and eight for training. Each tuple was formed by matching a hand pose with the 6-DoF affordance generated at the same location as the center position of the 21-DoF hand pose. By randomly selecting five of the eight training pairs, a list of approximately 16,500 unique tuples is generated from the combinations of 6-DoF object affordances ϕ^M and 21-DoF hand poses φ^N for the six objects.
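The sketch below shows one way such a tuple could be assembled from the eight training demonstrations of an object; the function name, array shapes, and sampling scheme are illustrative assumptions rather than the authors' preprocessing code.

```python
import random
import numpy as np

def sample_tuple(affordances, hand_poses, rng, k=5):
    """affordances: list of 6-D arrays (ϕ), hand_poses: list of 21-D arrays (φ),
    index-aligned so pose i was demonstrated at the location of affordance i.
    Returns one (k, 6) affordance block and one (k, 21) hand-pose block."""
    idx = rng.sample(range(len(hand_poses)), k)        # choose 5 of the 8 training demos
    phi = np.stack([affordances[i] for i in idx])      # (5, 6) object affordances
    psi = np.stack([hand_poses[i] for i in idx])       # (5, 21) hand poses
    return phi, psi

# Example usage for one object:
# rng = random.Random(0)
# phi, psi = sample_tuple(affordances, hand_poses, rng)
```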
2.2. Object Manipulation T-DRL

Figure 2 illustrates the two stages of training and testing for the proposed method. For training, Figure 2a illustrates the inputs to the transformer network, which include a set of 6-DoF object affordances and 21-DoF hand poses. The network estimates the optimal hand pose by computing the attention value between the two inputs. The attention value indicates the similarity between elements of the inputs and is used to estimate the 21-DoF hand pose in the decoder. In the testing phase, Figure 2b illustrates the process of estimating the hand pose corresponding to the 6-DoF object affordance generated from the environment. In testing, a Mask R-CNN [36] with a ResNet-50-FPN backbone pre-trained on the COCO object dataset [37] estimates a pixel-level mask for the object in the captured RGBD image. This mask is then used to segment the image and extract a point cloud with a partial view of the object. The point cloud goes through GraspNet to obtain the object affordance, and, combined with a set of hand poses from the ContactPose database, the pre-trained transformer network estimates the optimal 21-DoF hand pose φ^N.

2.2.1. Transformer Network with Attention Mechanism

The transformer network illustrated in Figure 2a is composed of two modules that derive an optimal hand pose for natural grasping: the 6-DoF object affordance encoder and the 21-DoF hand pose encoder–decoder.

The object affordance encoder first uses a self-attention mechanism to discover the local relationship between every element in the source input of 6-DoF object affordances. The attention layer in the encoder uses three independent feedforward networks to transform the input (i.e., the 21-DoF hand pose and the 6-DoF object affordance) into query (Q), key (K), and value (V) tensors of dimensions d_q, d_k, and d_v, respectively. In the self-attention layer, Q, K, and V are calculated from each encoder input, and the self-attention layer finds the correlation between the encoder inputs. To calculate the attention value, the query and key tensors are dot-multiplied and scaled. A softmax function then obtains the attention probabilities, and the resulting tensor is linearly projected using the value tensor. The attention is computed with the following equation [38]:

Attention(Q, K, V) = softmax(QKᵀ/√d_k)V,
(1)
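To make Equation (1) concrete, a minimal implementation of scaled dot-product attention might look like the following sketch; the NumPy-based formulation and tensor shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Returns the attended values, shape (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V
```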
The computed attention weights represent the relative importance of each element in the input sequence (the keys) for a particular output (the query), and they are used to weight the value tensor. To broaden the range of input features that can be learned via attention, multi-head attention runs multiple self-attention layers in parallel and learns different projections of the input data. These are expressed as:

MultiHead(Q, K, V) = Concat(h_1, …, h_h)W^O,
h_i = Attention(QW_i^Q, KW_i^K, VW_i^V),
(2)
where the linear transformations W_i^Q ∈ ℝ^(d_model×d_k), W_i^K ∈ ℝ^(d_model×d_k), W_i^V ∈ ℝ^(d_model×d_v), and W^O ∈ ℝ^(h·d_v×d_model) are parameter matrices. The parameter h represents the number of subspaces, with d_k = d_v = d_model/h = 4. The two self-attention layers shown in Figure 2 therefore compute attention matrices to find similarities within the 6-DoF object affordances and the 21-DoF hand poses individually. The self-attention output is an embedding of dimension d that carries the mutual information of the elements in the input. A residual connection then expands this embedding by concatenating it with the original object affordance. A normalization layer and two feedforward networks, both with non-linear activation functions, then output the encoded representation (attention value) of the object affordance from the encoder. The hand pose encoder performs the same operations as the object affordance encoder, with the 21-DoF hand pose as its target input. The output of the hand pose encoder is an embedding vector of dimension d that captures the mutual information of the input elements through a self-attention mechanism, a residual connection, and two feedforward layers.

The outputs of both encoders are used as input to the hand pose decoder. The cross-attention layer learns a mapping between the encoded values of the 21-DoF hand pose and the 6-DoF object affordance. Through this mapping the network can estimate the optimal hand pose for the object affordance ϕ^M. The decoder uses a cross-attention mechanism in which K and V are calculated from the 6-DoF object affordances and Q is calculated from the 21-DoF hand poses. The cross-attention layer then discovers the relationship between the 6-DoF object affordances and the 21-DoF hand poses. Multi-head attention expands the ability of the model to focus on different elements of the 6-DoF object affordances and 21-DoF hand poses in parallel, mapping from 6-DoF to 21-DoF through the attention values calculated in the decoder. This output is later used in the reward-shaping function for training the DRL policy.
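The sketch below illustrates this decoder-side multi-head cross-attention, with queries taken from the hand-pose embeddings and keys/values from the object-affordance embeddings. The projection layout and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention as in Equation (1)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_cross_attention(pose_emb, afford_emb, Wq, Wk, Wv, Wo, n_heads=4):
    """pose_emb: (n_p, d_model), afford_emb: (n_a, d_model);
    Wq/Wk/Wv/Wo: (d_model, d_model). Heads split d_model evenly."""
    d_model = pose_emb.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = pose_emb @ Wq, afford_emb @ Wk, afford_emb @ Wv   # Q from poses, K/V from affordances
    heads = [
        attention(Q[:, i * d_head:(i + 1) * d_head],
                  K[:, i * d_head:(i + 1) * d_head],
                  V[:, i * d_head:(i + 1) * d_head])
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ Wo                  # (n_p, d_model)
```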
2.2.2. DRL with Reward Shaping

The DRL policy illustrated in Figure 1 is trained to control the robot's movements to generate a natural grasping hand pose for object manipulation. The model-free DRL policy π_θ(a_t|s_t) describes the control problem of the robot hand as a Markov decision process (S, A, R), where s ∈ S, S ⊆ ℝ^n, is an observation vector describing the current locations of the robot hand and the object in Cartesian coordinates. The agent's actions a ∈ A, A ⊆ ℝ^m, control the robot hand's movements to interact with the object. A reward function r_t = R(s, a, s′) evaluates the completeness of the manipulation task.

To optimize the parameters of the DRL policy, we follow the implementation of the natural policy gradient (NPG) in [38], which computes the policy gradient as:

∇_θJ(θ) = (1/NT) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇_θ log π_θ(a_t^i | s_t^i) A^π(s_t^i, a_t^i)
(3)
Then, it preconditions the gradient with the Fisher information matrix F_θ and makes the following normalized gradient ascent update:

θ_{k+1} = θ_k + √( δ / (∇_θJ(θ)ᵀ F_{θ_k}^{-1} ∇_θJ(θ)) ) · F_{θ_k}^{-1} ∇_θJ(θ)
(4)
where δ is the step size. The advantage function A^π is the difference between the value of the given state–action pair and the value function of the state. The NPG in [38] uses the generalized advantage estimator (GAE), which is defined as:

A_t^GAE = ∑_{l=0}^{T} (γλ)^l δ_{t+l}^V
(5)
where δ_t^V = r_t + γV(s_{t+1}) − V(s_t) is the temporal-difference residual between consecutive predictions, assuming a value function V that approximates the true value function V^π. Both the policy and value networks share the same architecture and are defined by a three-layer neural network with 64 units in the hidden layers.

The reward function maximizes the sum of the expected rewards at the end of each training episode. This function measures the similarity of the robot hand's pose to the optimal hand pose estimated by the transformer network. It also evaluates how close the object is to the target location at the end of an episode. The reward function used to solve the manipulation task is expressed as follows:

r_g = λ_1 · r_joints,                if t < 150,
r_g = λ_2 · r_h:o + λ_3 · r_o:t,     otherwise,
(6)
where r_joints is the mean squared error between the 21-DoF hand pose derived from the transformer network and the robot hand's pose after following the policy actions. Empirically, we found that by optimizing for r_joints during the first 150 timesteps of each iteration, the policy could learn how to shape the robot hand before grasping an object. The second term of the reward function minimizes the distance between the hand and the object, r_h:o, and the distance of the object to the target location, r_o:t. The values of λ_1, λ_2, and λ_3 are constant weights that balance the total reward r_g, which measures the completeness of the task. The value of r_g gives a larger reward when, at the end of the episode, the object finishes near the target location (a short illustrative sketch of this shaped reward is given below).

2.3. Training and Validation of T-DRL

The proposed method is trained for natural object manipulation, including grasping and relocating a randomly located object to a random target position.
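Because the shaped reward of Equation (6) drives the training described in this section, a minimal sketch of how it could be computed is given here. The 150-step switch and the three terms follow the description in Section 2.2.2, while the weight values, sign conventions, and all function and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Illustrative constants; the paper only states that λ1, λ2, λ3 are constant weights.
LAMBDA_1, LAMBDA_2, LAMBDA_3 = 1.0, 1.0, 1.0
POSE_SHAPING_STEPS = 150  # timesteps during which only the hand-pose term is used

def shaped_reward(t, robot_joints, goal_joints, hand_pos, obj_pos, target_pos):
    """Equation (6)-style reward: shape the hand toward the transformer's 21-DoF
    goal pose early in the episode, then reward approaching and relocating the object."""
    if t < POSE_SHAPING_STEPS:
        # Negative MSE so that closer alignment with the goal pose yields a larger reward
        r_joints = -np.mean((robot_joints - goal_joints) ** 2)
        return LAMBDA_1 * r_joints
    r_h_o = -np.linalg.norm(hand_pos - obj_pos)      # hand-to-object distance term
    r_o_t = -np.linalg.norm(obj_pos - target_pos)    # object-to-target distance term
    return LAMBDA_2 * r_h_o + LAMBDA_3 * r_o_t
```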
2.3.1. Training and Validation of the Transformer Network

For each training step, a minibatch of size B was sampled from the tuples. More specifically, we defined the input tuples as (ϕ^M, φ^{N−1})_j, j = 1, …, B, and the corresponding hand-pose outputs as (φ^N)_j. We minimized the MSE loss between the ground-truth and predicted hand poses using the Adam optimizer with a learning rate of 3 × 10^−4, β_1 = 0.9, and β_2 = 0.98. The model was trained until the MSE for every joint between the predicted and ground-truth hand poses was lower than 5%. We validated the transformer network by the MSE between the estimated hand joints and the ContactPose hand joints. Additionally, we compared the distribution of joint-angle ranges between the dataset and the estimated hand poses.
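The training loop described above might be sketched as follows in PyTorch-style Python; the model class, the data loader contents, and the stopping criterion are assumptions, and only the optimizer settings mirror the text.

```python
import torch
from torch import nn

def train_pose_transformer(model, loader, epochs=100):
    """loader yields (affordances, input_poses, target_pose) minibatches."""
    opt = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.98))
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for affordances, input_poses, target_pose in loader:
            pred_pose = model(affordances, input_poses)   # predicted 21-DoF hand pose
            loss = loss_fn(pred_pose, target_pose)        # MSE against the ground truth
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```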
2.3.2. Training and Validation of T-DRL

The T-DRL policy updates its parameters by computing the sum of the discounted rewards ∑_{i=0}^{T} γ^i r_i from Equation (6). Learning is optimized using the natural policy gradient (NPG) described in [38]. For each object, two policies were initialized with a random normal distribution and trained for 5000 iterations, each iteration with N = 200 episodes and a time horizon of T = 1000 time steps. At the start of each episode, the transformer network computes the 21-DoF hand pose for grasping the object in the simulation environment.

We compared our approach with a baseline DRL (i.e., the NPG algorithm described in Section 2.2.2) using only the sparse reward term r_o:t of the reward function r_g. After training, we compared the success rates of grasping and relocation (i.e., r_g exceeding a threshold that indicates the policy has learned to grasp and move the object to the target location) over 100 trials of NPG and T-DRL.

4. Discussion

For dexterous object manipulation, using natural hand poses has a significant impact on the success of natural object grasping. Several previous studies have conducted dexterous object manipulation with hand pose estimation and DRL [32,41]. Yueh-Hua Wu et al. [32] used GraspCVAE and DRL to manipulate objects with an ADROIT robot hand. GraspCVAE is based on a variational autoencoder that estimates the natural hand pose from the object affordance. In that work, DRL imitated the estimated hand pose without reward shaping and manipulated five objects (a bottle, a remote, a mug, a can, and a camera), achieving an average manipulation success rate of 80%. Similarly, Priyanka Mandikal et al. [41] used the FrankMocap regressor and DRL to manipulate objects with an ADROIT robot hand. They utilized the FrankMocap regressor to estimate natural hand poses according to the object affordance. Six objects (a pan, a mug, a teapot, a knife, a cup, and a hammer) were manipulated, achieving an average success rate of 60%. The low success rate was mainly due to the FrankMocap regressor, which has shown poor performance in generating novel natural hand poses without object and hand pose data. In our study, we incorporated the transformer network into DRL with reward shaping for object manipulation. Our T-DRL outperforms these methods on the manipulation of similar objects, with an average success rate of 90.1%. We believe the transformer network and reward shaping in the DRL contribute to this performance.

In T-DRL, dexterous object manipulation with an anthropomorphic robot hand is made possible through the natural hand poses produced by the transformer network. For instance, NPG without natural hand poses grasps only a few objects, such as the apple, hammer, banana, and light bulb, as shown in Figure 4. For the bottle and camera, NPG fails to grasp them. These results demonstrate that, for objects with complex shapes, proper guidance through natural hand poses plays a critical role in grasping. In contrast, the proposed T-DRL with natural hand poses successfully grasps all six objects. Unlike NPG, when grasping each object with T-DRL, we observed that each object is grasped with a hand pose that suits the object affordance. This is because of the natural hand pose produced by the transformer network according to the object affordance. Accordingly, T-DRL utilizes the natural hand pose during learning via reward shaping. The reward curves in Figure 4 show that the proposed reward function performs more effective exploitation of the object grasping space than NPG.
For example, when manipulating the hammer and the banana, NPG and T-DRL both obtain high rewards, but only T-DRL grasps the objects with natural hand poses. This suggests that the proposed reward function provides better guidance for natural object manipulation than the NPG reward function. In general, the reward of T-DRL increases much faster than that of NPG. This suggests that unnecessary exploration of the object grasping space is reduced for the anthropomorphic robot hand, while useful experience for successful grasping is gained much more quickly. The proposed reward-shaping function in T-DRL improves the grasping success rate from 65.3% with NPG to 90.1% with T-DRL. Using natural hand poses and reward shaping in T-DRL appears to play a significant role in training the anthropomorphic robot hand for dexterous object manipulation, especially considering the high DoF of the five fingers.