Autonomous countertraction for secure field of view in laparoscopic surgery using deep reinforcement learning

In this study, we focused on countertraction during mesenteric dissection in colorectal surgery as a representative scenario. The objective of countertraction during mesenteric dissection was defined as enhancing the planarity and visibility of tissue surfaces to secure the field of view in laparoscopic surgery. Accordingly, the reward functions were designed to reflect this objective. The RL model determines the traction direction and magnitude to maximize the expected rewards based on the observed point cloud of the membrane-like tissue surfaces and the current position of the forceps tip.

Observation and action space

In this study, we defined the observed state \(s_{t}\) from the environment, which serves as the input to the RL model, as follows:

$$ s_{t} = \left( \overrightarrow{p_{1}} \left( t \right), \ldots , \overrightarrow{p_{N}} \left( t \right), \overrightarrow{e_{1}} \left( t \right), \overrightarrow{e_{2}} \left( t \right) \right) $$

(1)

where \(\overrightarrow{p_{i}} = \left( x_{i} , y_{i} , z_{i} \right), i \in \left[ 1, N \right]\) represents the coordinates of the \(i\)-th point in the point cloud of the tissue surface, and \(\overrightarrow{e_{j}} = \left( x_{j} , y_{j} , z_{j} \right), j \in \left[ 1, 2 \right]\) denotes the tip positions of the two forceps. Assuming that the transformation matrix between the robot arm and the camera can be obtained through hand-eye calibration, both \(\overrightarrow{p_{i}}\) and \(\overrightarrow{e_{j}}\) are expressed in the camera coordinate system.

The output of the RL model is the displacement of the forceps tips in the camera coordinate system, \(\Delta \overrightarrow{e_{k}} = \left( \Delta x_{k} , \Delta y_{k} , \Delta z_{k} \right), k \in \left[ 1, 2 \right]\). The action \(a_{t}\) is obtained by applying a transformation matrix \(T_{\text{cam}}^{\text{rob}}\), which converts the camera coordinate system to the robot coordinate system, to these displacements.

$$ a_{t} = T_{\text{cam}}^{\text{rob}} \left( \Delta \overrightarrow{e_{1}} \left( t \right), \Delta \overrightarrow{e_{2}} \left( t \right) \right) $$

(2)
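
The frame change in Eq. (2) can be illustrated with a minimal NumPy sketch. It assumes the hand-eye calibration result is available as a 4x4 homogeneous matrix; the function name and the array layout are hypothetical and not taken from the study. Because the action is a displacement rather than a position, only the rotation block of the matrix affects it.

```python
import numpy as np

def camera_to_robot_displacements(T_cam_to_robot, deltas_cam):
    """Express forceps-tip displacements, given in the camera frame, in the robot frame.

    T_cam_to_robot: 4x4 homogeneous transform (assumed output of hand-eye calibration).
    deltas_cam:     (2, 3) array, one displacement vector per forceps tip.
    """
    T = np.asarray(T_cam_to_robot, dtype=float)
    deltas = np.asarray(deltas_cam, dtype=float)
    R = T[:3, :3]  # a displacement is a free vector, so the translation part drops out
    return deltas @ R.T  # row-wise equivalent of R @ delta for each tip

# Hypothetical calibration: 90 degree rotation about z plus an arbitrary offset.
T_example = np.eye(4)
T_example[:3, :3] = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_example[:3, 3] = [5.0, 5.0, 5.0]
deltas_robot = camera_to_robot_displacements(T_example, [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
```

Positions, unlike displacements, would also need the translation column of the matrix.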

Reward function

The reward function quantifies the desirability of the performed action by calculating a reward value from the state of the environment. We quantitatively evaluated the shape and orientation of the tissue surface using a point cloud representation. For the surface shape, the closer the surface is to a plane, the better its planarity. Consequently, we evaluated the planarity by calculating the distance between the 3D point cloud of the tissue surface and its least-squares plane. Let \(\vec{n}\) denote the unit normal of the least-squares plane and \(\overrightarrow{p_{c}}\) be the centroid of the 3D point cloud \(\left( \overrightarrow{p_{1}} , \ldots , \overrightarrow{p_{N}} \right)\). The average distance \(\underline{d}\) between the tissue surface and the least-squares plane was determined as follows:

$$ \underline{d} = \frac{1}{N}\sum\limits_{i = 1}^{N} \left| \left( \overrightarrow{p_{i}} - \overrightarrow{p_{c}} \right) \cdot \vec{n} \right| $$

(3)
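
As an illustration of Eq. (3), the sketch below fits a plane through the centroid of a point cloud and averages the absolute point-to-plane distances. It assumes an orthogonal (total) least-squares fit computed by singular value decomposition; the text does not state which fitting procedure was used, and the function name is hypothetical.

```python
import numpy as np

def plane_fit_mean_distance(points):
    """Return (unit normal, centroid, mean absolute point-to-plane distance).

    points: (N, 3) array with N >= 3, e.g. the tissue-surface point cloud.
    Assumption: orthogonal least-squares plane passing through the centroid.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the normal of the fitted plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    mean_distance = float(np.abs(centered @ normal).mean())
    return normal, centroid, mean_distance

# Four points alternating 0.1 above and below the plane z = 0.
cloud = [[0.0, 0.0, 0.1], [2.0, 0.0, -0.1], [0.0, 2.0, -0.1], [2.0, 2.0, 0.1]]
normal, centroid, mean_distance = plane_fit_mean_distance(cloud)
```

The sign of the returned normal is arbitrary, so any later step that depends on its direction needs its own orientation convention.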

For surface orientation, we assumed that better visibility was ensured when the tissue surface faces the camera along its optical axis. We evaluated this using the cosine of the angle \(\theta\) between the normal vector \(\vec{n}\) of the least-squares plane and the camera optical axis vector \(\vec{c}\). The cosine was calculated as follows:

$$ \cos \theta = \frac{\vec{n} \cdot \vec{c}}{\left| \vec{n} \right|\left| \vec{c} \right|} $$

(4)
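
Eq. (4) is the normalized dot product of the plane normal and the optical axis. A minimal sketch follows; the default optical axis of +z in camera coordinates is an assumption based on the common pinhole-camera convention, not something stated in the text, and the function name is hypothetical.

```python
import numpy as np

def cos_to_optical_axis(normal, optical_axis=(0.0, 0.0, 1.0)):
    """Cosine of the angle between a plane normal and the camera optical axis.

    optical_axis defaults to +z in camera coordinates (assumed convention).
    """
    n = np.asarray(normal, dtype=float)
    c = np.asarray(optical_axis, dtype=float)
    return float(n @ c / (np.linalg.norm(n) * np.linalg.norm(c)))

cos_facing = cos_to_optical_axis([0.0, 0.0, 2.0])   # normal parallel to the axis
cos_tilted = cos_to_optical_axis([1.0, 0.0, 1.0])   # normal tilted by 45 degrees
```

A fitted normal can point either way along its line, so the cosine may come out negative unless the normal is first oriented consistently; how the study resolves this is not described here.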

The reward value for the RL model \(R\) was calculated by integrating the assessments of the surface shape and orientation, with regularization parameters \(a\) and \(b\).

$$ R = \left( \frac{1}{\underline{d} + 1} \right)^{a} \left( \cos \theta \right)^{b} $$

(5)

Furthermore, to penalize excessive tissue deformation, negative rewards were assigned based on the strain values between particles. The distance \(d_{ij}^{t}\) between points \(i\) and \(j\) in the 3D point cloud at time \(t\), denoted by \(\left( \overrightarrow{p_{1}^{t}} , \ldots , \overrightarrow{p_{N}^{t}} \right)\), can be calculated as follows:

$$ d_{ij}^{t} = \left\| \overrightarrow{p_{i}^{t}} - \overrightarrow{p_{j}^{t}} \right\| $$

(6)

At each time \(t\), the largest strain over all pairs of points was defined as the maximum interpoint strain \(\varepsilon_{\max}^{t}\). A reward of -1 was assigned when \(\varepsilon_{\max}^{t}\) exceeded a certain threshold.

$$ \varepsilon_{\max}^{t} = \mathop{\max}\limits_{i \ne j} \frac{\left| d_{ij}^{t} - d_{ij}^{0} \right|}{d_{ij}^{0}} $$

(7)
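
The strain penalty of Eqs. (6) and (7) can be sketched as follows. The sketch assumes that strain is measured relative to the pairwise distances in a reference (for example, initial) configuration and that all point pairs are compared; the threshold value is not given in the text, so it is left as a parameter. Function names are hypothetical.

```python
import numpy as np

def max_interpoint_strain(points_t, points_ref):
    """Largest relative change in pairwise distance between two configurations.

    points_t:   (N, 3) point cloud at time t.
    points_ref: (N, 3) reference point cloud with the same point ordering
                (assumed here to be the undeformed configuration).
    """
    pt = np.asarray(points_t, dtype=float)
    p0 = np.asarray(points_ref, dtype=float)
    i, j = np.triu_indices(len(p0), k=1)          # every unordered pair once
    d_t = np.linalg.norm(pt[i] - pt[j], axis=1)   # current distances, Eq. (6)
    d_0 = np.linalg.norm(p0[i] - p0[j], axis=1)   # reference distances
    return float(np.max(np.abs(d_t - d_0) / d_0))

def strain_penalty(points_t, points_ref, threshold):
    """Return -1.0 when the maximum strain exceeds the threshold, else 0.0."""
    return -1.0 if max_interpoint_strain(points_t, points_ref) > threshold else 0.0

reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
stretched = 1.5 * reference                        # uniform 50 % stretch
strain = max_interpoint_strain(stretched, reference)
```

Comparing all pairs costs memory and time quadratic in the number of points, and coincident reference points would cause a division by zero, so a practical version might restrict the comparison to neighboring points.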

Neural network architecture

Figure 1 illustrates the neural network structure used in this study. To enable continuous control in any direction, we developed a machine-learning model based on soft actor-critic (SAC) [18], which can handle continuous actions. SAC consists of a policy function that outputs actions from states, a Q-function that outputs Q-values from states and actions, and rewards calculated by the reward function. The policy function determines the traction direction, whereas the Q-function evaluates this direction and guides its update. As in the original SAC model [18], we used the mean squared error (MSE) as the loss function to update the Q-function and the entropy-regularized policy loss to update the policy. In our approach, we employed PointConv [19], a point cloud CNN, for the policy function neural network instead of the traditional fully connected layers. PointConv is a neural network that maintains permutation invariance and captures the relationships between neighboring points in point clouds, thus facilitating effective feature extraction from non-rigid tissue shapes.

Fig. 1

Neural network architecture. The numbers above the arrows represent the dimensions of the inputs. The policy function includes PointConv and fully connected layers, which take the point cloud of the membrane tissue and the end-effector positions (in the camera coordinate system) as inputs and output the displacements of each end-effector position (in the camera coordinate system). The Q-function comprises three fully connected layers.
