Deep Reinforcement Learning for Concentric Tube Robot Path Following

As surgical interventions trend towards minimally invasive approaches, Concentric Tube Robots (CTRs) have been explored for various interventions such as brain, eye, fetoscopic, lung, cardiac and prostate surgeries. Arranged concentrically, each tube is rotated and translated independently to move the robot end-effector position, making kinematics and control challenging. Classical model-based approaches have been previously investigated, and more recent deep learning-based approaches have outperformed them in both forward kinematics and shape estimation. We propose a deep reinforcement learning approach to control in which a single policy generalises across two to four CTR systems, an element not yet achieved in any other deep learning approach for CTRs. In this way we explore the likely robustness of the control approach. We also investigate the impact of rotational constraints applied to tube actuation and their effect on error metrics. We evaluate inverse kinematics errors and tracking errors for path following tasks and compare the results to those achieved using state-of-the-art methods. Additionally, as current results are obtained in simulation, we investigate a domain transfer approach known as domain randomization and evaluate error metrics as an initial step towards hardware implementation. Finally, we compare our method to a Jacobian approach found in the literature.


I. INTRODUCTION
CONCENTRIC tube robots (CTRs) are a class of continuum robots that depend on the interactions between neighboring, concentrically aligned tubes to produce the curvilinear shapes of the robot backbone [1]. The main application of these unique robots is that of minimally invasive surgery (MIS), where most of the developments for CTRs have been focused. MIS has trended towards semi-autonomous and autonomous robotic surgery to improve surgical outcomes [2], [3]. Due to the confined workspaces and resulting extended learning times for surgeons in MIS, dexterous, compliant continuum robots such as CTRs have been under development in preference to the mechanically rigid and limited degrees-of-freedom (DOF) robots used in interventional medicine today. The pre-curved tubes of CTRs, which are sometimes referred to as active cannulas or catheters, are manufactured from super-elastic materials such as Nickel-Titanium alloys with each tube nested concentrically [4]. From the base, the individual tubes can be actuated through extension, as seen in Fig. 1, which results in the bending and twisting of the backbone as well as access to the surgical site through the channel and robot end-effector.
This work was supported by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) at UCL (203145Z/16/Z), EPSRC (EP/P027938/1, EP/R004080/1). Danail Stoyanov is supported by a Royal Academy of Engineering Chair in Emerging Technologies (CiET1819\2\36) and an EPSRC Early Career Research Fellowship (EP/P012841/1). For the purpose of open access, the author has applied a CC BY public copyright license to any accepted manuscript version arising from this submission. Keshav Iyengar and Danail Stoyanov are with the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS), University College London, London W1W 7EJ, UK. Sarah Spurgeon is with the Department of Electronic and Electrical Engineering, University College London, London W1W 7EJ, UK. keshav.iyengar@ucl.ac.uk
CTRs are motivated clinically for use in the brain, lung, cardiac, gastric, and other surgical procedures where tool miniaturization and non-linear tool path trajectories are beneficial [6]. In particular, they have been investigated as steerable needles and surgical manipulators. As steerable needles, they substitute traditional steerable needles with higher precurvature and dexterity. They are also actuated with a follow-the-leader approach, where every point along the robot's backbone traces the same path as the tip. As surgical manipulators, the benefit of using CTRs is the large number of design parameters that allow for patient-specific and procedure-specific CTR designs through optimization for surgical requirements and design heuristics [7], [8]. In general, however, CTRs, with their increased DOFs and miniaturization potential, may be beneficial with their flexibility to reach a larger section of a surgical site with a working channel through the tubes for irrigation, ablation, or other tools. Thus, it is key to accurately control the robot tip position to desired Cartesian points in the robot workspace. However, due to tube interactions, modeling and control are challenging. Position control for CTRs has relied on model development, and although a balance between computation and accuracy has been reached in the literature [9], there remain issues such as performance in the presence of tube parameter discrepancies and the impact of unmodelled physical phenomena such as friction and permanent plastic deformation. This motivates the development of an end-to-end model-free control framework for CTRs.
One such model-free framework for robotic control that is gaining popularity is reinforcement learning (RL), a paradigm of machine learning that deploys an agent to output an action that interacts with an environment [10]. The environment then processes this action and returns a new state and, depending on the task, a reward signal. RL closely parallels a control system: the agent is equivalent to the controller, the actions are equivalent to the control actions, the state is equivalent to any measurable signal from the plant, the reward is equivalent to a performance metric (e.g., minimizing steady-state error), and lastly, the learning algorithm is equivalent to the adaptive mechanism of the controller. Deep reinforcement learning (DeepRL) combines deep learning and RL and allows for high-dimensional states and actions traditionally not available to RL algorithms or classical control. In this work, we expand on the previously published literature on utilizing DeepRL to control CTRs [11]. Specifically, the aim is to control the end-effector Cartesian robot tip position with a DeepRL agent by means of actions that represent changes in joint values. The state includes the desired goal at the current step, allowing for more complex control tasks such as path following as well as inverse kinematics.
In §II, we review the Jacobian approach [12] as well as the state-of-the-art inverse kinematics error metrics. The model reviewed is also used in simulation for training and evaluation of our DeepRL method. In §III, the main components of our DeepRL approach for CTRs are described. Improvements such as exploring constrained and constraint-free tube rotation as well as generalizing the policy to accommodate multiple CTR systems are introduced. For tube rotation, we investigate how constraining the rotational degree of freedom for CTRs affects overall error metrics. In other deep learning methods, tube rotations are often constrained during data collection. For example, [13] uses a restricted rotation joint space of −π/3 to π/3. For a neural network approach, this may not cause issues; however, in a timestep-based exploration method like DeepRL, such constraints may hinder exploration of the workspace. Also introduced is a novel end-to-end CTR generalized DeepRL policy, which to our knowledge is the first work on generalizing over tube parameters with a model-free framework for CTRs. Thus far, in deep learning approaches for CTRs, training occurs with data from a single selected CTR system, in simulation, hardware, or a combination. However, considering that hardware systems often differ from simulation due to manufacturing inconsistencies, and given the large design space of CTRs, this type of training overfits the single CTR system. To demonstrate generalization to multiple CTR systems, the work investigates the scalability and accuracy of generalization over 4 CTR systems of varying workspace sizes. In §IV, results validating these improvements as well as error comparisons to previous tip-tracking and inverse kinematics methods are presented. Finally, we discuss strategies for translation to hardware including domain transfer and initial experiments for a domain transfer strategy known as domain randomization. The contributions that evolved from our previous work [11] may be summarised as:
1) Investigating constrained and constraint-free tube rotation.
2) Development of an initial proof-of-concept CTR system generic policy for CTRs.
3) Details for a pathway and initial simulation results for strategies to hardware translation.

II. RELATED WORK
Over the last few years, deep learning approaches have become popular for control as well as kinematic and dynamic estimation of CTRs. The first deep learning approach was by Bergeles et al. [14] for forward and inverse kinematics, performed in a simulation where a simple extension and rotation representation was used. However, tube rotation representation in the network resulted in ambiguity in inverse kinematics solutions. In later work by Grassman et al. [13], by improving the joint representation, errors of 2.8% of robot length were achieved, albeit in a limited region of the workspace and in simulation. More recently, work on shape estimation and shape-to-joint input estimation has been investigated [15], [16] using deep neural networks. Finally, Donat et al. [17] introduced tip contact force estimation based on backbone deflection using a data-driven approach via deep direct cascade learning (DDCL). With tip error represented as a percentage of robot backbone length, more traditional optimization-based methods and inverse Jacobian methods have been found to have tip tracking errors of 3.2% and 2.5%, respectively. State-of-the-art closed-loop control methods can achieve errors of 0.9% and 0.5% [18]. With active constraints and the use of a model-predictive closed-loop controller, tip errors of 0.3-0.5% [19] have also been reported.
The Jacobian-based controllers that are common in the literature [18], [20] can be used in a closed-form fashion to perform path following with a CTR. A similar comparison was done in [19]. Given a control input or desired change in joint values $\dot{q}_d$, the desired change in Cartesian space $\dot{x}_d$, the desired position in Cartesian space $x_d$, a positive semi-definite matrix $K_p$ and the pseudo-inverse Jacobian $J^{\dagger}$, we can define a control law such that
$$\dot{q}_d = J^{\dagger}\left(\dot{x}_d + K_p\,(x_d - x)\right),$$
where $x$ is the current position in Cartesian space. Moreover, the pseudo-inverse can become very sensitive to singularities, so a damping factor $\Lambda$ is added such that $J^{\dagger} = (J^{T} J + \Lambda^{2} I)^{-1} J^{T}$. As shown in [19], this method does not account for joint limits, resulting in failed trajectories.
Although learning-based approaches have been well developed and have had success for forward kinematics, force estimation and shape estimation, inverse kinematics and control using deep learning has remained an open problem for CTRs. Given that the deep learning-based forward kinematics [13] and shape estimation [15] report better error metrics than their physics-based model comparisons, investigating a deep learning-based approach for inverse kinematics and control could be advantageous. To this end, we have investigated the use of DeepRL for CTRs. Our initial work [21] investigated the exploration problem for CTRs with simpler constant curvature dominant stiffness kinematics for simulation. The exploration problem stems from previous analysis for workspace characterization [22] that has shown the bias associated with uniform joint sampling for CTR workspaces. Due to the constraints on extension from the actuation perspective, full extension and retraction are less likely to be sampled. Since DeepRL methods rely on the experiences collected during training episodes as determined by the agent's actions, if full extension or retraction joint values are not sampled, kinematics and control in those areas of the workspace will not be accurate. To mitigate this, noise is usually added to the selected actions or policy network to explore the state space. Our initial work determined that applying separate noise to the rotation and extension joints was crucial in acquiring a policy with accurate control.
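To make the Jacobian baseline concrete, the following is a minimal sketch of one damped pseudo-inverse control step. It assumes the standard resolved-rate form $\dot{q}_d = J^{\dagger}(\dot{x}_d + K_p(x_d - x))$; the function names, the toy Jacobian and all numerical values are hypothetical and are not taken from [18]-[20].

```python
import numpy as np


def damped_pseudo_inverse(J, damping):
    """Damped pseudo-inverse (J^T J + damping^2 I)^-1 J^T."""
    n_joints = J.shape[1]
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(J.T @ J + damping**2 * np.eye(n_joints), J.T)


def joint_velocity_command(J, x, x_d, x_dot_d, K_p, damping):
    """Assumed resolved-rate law: q_dot = J_dagger (x_dot_d + K_p (x_d - x))."""
    return damped_pseudo_inverse(J, damping) @ (x_dot_d + K_p @ (x_d - x))


# Made-up 3x6 Jacobian: 3 Cartesian rows, 6 joints (values are illustrative only).
J = np.hstack([np.eye(3), 0.5 * np.eye(3)])
x = np.zeros(3)                   # current tip position (mm)
x_d = np.array([1.0, 0.0, 0.0])   # desired tip position (mm)
q_dot = joint_velocity_command(J, x, x_d, np.zeros(3), np.eye(3), damping=0.1)
print(q_dot)
```

The damping term trades a small tracking bias for bounded joint velocities near singular configurations; note that nothing in this step enforces joint limits, which is the failure mode reported in [19].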
As this work builds on our previous DeepRL literature, a brief overview of key concepts is given. In our last work [11] we improved our initial DeepRL approach [21] by using a more accurate geometrically exact kinematics model [9] for simulation, investigating joint representations, and applying a reward curriculum to improve sample efficiency (faster policy convergence with fewer data) for training. Two joint representations and three curricula functions were evaluated. The curricula were used to determine the goal tolerance during training steps, a novel approach for DeepRL to the best of our knowledge. The egocentric representation with the decay curriculum performed best overall in terms of sample efficiency and error metrics. To demonstrate policy robustness, a second noise-induced simulation was created where Gaussian noise was added to the joint values as encoder noise and to the end-effector position as tracking noise. Training was then performed on both the noise-induced and noise-free simulations. When evaluating on the noise-induced simulation, a slight improvement was seen for the policy trained with the noise-induced simulation. The main takeaway from these experimental results was that the learned policy can tolerate some amount of noise in the state and still perform adequately.
In §III, we first formulate the Markov Decision Process (MDP) consisting of state, action, and rewards. Next, in subsection III-D, we review the results of our previous work, specifically the egocentric and proprioceptive joint representations, reward curricula, and combinations of representations and curricula. We then expand to constraint-free rotation in III-E to significantly improve error metrics, and introduce and evaluate the system identifier to generalize results across multiple CTR systems. Finally, in §IV we provide an initial experiment for domain transfer using domain randomization in simulation to motivate transfer to hardware.

III. METHODS
In RL, a Markov Decision Process (MDP) mathematically defines the agent's task. Importantly, it defines the key elements for changing the state, the associated rewards, and the actions that affect the state to achieve the task.

A. Markov Decision Process Formulation
In the following, the state, action, reward, and goals are defined.
State ($s_t$): The state at timestep $t$ is defined as the concatenation of the trigonometric joint representation [13], the Cartesian goal error between the achieved position and desired position, and the current goal tolerance. As shown in Fig. 2, the rotation and extension of tube $i$ (ordered innermost to outermost) are $\alpha_i$ and $\beta_i$, with $L_i$ representing the full length. First, the trigonometric representation, $\gamma_i$, of tube $i$ is defined as:
$$\gamma_i = \{\gamma_{1,i}, \gamma_{2,i}, \gamma_{3,i}\} = \{\cos(\alpha_i), \sin(\alpha_i), \beta_i\}.$$
The rotation can be retrieved by taking the arc-tangent
$$\alpha_i = \operatorname{atan2}(\gamma_{2,i}, \gamma_{1,i}).$$
The extension joint $\beta_i$ can be retrieved directly and has constraints from the actuation side. In our previous work, the rotation was constrained to $[-180^\circ, 180^\circ]$, which was not required in the trigonometric representation, as will be shown with constraint-free rotation. The Cartesian goal error is the current error between the achieved end-effector position $G_a$ and the desired end-effector position $G_d$. Lastly, the current goal tolerance, $\delta(t)$, is included in the state, where $t$ is the current timestep of training. The full state, $s_t$, can then be defined as:
$$s_t = \{\gamma_1, \gamma_2, \gamma_3, G_a - G_d, \delta(t)\}.$$
Action ($a_t$): Actions are defined as a change in rotation and extension joint positions:
$$a_t = \{\Delta\beta_1, \Delta\beta_2, \Delta\beta_3, \Delta\alpha_1, \Delta\alpha_2, \Delta\alpha_3\}.$$
The maximum actions in extension and rotation are set to 1.0 mm and 5°.
Goals ($G_a$, $G_d$): Goals are defined as Cartesian points within the workspace of the robot. There is the achieved goal, $G_a$, and the desired goal, $G_d$, where the achieved goal is determined with the forward kinematics of the kinematics model used and is recomputed at each timestep as the joint configuration changes from the actions selected by the policy. The desired goal updates at the start of every episode, where a desired goal is found by sampling valid joint configurations in the workspace and applying the forward kinematics of the model. However, this sampling is not uniform in Cartesian space and requires action exploration.
Rewards ($r_t$): The reward is a scalar value returned by the environment as feedback for the action chosen by the agent at the current timestep. In this work, sparse rewards are used as they have been shown to be more effective than dense rewards when using hindsight experience replay (HER) [23].
The reward function used in this work is defined as:
$$r_t = \begin{cases} 0 & \text{if } e_t \le \delta(t) \\ -1 & \text{otherwise} \end{cases}$$
where $e_t$ is the Euclidean distance $\lVert G_a - G_d \rVert$ at timestep $t$ and $\delta(t)$ is the goal-based curriculum function that determines the goal tolerance at training timestep $t$. The workspace and various state and reward elements are illustrated in Fig. 3.
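The trigonometric joint representation and the sparse reward can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the reward values assume the usual 0 / −1 sparse convention used with HER, and the example goal positions are made up.

```python
import math


def trig_representation(alpha, beta):
    """gamma_i = (cos(alpha_i), sin(alpha_i), beta_i) for a single tube."""
    return (math.cos(alpha), math.sin(alpha), beta)


def rotation_from_trig(gamma):
    """Recover alpha_i (radians, in (-pi, pi]) with the two-argument arc-tangent."""
    return math.atan2(gamma[1], gamma[0])


def sparse_reward(achieved_goal, desired_goal, tolerance):
    """Assumed sparse reward: 0 inside the goal tolerance, -1 otherwise."""
    error = math.dist(achieved_goal, desired_goal)  # e_t = ||G_a - G_d||
    return 0.0 if error <= tolerance else -1.0


gamma = trig_representation(alpha=math.radians(120.0), beta=-15.0)
print(math.degrees(rotation_from_trig(gamma)))                 # recovers the rotation
print(sparse_reward((0.0, 50.0, 150.0), (0.0, 50.5, 150.0), tolerance=1.0))
print(sparse_reward((0.0, 50.0, 150.0), (0.0, 55.0, 150.0), tolerance=1.0))
```

Because the rotation enters the state only through its cosine and sine, the representation stays continuous when a tube rotates past ±180°, which is what makes the rotation constraint optional.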

B. Goal-Based Curriculum
In our previous work, we introduced a novel goal-based curriculum that reduces the goal tolerance over training steps to improve error convergence. Linear and exponential decay goal-based curricula, along with a baseline constant curriculum function, were compared in combination with proprioceptive and egocentric joint representations. Each curriculum reduces the goal tolerance as a function of timestep $t$, the number of timesteps over which to apply the function, $N_{ts}$, the initial tolerance, $\delta_{initial}$, and the final tolerance, $\delta_{final}$. The linear curriculum is defined as
$$\delta_{linear}(t) = a t + b, \quad a = \frac{\delta_{final} - \delta_{initial}}{N_{ts}}, \quad b = \delta_{initial},$$
and the exponential decay curriculum is defined as
$$\delta_{decay}(t) = a (1 - r)^{t}, \quad a = \delta_{initial}, \quad r = 1 - \left(\frac{\delta_{final}}{\delta_{initial}}\right)^{1/N_{ts}}.$$
The values used for these various parameters can be found in [11].
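A goal-based curriculum of this kind can be sketched as two small functions that interpolate the tolerance from $\delta_{initial}$ to $\delta_{final}$ over $N_{ts}$ timesteps. The exact functional forms and the numerical values below are assumptions chosen to satisfy those two boundary conditions; the parameter values actually used are given in [11].

```python
import math


def linear_tolerance(t, n_ts, delta_initial, delta_final):
    """Linear curriculum: straight-line decrease, then held at delta_final."""
    if t >= n_ts:
        return delta_final
    slope = (delta_final - delta_initial) / n_ts
    return slope * t + delta_initial


def decay_tolerance(t, n_ts, delta_initial, delta_final):
    """Exponential decay curriculum: geometric decrease, then held at delta_final."""
    if t >= n_ts:
        return delta_final
    rate = 1.0 - (delta_final / delta_initial) ** (1.0 / n_ts)
    return delta_initial * (1.0 - rate) ** t


# Hypothetical settings: 20 mm initial tolerance, 1 mm final, over 1000 timesteps.
for step in (0, 250, 500, 1000):
    print(step,
          round(linear_tolerance(step, 1000, 20.0, 1.0), 3),
          round(decay_tolerance(step, 1000, 20.0, 1.0), 3))
```

The decay curriculum tightens the tolerance much faster early in training than the linear one, so the agent spends more of its training steps working against tolerances close to the final value.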

C. Joint Representation
To improve learning sample efficiency, joint representations were investigated. Specifically, we compared proprioceptive (absolute) and egocentric (relative) joint representations, which differ in the reference from which each joint position is measured. Although proprioceptive representations are used often in classical control for robotics, egocentric joint representations are utilized heavily in reinforcement learning control simulation environments like the DeepMind control suite [24]. In the proprioceptive or absolute joint representation, all the joints are referenced from a common base reference. This is illustrated for rotations in Fig. 4a. However, in the egocentric or relative joint representation, only the inner tube is referenced from the base, as shown in Fig. 4b. The next outer tube is referenced from the previous inner tube and so forth. This can be used for both rotation and extension joints. For a three-tube system, the egocentric rotations are
$$\alpha_{ego} = \{\alpha_1, \alpha_2 - \alpha_1, \alpha_3 - \alpha_2\}$$
and extensions
$$\beta_{ego} = \{\beta_1, \beta_2 - \beta_1, \beta_3 - \beta_2\}.$$
To retrieve the absolute joint representation, the cumulative sum is taken as shown below:
$$\alpha_i = \sum_{j=1}^{i} \alpha_{ego,j}, \qquad \beta_i = \sum_{j=1}^{i} \beta_{ego,j}.$$
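The conversion between the two representations is a difference in one direction and a cumulative sum in the other. The sketch below uses hypothetical function names and made-up joint values, with tubes ordered innermost first.

```python
from itertools import accumulate


def to_egocentric(absolute):
    """Innermost joint stays base-referenced; each other joint is relative to the previous tube."""
    return [absolute[0]] + [outer - inner for inner, outer in zip(absolute, absolute[1:])]


def to_proprioceptive(relative):
    """Recover base-referenced joint values with a cumulative sum."""
    return list(accumulate(relative))


rotations_abs = [10.0, 25.0, 40.0]          # degrees, made-up values
rotations_ego = to_egocentric(rotations_abs)
print(rotations_ego)                         # [10.0, 15.0, 15.0]
print(to_proprioceptive(rotations_ego))      # [10.0, 25.0, 40.0]
```

The same two functions apply unchanged to the extension joints.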

D. CTR Simulation Environment
To collect data and experiences for the DeepRL algorithm to learn a policy, a simulation environment for kinematics was developed following the OpenAI Gym framework [25]. With this Gym framework, any compatible DeepRL algorithm can be utilized to learn a policy for the given task. The environment takes a set of tube parameters describing a CTR system, a joint configuration, and the actions selected by the agent to determine the overall shape of the CTR. For DeepRL, a large number of experiences are needed to train a policy, so a "sweet-spot" kinematics model, one that is relatively fast computationally and relatively accurate, was used. It was first presented in [26] and later presented for externally loaded CTRs with point and distributed forces and moments in [1], [12]. This model ignores friction, permanent strain, and forces along the backbone of the robot. The Gym framework is then used for experiments including inverse kinematics, path following, and domain randomization.
To summarize the results of our previous work, which investigated combinations of reward curricula and joint representations, the egocentric representation with the decay curriculum was found to perform best in evaluation. Training was done with the deep deterministic policy gradient (DDPG) algorithm [27] with hindsight experience replay (HER) [23]. Evaluation was done with 1000 evaluation episodes, with results shown in Table I. The success rate is defined as the number of successful trajectories over the total number of trajectories, where success is achieving a trajectory with below 1 mm error. In the simulation environment for these experiments, tube rotations were constrained from −π to π. Unconstraining the tube rotation will be investigated in the next section. With the learned policy, since the desired goal is within the state as shown in equation (6), a policy controller was created where the desired goal was updated via a trajectory generator. This type of controller is normally not available for other deep neural network approaches as they are not inherently timestep-based, whereas DeepRL aims to optimize actions at each timestep. Following various paths including circular and straight-line paths, the mean tracking error was 0.58 mm using the egocentric decay curriculum. To demonstrate robustness, tracking and joint noise of 0.8 mm and 1° were induced into the simulation, resulting in a 1.37 mm mean tracking error. However, with the tube rotation constraints, exploration may not cover the workspace in its entirety. Another factor is workspace size, which depends on the CTR system used for training. Larger workspaces will take longer computation time to train. First, performing a workspace and joint error analysis, we found that removing constraints on tube rotations provided a significant performance improvement, particularly in larger CTR systems. Second, we found generalization possible with the MDP formulation by appending a system identifier to the state. This generic policy would be useful as only a single policy would be needed for multiple systems, and it would be the first step towards full generalization for deep learning-based CTR kinematics. Moreover, this generalization motivates experiments using domain randomization for domain transfer from simulation to hardware. In the following, we describe the methods for the two main experiments: improvement through constraint-free rotation, and generalization.

E. Improvements with constraint-free tube rotation
With the best policy training method from the previous work, egocentric decay, state information such as the achieved goal, desired goal, Cartesian error, and joint error at the end of each episode from 1000 evaluation episodes was tabulated after training the policy on the larger CTR system 0. Plotting the Cartesian points of the achieved goal with RGB values corresponding to the Cartesian error to the desired goal results in Fig. 6a. Furthermore, by thresholding achieved goal points with Cartesian errors to the desired goal greater than 2 mm, regions of larger errors can be isolated. As seen, there is a large standard deviation in errors greater than 2 mm with constrained rotation. This motivates unconstraining tube rotations during DeepRL training for exploration. As shown in Fig. 6b, training with constraint-free rotation results in no errors greater than 2 mm; thus the standard deviation of errors is greatly reduced.
To investigate the joint values associated with these large errors, Fig. 6 shows the Cartesian achieved goals for the innermost tube, $\alpha_1$, at the end of each episode and the associated errors in the robot workspace. As seen, there is a large number of errors greater than 2 mm, with some points up to 30 mm in error. In Figs. 7a, 7b and 7c, a polar plot for each tube rotation confirms that the constraints cause the large errors: the largest outliers occur at the boundaries of −180° and +180°. In previous work, this rotation constraint was used to limit the joint space sampling during the generation of new desired goals, starting joint values and data collection, as has been implemented in previous CTR deep learning work [13]. However, this constraint through timesteps is non-essential in the trigonometric representation. Training a policy using the egocentric decay curriculum without constraining the rotations of the tubes considerably improved the error metrics over the previously constrained egocentric curriculum, from a mean error of 4.05 mm to 0.68 mm for the largest system 0.
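The reason the constraint is non-essential in the trigonometric representation can be seen with a single rotation step near the boundary. In this sketch, enforcing the constraint by clipping is an assumption made for illustration (the text does not state how the constraint is enforced), and the helper names are hypothetical; the 5° step matches the maximum rotation action.

```python
import math


def trig(alpha_deg):
    """(cos, sin) encoding of a rotation given in degrees."""
    a = math.radians(alpha_deg)
    return (math.cos(a), math.sin(a))


def wrap_deg(alpha_deg):
    """Wrap an unconstrained rotation into [-180, 180) for reporting only."""
    return (alpha_deg + 180.0) % 360.0 - 180.0


alpha = 178.0                       # current rotation, close to the +180 boundary
step = 5.0                          # maximum rotation action

constrained = min(max(alpha + step, -180.0), 180.0)   # clipped: stalls at the boundary
unconstrained = alpha + step                           # 183 degrees, allowed to pass through

print(constrained)                  # 180.0
print(wrap_deg(unconstrained))      # -177.0
# The encodings of 183 and -177 degrees coincide, so the state stays continuous.
print(trig(unconstrained), trig(wrap_deg(unconstrained)))
```

With clipping, a configuration just across the boundary can only be reached by rotating almost a full turn in the other direction, which is consistent with the largest errors clustering at ±180°.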
To further analyze the behavior of the agent with respect to error, a goal distance to Cartesian error analysis was performed from the 1000 evaluation episodes, and state information was tabulated. This analysis reveals the relationship between distance to the desired goal and the associated final error. Generally, a closer goal, i.e., a low initial goal distance, would be expected to have smaller errors overall, with farther goals having larger errors. For this analysis, all 1000 data points were used to determine the linear relationship between the initial goal distance and final Cartesian error. For constrained rotation with system 0, the slope was found to be 3.27 mm with a y-intercept of 0.8 mm. This suggests a poor inverse kinematics solver, as errors become very large with higher goal distances. With constraint-free rotation, the slope was found to be 0.6 mm with a y-intercept of 0.66 mm, a reasonable slope for such a relationship. As an inverse kinematics solver, our DeepRL method performs adequately when rotations are unconstrained. However, an important note is that only one solution is provided, as it is an iterative solver and is dependent on the initial joint configuration at the start of each episode. This can be remedied by running multiple episodes with different initial joint configurations. Visualized in Fig. 8 is an example of the same desired end-effector position with two different initial joint configurations, resulting in two different final inverse kinematics solutions. Using the constraint-free egocentric decay method for system 0, the desired goal position was (0, 50, 150) mm. The final joint configuration in Fig. 8a
Fig. 5. Visualizing the different CTR systems used with the geometrically exact modeling. Chosen were 4 different CTR systems, each composed of 3 tubes of various parameters resulting in different workspaces. The longest system is CTR system 0 (blue), with the shortest CTR system being system 3 (purple). System 2 (green) has the largest curvature resulting in a wide workspace. Finally, the CTR system 1 is in orange. The systems are ordered from longest to shortest. Each system in the figure is fully extended with the middle and inner tube rotated at 0° and 180° for a total of 4 configurations per system.
To verify the constraint-free results and to demonstrate training and evaluation of our DeepRL method, we applied it to three other CTR systems from various sources in the literature, trained each with constrained and constraint-free rotation, and performed 1000 evaluation episodes. The main changes to hyperparameters were 3 million training timesteps with 1.5 million steps for the curriculum, a buffer size of 500,000, and a neural network of 3 hidden layers with 256 neurons each. The results for all four CTR systems are presented in Table II. Of note is the increased standard deviation in constrained rotation. However, with smaller systems such as system 3, this is not as pronounced due to the smaller robot workspace. Compared with our previous work, there has been a large reduction in the mean and standard deviation of errors, as shown for the largest CTR system.

F. Generic Policy
To further motivate the utility of DeepRL for CTRs, we introduce an initial proof of concept for a CTR system generic policy. A current hurdle in using deep learning approaches for CTRs is the limited generalization across CTR systems. Because deep learning relies solely on the data collected, if only one CTR system is used for training, the learned policy will accurately control that system alone. Moreover, we aim to demonstrate that, for such a generic policy, our egocentric decay goal-based curriculum with constraint-free rotation improves error metrics compared with a method that does not employ these extensions.
This CTR generic policy will seek to generalize over four CTR systems that have different tube parameters. The objective is to obtain good performance across the CTR systems with a single control policy. For generalization, a system specifier, $\psi \in \{0, 1, 2, 3\}$, was appended to the state, $s_t$, for the agent to differentiate the CTR systems. The state is now defined as
$$s_t = \{\gamma_1, \gamma_2, \gamma_3, G_a - G_d, \delta(t), \psi\}.$$
At the start of each episode, a discrete uniform distribution is sampled to determine the value of $\psi$, and hence the CTR system parameters. These parameters are then set in the simulation environment and, for that episode, the selected system is the one simulated for the agent's task. Once the episode is reset, a new $\psi$ is sampled. Thus, over timesteps, all systems should be sampled, with the agent collecting experiences from all systems uniformly. We acknowledge this is not a true generalization, and the policy only learns the systems given; however, this initial proof-of-concept shows that some form of generalization is possible for CTRs using DeepRL, an attribute not shown in any previous work. Given the right network parameters, all attributes defining a CTR system could be included in the state to generalize fully. To demonstrate initial generalization, we train a single policy to generalize over two, three, and four systems, including different combinations. The systems are ordered 0 to 3, from the longest overall system to the shortest overall system. First, to validate our egocentric decay constraint-free method, we compared generalization with our constraint-free egocentric decay method to the constraint-free proprioceptive constant method and the constrained proprioceptive constant policy. We aim to demonstrate that our constraint-free egocentric curriculum is key to policy convergence for generalization. A full set of results for all combinations of generalization is provided for the different systems. Then, to demonstrate the learned policy, we present path following task results generalized over four systems, with and without sensor and encoder noise to display robustness.
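The per-episode sampling of the system specifier ψ can be sketched as follows. The class and function names are hypothetical, the state vector is a placeholder, and placing ψ as the last state element is an assumption based on the word "appended"; the authors' released environment may organise this differently.

```python
import random


class SystemSampler:
    """Draws the system specifier psi from a discrete uniform distribution at each reset."""

    def __init__(self, system_ids, seed=None):
        self.system_ids = list(system_ids)
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.choice(self.system_ids)


def append_system_identifier(state, psi):
    """Return the state with psi appended as the final element (assumed position)."""
    return list(state) + [float(psi)]


sampler = SystemSampler([0, 1, 2, 3], seed=0)
counts = {psi: 0 for psi in sampler.system_ids}
for _ in range(4000):               # one draw per simulated episode reset
    counts[sampler.sample()] += 1
print(counts)                       # roughly 1000 episodes per system

placeholder_state = [0.0] * 13      # 3 tubes x 3 values, 3 goal-error terms, 1 tolerance
print(len(append_system_identifier(placeholder_state, psi=2)))
```

Using the integer ψ directly as an input is the simplest encoding; a one-hot vector would be an alternative if the ordinal relationship between systems is not meaningful.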
We make our gym environment and the code to reproduce these results available online. 1

IV. EXPERIMENTS AND RESULTS
Thus far, we have verified that constrained tube rotation causes large errors at the boundary of the constraints. This is due to constraints on rotation resulting in inadequate exploration. Furthermore, we verified the method as an inverse kinematics solver by solving for a desired goal position from two different initial joint configurations, which resulted in two different inverse kinematics solutions. As mentioned previously, the DeepRL method can also perform path following. To validate this, experiments in simulation were performed following various paths with the constraint-free egocentric decay DeepRL method. Next, we investigate the system identifier for generalization, and present error metrics and path following results.
In a surgical scenario, the surgeon would be controlling the end-effector tip position with a haptic device similar to a Phantom Omni from Sensable Technologies. Thus, the inputs are Cartesian coordinates of the end-effector tip position, which fit well with the state description of our DeepRL method. There are therefore two tasks: the single-point inverse kinematics task seen previously, and the path-following tasks described here.
The experimental framework for path-following tasks is as follows. First, x, y, z desired goal positions are generated by a path generator component to substitute control inputs from a user. This component takes as input path shape parameters and the discretization parameter. Shapes include polygons, circular and helix paths. The generator outputs a series of x, y, z desired goal positions of the path with the path parameters given. The next component is the policy controller, which takes two inputs: the desired goal positions and the initial joint configuration. The controller is described in detail in the next section. The controller outputs the changes in joint positions to achieve the path by reaching each of the desired goals given by the path generator. The controller is open-loop since information about whether the goal is reached is not relayed back. There are a number of steps given to reach the goal, and even if the goal is not reached, the next goal is given. Finally, the last component is the CTR simulation. The simulation takes an input of changes in joint positions, performs kinematics for each step, and returns the achieved goal positions of the end-effector as well as the full CTR backbone shape, with which the resulting path followed can be visualized. The entirety of the validation path following framework can be found in Fig. 9.
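A path generator of the kind described above can be sketched as functions that turn shape parameters and a discretization parameter into a list of x, y, z desired goals. The function names, parameterization and example dimensions are hypothetical and only illustrate the circular and helix cases.

```python
import math


def circle_path(center, radius, n_points):
    """Discretize a circle in the plane z = center[2] into n_points desired goals."""
    cx, cy, cz = center
    return [
        (cx + radius * math.cos(2.0 * math.pi * k / n_points),
         cy + radius * math.sin(2.0 * math.pi * k / n_points),
         cz)
        for k in range(n_points)
    ]


def helix_path(center, radius, height, turns, n_points):
    """Helix rising by `height` over `turns` revolutions (requires n_points >= 2)."""
    cx, cy, cz = center
    path = []
    for k in range(n_points):
        s = k / (n_points - 1)                  # path parameter in [0, 1]
        angle = 2.0 * math.pi * turns * s
        path.append((cx + radius * math.cos(angle),
                     cy + radius * math.sin(angle),
                     cz + height * s))
    return path


# Made-up dimensions in mm.
circle = circle_path(center=(0.0, 0.0, 100.0), radius=10.0, n_points=50)
helix = helix_path(center=(0.0, 0.0, 100.0), radius=10.0, height=30.0, turns=2, n_points=100)
print(circle[0], helix[-1])
```

The discretization parameter matters for the controller: goals spaced further apart than the policy can cover in its step budget will be skipped over, since the next goal is issued regardless.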

A. Policy Controller
The policy controller component acts as an open-loop controller that takes as input a series of desired goal positions. The main control occurs while iterating over the given desired goal positions. For each desired goal, the environment is first initialized with a reset function that returns the first state. The reset function takes the system identifier and desired goal as inputs. Next, in the main for loop of the controller, the policy is given 20 timesteps to achieve the current desired goal with actions from the policy function.
If the agent achieves the current desired goal within the 1 mm tolerance before 20 timesteps have elapsed, the loop breaks and the next desired goal is set. This is outlined in Algorithm 1.
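The control loop described above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the paper's Algorithm 1: `ToyTipEnv` and `toy_policy` are hypothetical stand-ins for the CTR simulation and the learned policy (the toy environment moves a point directly and contains no tube kinematics), and the `reset`/`step` interface is assumed for the example. Only the 20-timestep budget and the 1 mm tolerance come from the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]
State = Tuple[Vec, Vec]  # (tip position, desired goal)


class ToyTipEnv:
    """Hypothetical stand-in for the CTR simulation (no tube kinematics)."""

    def __init__(self, start: Vec = (0.0, 0.0, 0.0)) -> None:
        self.tip = start
        self.goal = start

    def reset(self, goal: Vec) -> State:
        """Set the desired goal and return the first state."""
        self.goal = goal
        return self.tip, self.goal

    def step(self, action: Vec) -> State:
        """Apply a change in position and return the new state."""
        self.tip = tuple(p + a for p, a in zip(self.tip, action))
        return self.tip, self.goal


def toy_policy(state: State) -> Vec:
    """Placeholder policy: move halfway towards the goal each step."""
    tip, goal = state
    return tuple(0.5 * (g - p) for p, g in zip(tip, goal))


def follow_path(env: ToyTipEnv, policy: Callable[[State], Vec],
                goals: Sequence[Vec], max_steps: int = 20,
                tol_mm: float = 1.0) -> List[Vec]:
    """Visit each goal in turn and return the achieved tip positions."""
    achieved = []
    for goal in goals:
        state = env.reset(goal)
        for _ in range(max_steps):
            state = env.step(policy(state))
            if math.dist(state[0], goal) < tol_mm:
                break  # goal reached within tolerance; move to next goal
        achieved.append(state[0])  # recorded even if the goal was missed
    return achieved
```

Note that the next goal is issued whether or not the current one was reached, which matches the open-loop behaviour described in the text.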

B. Single System Validation
To validate our constraint-free, egocentric decay curriculum DeepRL method, we train on different CTR systems and present error metrics for inverse kinematics and path following for each system. This was done to verify that the same method can be applied to various systems, resulting in an accurate learned policy. Additionally, we add state noise in the form of encoder and end-effector position noise, to demonstrate that the learned policy is somewhat resilient to noise in the state. For simplicity, a standard deviation of 1° was selected. To determine the extension joint noise, a gear ratio of 0.001 was used. For achieved goal tracking noise, a standard deviation of 0.8 mm was used, based on precision data found in the documentation of an EM tracker (Aurora, NDI Inc., CA).
Algorithm 1 (policy controller; listing only partially recoverable): for i ← 1 to N do ... for t ← 0 to 20 do ... a_current ← Policy(s_current) ... end for ... return A[ ] ... end function
Secondly, to provide a pathway to hardware translation, we include initial results for domain randomization. In domain randomization, to transfer the policy from the source domain (simulation) to the target domain (hardware), a series of environmental parameters in the source domain is sampled from a randomized space. During training, the episodes are collected in the source domain with randomization applied. This exposes the policy to a variety of environmental parameters so that it can generalize. In this way, the policy is trained to maximize the expected reward over a distribution of CTR configurations around the desired CTR configuration. Specifically, uniform domain randomization [28] is implemented for the tube parameters specified in the simulation. These tube parameters include length, curved length, inner diameter, outer diameter, stiffness, torsional stiffness, and precurvature. In uniform domain randomization, an interval from which each tube parameter is uniformly sampled must be defined. For example, for a randomization value of 5% applied to the inner diameter of system 0, the lower bound would be 0.7 − 0.05 × 0.7 = 0.665 and the upper bound would be 0.7 + 0.05 × 0.7 = 0.735. The CTR configuration chosen for the domain randomization experiment was that of system 2, with a domain randomization value of ±5% for each parameter.
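The interval computation in the inner-diameter example generalizes directly to every tube parameter. The sketch below is an illustration of uniform domain randomization under assumptions: the function names and the dictionary layout are hypothetical, and only the 5% value and the 0.7 nominal inner diameter are taken from the text.

```python
import random
from typing import Dict, Optional, Tuple


def randomization_bounds(nominal: float, fraction: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds at nominal ± fraction * |nominal|."""
    delta = fraction * abs(nominal)
    return nominal - delta, nominal + delta


def sample_tube_parameters(nominal: Dict[str, float],
                           fraction: float = 0.05,
                           rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Draw each parameter uniformly from its randomization interval.

    Intended to be called once per training episode so the policy sees a
    distribution of tube parameters around the nominal configuration.
    """
    if rng is None:
        rng = random.Random()
    return {name: rng.uniform(*randomization_bounds(value, fraction))
            for name, value in nominal.items()}
```

With `randomization_bounds(0.7, 0.05)` this reproduces the bounds 0.665 and 0.735 from the example above, up to floating-point rounding.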
To evaluate this translated domain policy, we performed path-following tasks and inverse kinematics on system 2 tube parameters. Although more sophisticated methods of domain translation exist, as an initial work, we believe this demonstrates the feasibility of translating the trained policies to hardware. In 1000 evaluation episodes, a mean error of 0.86 mm was found with a standard deviation of 0.64 mm. Using a straight-line path to test path following, the mean tracking error was 1.10 mm with a standard deviation of 0.15 mm. We compare these results to the state of the art in the next section. Without domain randomization, training results for inverse kinematics are summarized in Table II under constraint-free experiments. For path following, system 0 had error metrics of 0.66 ± 0.28 mm for a helix path in a noise-free simulation and 1.74 ± 0.72 mm in a noise-induced simulation. System 0 is the longest system, with the highest errors, and was chosen for results, evaluation, and comparisons.
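The tracking error statistics reported above can be computed from the desired and achieved goal positions along a path. The following is a minimal sketch assuming the error is the Euclidean distance between each desired goal and the corresponding achieved tip position, and assuming a population standard deviation; the source does not state which estimator was used, and the function names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def tracking_errors(desired: Sequence[Point],
                    achieved: Sequence[Point]) -> List[float]:
    """Euclidean distance (in the input units) for each goal point."""
    if len(desired) != len(achieved):
        raise ValueError("desired and achieved must have the same length")
    return [math.dist(d, a) for d, a in zip(desired, achieved)]


def summarize(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of the errors."""
    return statistics.fmean(errors), statistics.pstdev(errors)
```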

C. Generic Policy Validation
To validate our generic policy method, we trained a generic policy for different combinations of two, three, and four CTR systems. For example, because we have four different CTR systems available to train on, generalizing over two CTR systems means training on all combinations of two systems, resulting in a total of six experiments. Similarly, generalizing over three systems results in four experiments, and generalizing over all four CTR systems is a single experiment. Performing 1000 evaluation episodes and summarizing the error metrics shows that the proposed method is able to generalize over multiple systems. The full error metrics are shown in Table III. Looking at the error metrics with respect to systems, there is a correlation between the length of the system and higher errors, similar to the previous constrained and constraint-free experiments. System 0 is the longest, and consistently has the largest error metrics. We believe this is because of the workspace size: as overall length increases, the agent requires more training steps and experiences. Another factor is that, in general, longer CTR systems have larger errors, and thus comparisons are done with the percentage of robot length. To mitigate this effect, a sampling strategy is used where the sampling of the system used in the environment is based on the lengths of the systems. In this length-based sampling strategy, the categorical distribution is proportional to the length of each system: each system's probability is its length divided by the sum of the lengths of the systems being generalized. In this manner, systems that are longer and have larger workspaces are sampled more often during training and contribute more experiences for the policy to train on. Evaluating this sampling strategy for generalizing over four CTR systems, the error metrics improved over the previous uniform sampling. To validate the generalization, we performed a helix path following task with system 0, as seen in Fig. 10. The error metrics were a mean tracking error of 1.01 mm and a standard deviation of 0.41 mm with 50 desired goal points in the path. In a noise-induced experiment, the mean tracking error was 1.86 mm with a standard deviation of 0.8 mm. When the number of points was increased to 100, the mean tracking error was 0.91 mm with a standard deviation of 0.41 mm. In a noise-induced simulation, the mean tracking error was 1.91 mm with a standard deviation of 0.78 mm.
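The experiment counts and the length-based sampling strategy described in this section can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, and any lengths passed in are supplied by the caller rather than taken from the paper's tables.

```python
import random
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def system_combinations(names: Sequence[str], k: int) -> List[Tuple[str, ...]]:
    """All ways of choosing k systems to generalize over."""
    return list(combinations(names, k))


def length_based_probabilities(lengths: Dict[str, float]) -> Dict[str, float]:
    """Categorical distribution proportional to each system's length."""
    total = sum(lengths.values())
    if total <= 0:
        raise ValueError("total length must be positive")
    return {name: length / total for name, length in lengths.items()}


def sample_system(lengths: Dict[str, float], rng: random.Random) -> str:
    """Pick the system for the next training episode, weighted by length."""
    names = list(lengths)
    weights = [lengths[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

With four systems, `system_combinations` yields six pairs, four triples, and one quadruple, matching the experiment counts above. For placeholder lengths of 400, 300, 200, and 100 (illustrative values only), the sampling probabilities would be 0.4, 0.3, 0.2, and 0.1.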
As a baseline for our generic egocentric constraint-free method, we also train four-system generic policies using a proprioceptive representation, one with constrained rotation and one with constraint-free rotation. We summarize the results in Table V. Of note is the importance of removing rotation constraints, as seen in the mean and standard deviation of errors from constrained proprioceptive to constraint-free proprioceptive; error metrics are greatly reduced with this improvement. Finally, using an egocentric representation does improve metrics, but has a less significant impact than removing rotational constraints.

D. Comparisons to the State of the Art
To compare our inverse kinematics and path following results to previous state-of-the-art classical and deep learning methods, we convert errors to a percentage of robot length. First, we present inverse kinematics results for our constraint-free egocentric decay method for each system when trained separately, i.e., not with the generic policy. The mean and standard deviation as a percentage of robot length for each system were 0.16% ± 0.065%, 0.17% ± 0.18%, 0.20% ± 0.08% and 0.3% ± 0.13%. When taken as a percentage of the robot length, the similarity of the error metrics is noteworthy. With our generalization method, the four-system generalization inverse kinematics errors for each system are 0.18% ± 0.02%, 0.19% ± 0.02%, 0.24% ± 0.02% and 0.42% ± 0.04%. For tip tracking with the more complex helix path, errors were 0.23% ± 0.095% for system 0 with 50 goal points. Increasing the number of desired goal points to 100 gave mean tracking errors of 0.20% ± 0.08%, as seen in Fig. 10. As reported in §II, Jacobian-based methods can achieve errors of 0.5% to 0.9%. As shown in Fig. 12, our DeepRL method is able to avoid joint limits, especially in extension, whereas the Jacobian approach does not include joint limits in the linearization. This comparison was done on system 0 with K_p = 2I for a linear and a circular path. The error metrics found were 1.15 ± 0.32 mm for the Jacobian method and 0.62 ± 0.07 mm for our DeepRL method in Fig. 12a. More importantly, the Jacobian-based method was unable to follow some circular, linear, and helical trajectories that our DeepRL method successfully completed, because the Jacobian formulation does not include joint limits, even when the damped least-squares method was used with Λ = 0.45, as seen in Fig. 12b. The advanced MPC method can achieve tip errors of 0.3% to 0.5%; however, we were unable to perform comparisons in simulation as the open-source simulation code was for a two-tube system. Our method does perform comparably to the reported state of the art in simulation; however, this work is only in simulation and does not include constraints for snapping conditions in the model used. The approach will need to be validated in hardware. One possible way to consider snapping and singularity is to design a dense reward function that includes the minimization of elastic energy, hence avoiding snapping conditions. For our domain randomization results, as a percentage of robot length, the inverse kinematics errors were 0.86% ± 0.21%. Following a straight-line path, the errors were 0.36% ± 0.05%. The domain randomization metrics are higher; however, the aim is to demonstrate the feasibility of the transfer method. To validate it, we will need to compare transfer to hardware with and without domain randomization or other transfer methods.
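For context on the Jacobian baseline used in this comparison, a generic damped least-squares update is sketched below. It is a textbook form under stated assumptions, not necessarily the exact formulation of the compared method: the damping is assumed to enter as Λ² on the diagonal, the gain K_p = 2I is applied as a scalar gain, and the function names are hypothetical. The Jacobian itself would come from the CTR kinematic model, which is not reproduced here.

```python
import numpy as np


def dls_step(jacobian, tip_error, gain: float = 2.0, damping: float = 0.45):
    """Damped least-squares joint update.

    Computes dq = J^T (J J^T + damping^2 I)^{-1} (gain * e), where e is the
    Cartesian tip error. Joint limits do not appear anywhere in this update.
    """
    J = np.asarray(jacobian, dtype=float)
    e = np.asarray(tip_error, dtype=float)
    regularized = J @ J.T + (damping ** 2) * np.eye(J.shape[0])
    return J.T @ np.linalg.solve(regularized, gain * e)


def exceeds_limits(q, lower, upper) -> bool:
    """Check whether any joint value lies outside its [lower, upper] range."""
    q = np.asarray(q, dtype=float)
    return bool(np.any(q < np.asarray(lower)) or np.any(q > np.asarray(upper)))
```

Because the update depends only on the Jacobian and the tip error, a commanded step can push a joint past its range; a separate check such as `exceeds_limits` would be needed to detect this, which is consistent with the limitation of the Jacobian approach noted above.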

V. CONCLUSION
In this work, we investigate DeepRL as an end-to-end method for kinematic control of CTRs. Specifically, we explore constraints on tube rotation and their impact on error metrics. Furthermore, we develop the first proof-of-concept in system generalization, which has not yet been done for any deep learning method for CTRs. Finally, as initial work towards hardware, we provide an initial pathway for domain transfer, or policy transfer, from simulation to hardware using Sim2Real domain randomization. Our method demonstrates error metrics that perform well; however, validation in hardware is needed. Moreover, other domain transfer methods should be explored. Unaddressed in this work is the issue of snapping, whereby torsion in the tubes causes a rapid transition to a lower energy state, resulting in a snapping motion. By creating simulation frameworks that utilize dynamics models [29], snapping configurations could be avoided through rewards. We believe that, with this work, we have demonstrated that DeepRL methods may be able to outperform model-based methods for inverse kinematics and control of CTRs, similar to deep learning methods for forward kinematics and shape estimation.

Fig. 1. A real set of concentric tubes [5] (a) and an equivalent illustration with tube actuation, workspace, and tip position.

Fig. 2. Extension β_i and rotation α_i for tube i of a 3-tube CTR system, where L_i is the overall length and s is the arc length, or axis, along the backbone.

Fig. 6. Achieved goal positions from evaluations of a policy trained in a simulation environment with constrained tube rotation (a) and constraint-free tube rotation (b). Each plot is filtered to errors greater than 2 mm from the desired goal of the episode to illustrate large errors. In (b), no errors are greater than 2 mm, demonstrating a low error variance.

Fig. 9. Illustration of the process by which paths are generated, control actions are determined, and finally, paths followed.

Fig. 12. System 0 following a circular path (a) and linear path (b) comparing Jacobian (blue) methods to our Deep RL method (purple) for path following.
This article has been accepted for publication in IEEE Transactions on Medical Robotics and Bionics. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TMRB.2023.3310037. © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

TABLE I. JOINT REPRESENTATION WITH CURRICULUM RESULTS.

TABLE III. ERROR METRICS FOR SYSTEM GENERALIZATION FOR THE VARIOUS NUMBER OF SYSTEMS AND COMBINATION OF SYSTEMS. ERRORS PRESENTED IN MILLIMETERS, MEAN ± STANDARD DEVIATION.

TABLE IV. FOUR CTR SYSTEMS WITH VARIED PHYSICAL PARAMETERS. THIS TABLE INCLUDES THE LENGTHS AND CURVATURES.

TABLE V. EXPERIMENTAL COMPARISONS OF ROTATIONAL AND JOINT REPRESENTATION PER SYSTEM.
Danail Stoyanov Dan Stoyanov is a Professor at UCL Computer Science holding a Royal Academy of Engineering Chair in Emerging Technologies and serving as Director of the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS). He graduated from King's College London before completing his PhD at Imperial College London. His research interests are focused on surgical robotics, surgical data science, and the development of surgical AI systems for clinical use.