Introduction
With recent advances in artificial intelligence and large language models enabling convincing conversation with nonhuman agents, where are robots still needed in social interactions? One such place is within the realm of pair-bonding, where embodied robots, or robot-like devices, are already helping couples mediate long-distance relationships [1]. Labs around the world researching touch enabled by embodied robots have also found beneficial effects when people interact socially with, or in the company of, these robots [2].
The beneficial effects of touch in human-robot interaction (HRI) can include positive behavior changes and physical and mental support [3], [4], [5], [6]. To properly elicit such responses, the details of the touch must be carefully considered. One detail to consider is the location of touches. In the robot studies mentioned thus far, the researchers primarily focused on hand-to-hand touch. However, others have investigated more intimate interactions such as hugging [7], [8], [9], [10].
Since the face is rarely studied as a touch location in HRI, we picked it as one initial touch target to begin demonstrating the possibility of studying challenging touch interactions. How people will react to robots touching their faces is still largely unknown. In human-human interactions, people only allow their faces to be touched when relationships are close [11], [12]. Similarly, another study reported that people in strong relationships, such as family members, lovers, and friends, touch each other's faces [13]. Since the reception of face touches appears to be so closely linked to relationships, enabling face touch will make future studies on the perceived relationships between robots and humans more thorough.
In addition to the face, we chose to also target the arm, both to compare our system's performance more directly with previous studies and to demonstrate a broad range of touch locations. As an added benefit, as seen in [11] and [14], the arm is a less intimate touch location than the face, which gives us more options for future studies.
Another detail to consider is which kind of body a robot should have for touch. There are many options for robot-to-human touch, from traditional robotic arms to animaloid robots. However, we would like to build and demonstrate a system that could later be integrated into other kinds of human communication, such as hugging, talking with facial expressions, or perhaps handshakes. Thus, an android with human-like abilities is required. Since human-likeness is preferred with respect to robot size [7] and trustworthiness [15], and becomes increasingly necessary the more human behaviors are to be replicated, we chose to study touch with the life-like, humanoid android SOTO [16]. This choice of android does not come without limitations, due to potential uncanny valley effects and because touches performed by a human have been shown to be preferred even to those by imitation hands [17]. We believe these limitations are outweighed by the ability to study other modes of communication alongside human-like touch in future studies.
After deciding on the robot and initial touch targets, we created and evaluated a system that tracks a human in real time with a single color and depth camera and directs the android to reach out and touch anywhere along the person's left arm or left cheek. Fig. 1 shows the android touching a participant's face during one trial.
We seek to answer the following research questions (RQ) through the development and evaluation of this system:
RQ1:
Can we enable a humanoid android to touch people's faces and arms in a way that is both adaptive to participants' postures and subjectively accurate?
RQ2:
Is adaptive touch perceived to be at least as natural as preprogrammed touch behavior in previous studies with a similar android?
Related Works
A. Android-to-Human Face Touch
Face touch is one of the least investigated aspects of touch HRI, likely because its increased intimacy leaves fewer contexts in which it is appropriate. Here we consider two notable examples.
In [18], a philosophical and critical exploration of the authors' art video, we are shown an example of face touch in the form of a lipstick-application robot. The robot used was a KUKA KR10 arm. Lipstick was attached to the robot's end effector and successfully applied to a woman's lips. This touch action was intended to be very intimate and to explore where the boundaries between human and machine may lie.
A brushing robot was used in [19] in another example of face touch to investigate how repeated stroking could reduce pain response. The robot used was a single axis rotating brush that could be raised or lowered along a second axis.
In both examples, the face touch was done with robots that were not humanoid and that followed preprogrammed motion plans. While well suited for the goals of contrasting a woman to a machine and for accurate and repeatable touches, these robots would not be suitable for investigating android-to-human face touch social dynamics as we would like to in future studies.
B. Adaptive Robot-to-Human Touch
Some touch studies have demonstrated fully integrated systems that can create an overall natural and adaptive touch experience. The importance of having the robot's body, sensors, and touch behavior work together to create such an experience is described in [7]. In a similar hug robot study, cameras and touch sensors were also used to start and stop touches when a human started and stopped a hug [20]. Unlike [7], we wish to perform more than one kind of touch action, and unlike both, we use an android, and our system can observe participants before each touch so that changes in their posture can be accounted for.
Hand and forearm touch was investigated in one previous study using an android very similar to ours [21]. However, in this and other studies with that android, [3], [22], the touch target was predetermined: participants were required to place their hands on marks on tables. This kind of marker-based touch is repeated in other studies involving arm and hand touch, such as [23] and [24]. An interesting exception with very dynamic touch is [25], where researchers built a system that could touch a person's arm in many ways with an imitation hand attached to a KUKA robot arm; the arm was tracked with a Kinect camera to direct the robot's hand. Like that study, we use a vision-based system to detect human position, but we use an android similar to the one in [3], [21], and [22] to open up the possibility of adding other communication methods on top of our touch system.
Developed System
The full system diagram in Fig. 2 shows how our developed system's components are related. The pose detection system tracks both the human and the android to send pose keyframes to the control system, the human commands the android via the input interface, and the android touches and senses the human.
A. Android
We use the SOTO robot, a realistic, male-looking android [16]. It was based on the earlier ERICA robot, described in [26], and thus shares the same core mechanics and purpose. SOTO has a total of 55 degrees of freedom (DoF) of which the 14 leg joints are passive. The face and eyes account for 13 DoF. The arms and wrists can move with 7 DoF and each hand is able to actuate all fingers. Aside from the eyes, the joints are pneumatically actuated, allowing for soft contacts.
Capacitive touch sensor pads were installed under the skin of SOTO's right hand, which measures approximately 19.5 cm from the base of the palm to the tip of the middle finger and 9.0 cm across the top of the palm, to detect contacts. The sensor array consists of six pads, one for each fingertip and one for the palm. The pads are made of aluminum tape with an insulated, stranded, 24 AWG wire soldered to each to connect to a control circuit board. The finger pads are approximately 1 cm x 2.5 cm each and the palm pad is approximately 3 cm x 3 cm. They are taped to the internal skeleton of the robot to prevent them from sliding out of their intended positions. The circuit is a PIC16F1827-I/P on a board designed for [27], a project to develop a sensor suit that detects when and how an android is touched.
The capacitive sensors' values are first passed through a digital low-pass filter with a 1 Hz cutoff frequency. The outputs are then checked against a threshold above per-sensor baselines to determine whether a touch has occurred. The baselines can be recalibrated at any time by averaging each sensor's value individually over 10 seconds. The threshold is set manually and can be adjusted at any time. A contact is considered to be made when the outputs of three fingers simultaneously exceed the threshold.
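To make this pipeline concrete, the following Python sketch illustrates the detection logic under our assumptions: a simple first-order digital filter stands in for the 1 Hz low-pass stage, the sampling rate and threshold value are placeholders, and the pad ordering (five fingertips followed by the palm) is ours rather than the firmware's.

import math

class TouchDetector:
    """Sketch of the capacitive touch pipeline: 1 Hz low-pass filter,
    per-pad baselines, and a 'three fingers over threshold' contact rule."""

    def __init__(self, n_pads=6, cutoff_hz=1.0, sample_hz=50.0, threshold=30.0):
        rc = 1.0 / (2.0 * math.pi * cutoff_hz)
        dt = 1.0 / sample_hz
        self.alpha = dt / (rc + dt)          # first-order IIR coefficient
        self.filtered = [0.0] * n_pads
        self.baselines = [0.0] * n_pads
        self.threshold = threshold           # manually tuned, adjustable at runtime

    def calibrate(self, frames):
        """Average each pad over a window (e.g., 10 s of frames) to set baselines."""
        n = len(frames)
        self.baselines = [sum(f[i] for f in frames) / n for i in range(len(self.baselines))]

    def update(self, raw):
        """Filter one frame of raw pad readings and return True on contact."""
        for i, value in enumerate(raw):
            self.filtered[i] += self.alpha * (value - self.filtered[i])
        # Pads 0-4 are fingertips, pad 5 is the palm (assumed ordering).
        fingers_over = sum(
            1 for i in range(5)
            if self.filtered[i] - self.baselines[i] > self.threshold
        )
        return fingers_over >= 3             # contact = three fingers above baseline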
B. Pose Detection
To detect a person in 3D space so that the android can accurately touch and make eye contact with him or her, we use a color and depth sensing camera in combination with an open-source, 2D pose detection program. The camera is an Intel RealSense D435i mounted above and behind the android [28]. The pose keypoints are generated by OpenPose, a machine-learning-based approach to pose detection [29], [30], [31], [32]. Fig. 3(a) shows the view from the camera's perspective as well as a visualization of the generated pose overlayed onto the frame.
Fig. 3. (a) The view from the RealSense camera with the OpenPose pose overlay; note that both the human (right) and the android (left) are detected by OpenPose. (b) The android in the ready position with the control interface displayed on a laptop on its lap.
Since the output of OpenPose is only 2D, its output is combined with the depth data from the camera using the Intel RealSense SDK [33]. The pose and depth combining program we wrote is available at [34].
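As a rough illustration of this fusion step, the sketch below pairs OpenPose pixel keypoints with aligned depth readings using the pyrealsense2 bindings of the RealSense SDK; the keypoint format, confidence cutoff, and function names are our assumptions, and the released tool [34] may be organized differently.

import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start()
align = rs.align(rs.stream.color)            # align depth pixels to the color image

def keypoints_to_3d(keypoints_2d):
    """keypoints_2d: list of (x, y, confidence) from OpenPose, in color-image pixels."""
    frames = align.process(pipeline.wait_for_frames())
    depth = frames.get_depth_frame()
    intrin = depth.profile.as_video_stream_profile().get_intrinsics()
    points = []
    for x, y, conf in keypoints_2d:
        if conf < 0.3:                       # skip weakly detected keypoints
            points.append(None)
            continue
        z = depth.get_distance(int(x), int(y))   # depth in meters at that pixel
        # Deproject the pixel and its depth into a 3D point in the camera frame.
        points.append(rs.rs2_deproject_pixel_to_point(intrin, [float(x), float(y)], z))
    return points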
C. Input
There are two input graphical user interfaces (GUIs): a remote interface that can be run on a laptop or an Android device, and the main controller that runs on a desktop. The GUIs are connected over the network. Both were built in the Godot game engine, a lightweight, free, and open-source tool [35].
The remote interface has two sections, as can be seen in Fig. 3(b): a prominent stop button and a human figure with a highlighted arm and face. The face selection has a single positional option, whereas the arm region allows selection anywhere along the line from the shoulder to the wrist. The highlighted sections on the human figure allow an operator to request that a location be targeted for a touch. The number of touches completed out of the total requested is shown in the corner of the screen.
The desktop interface has the same controls as the remote interface in addition to others, which allow a person at the desktop to request that the remote operator reselect a touch, begin event logging, adjust settings, or start and stop the sub-programs.
D. Control
To control the android, the sensor data and human inputs are combined in a control program. For every frame of pose data received, approximately once every 1/12 s, the program performs the following tasks:
Update and sort the people detected to determine which is the android and which is the closest person to be targeted and looked at. Updating involves matching people observed in this frame with previously seen people: if an observed head is very near a previously seen head, the two detections belong to the same person (or to the android). The android is identified as the person detected at its known position. Additionally, the positions of all keypoints are averaged over the previous three observations to smooth out detection noise. Keypoints that are too far from their owners are removed in this step; such outliers arise when an observation lies close to the edge of a person, as sometimes happens with noses seen in profile. (A sketch of this matching and smoothing step is given after this list.)
Check and process the current command from the two user interfaces.
If the android is being commanded to move, then the program sends the corresponding command to move to the android controller and checks for touch sensor input if necessary.
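The following Python sketch illustrates the matching and smoothing step from the first task above. The matching radius, the android's assumed head position, the keypoint outlier distance, and the data layout are illustrative values rather than those of the actual control program.

from collections import deque
import math

HEAD_MATCH_RADIUS = 0.25              # m; "very near" head-to-head distance (assumed)
ANDROID_HEAD_POS = (0.0, 0.4, 1.0)    # known, fixed android head position (example)
KEYPOINT_MAX_DIST = 0.6               # m; keypoints farther from the head are dropped

def dist(a, b):
    return math.dist(a, b)

class TrackedPerson:
    def __init__(self, pose):
        self.history = deque([pose], maxlen=3)   # last three observations

    @property
    def head(self):
        return self.smoothed()["head"]

    def smoothed(self):
        # Average each keypoint over the stored observations.
        out = {}
        for name in self.history[-1]:
            pts = [p[name] for p in self.history if p.get(name)]
            out[name] = tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else None
        return out

def update_tracks(tracks, detections):
    """detections: list of {keypoint_name: (x, y, z)} dicts for this frame."""
    for det in detections:
        head = det.get("head")
        if head is None:
            continue
        # Drop keypoints implausibly far from the detected head
        # (e.g., depth read at a silhouette edge, as with noses seen in profile).
        det = {k: v for k, v in det.items()
               if v is None or dist(v, head) < KEYPOINT_MAX_DIST}
        match = next((t for t in tracks if dist(t.head, head) < HEAD_MATCH_RADIUS), None)
        if match:
            match.history.append(det)
        else:
            tracks.append(TrackedPerson(det))
    # The android is the track at the known android position; the rest are humans.
    android = min(tracks, key=lambda t: dist(t.head, ANDROID_HEAD_POS), default=None)
    humans = [t for t in tracks if t is not android]
    return android, humans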
When requested by either GUI, the commands to move the android's hand and arm are sent as JSON over TCP to another computer running the android's software, which converts the 3D position, rotation, and speed of the hand commands into the necessary motor inputs for the robot.
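For illustration, a command of this kind might be sent as in the sketch below; the JSON field names and the newline-terminated framing are our guesses at the sort of payload described here, not the android controller's actual schema.

import json
import socket

def send_hand_command(host, port, position, rotation, speed):
    """Send a right-hand move command (3D position in m, rotation, speed in m/s)
    as JSON over TCP to the computer running the android's motor software."""
    command = {
        "hand": "right",
        "position": position,      # e.g., [0.35, -0.10, 0.95]
        "rotation": rotation,      # e.g., Euler angles in degrees
        "speed": speed,            # 0.08 for face touches, 0.10 for arm touches
    }
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps(command) + "\n").encode("utf-8"))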
When a touch is commanded, the target location becomes fixed in 3D space at the person's position at the moment of the command. The android moves its right hand according to Algorithm 1 at 8 cm/s for face touches and 10 cm/s for arm touches. These speeds result in touches that take approximately 5-7 seconds from command to contact. The constants in the algorithm are defined as follows:
Target is the location in 3D space the android will move its hand to, selected to be on a person's body.
Cheek_Offset / Face_Angle are heuristically chosen modifiers relative to the ear keypoint that allow the android's hand to rest nicely on the target person's cheek. The offset is 10 cm to the left from the android's perspective, and the angle positions the hand with its thumb pointing up and fingers pointing 50 degrees above the horizon.
Extra_Distance is the maximum distance that the hand will travel past the Target, in the direction of the Target, if no contact has been sensed by the time the hand reaches the Target. It was chosen to be 10 cm.
Midpoint_Margin is an offset, from the android's perspective, of 15 cm to the right and 6 cm toward itself, so that the trajectory of the hand is arc-like.
Horizontal_Limit_Angle is a threshold of 55 degrees with respect to the horizon; when the target limb's angle to the floor falls below it, the hand is rotated to approach the Target from the top instead of from the side.
Algorithm 1: Face and Arm Touch Control.
Do Once:
    If face touch commanded:
        Target = left ear keypoint + Cheek_Offset
        Rotate hand to Face_Angle
    Else If arm touch commanded:
        Midpoint = Point_Between(android shoulder, Target)
        Midpoint += Midpoint_Margin
        Move_Right_Hand_To(Midpoint)
        If target limb angle to floor < Horizontal_Limit_Angle:
            Rotate hand roll to be parallel to horizontal
            Rotate hand yaw to be parallel with Target
Stop if touch detected OR extra motion done:
    First: Move_Right_Hand_To(Target)
    Next: Move_Right_Hand_To(Target + Extra_Distance)
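A minimal Python rendering of Algorithm 1 is given below for concreteness. The coordinate axes and signs of the constants, the limb-angle lookup, and the motion/sensor primitives on the hypothetical hand object (standing in for the JSON commands and capacitive sensor described earlier) are all assumptions.

import math

# Constants from the text (converted to meters); the coordinate axes and signs
# are assumptions, since the paper does not define the android's frame.
CHEEK_OFFSET = (-0.10, 0.0, 0.0)        # 10 cm to the left from the android's view
FACE_ANGLE = (0.0, 0.0, 50.0)           # thumb up, fingers 50 deg above the horizon
EXTRA_DISTANCE = 0.10                   # m of extra travel if no contact is sensed
MIDPOINT_MARGIN = (0.15, -0.06, 0.0)    # 15 cm right, 6 cm toward the android
HORIZONTAL_LIMIT_ANGLE = 55.0           # deg; flatter limbs are approached from above
FACE_SPEED, ARM_SPEED = 0.08, 0.10      # m/s

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def midpoint_of(a, b):
    return tuple((x + y) / 2.0 for x, y in zip(a, b))

def beyond(start, target, extra):
    """A point `extra` m past `target`, continuing along the start-to-target line."""
    d = math.dist(start, target) or 1.0
    return tuple(t + (t - s) * extra / d for s, t in zip(start, target))

def perform_touch(kind, target, person, android, hand):
    """One touch action following Algorithm 1. `hand` is a placeholder object
    wrapping the JSON motion commands and the capacitive touch sensor."""
    if kind == "face":
        speed = FACE_SPEED
        target = add(person["left_ear"], CHEEK_OFFSET)
        hand.rotate_to(FACE_ANGLE)
        start = hand.position()
    else:
        speed = ARM_SPEED
        midpoint = add(midpoint_of(android["right_shoulder"], target), MIDPOINT_MARGIN)
        hand.move_to(midpoint, speed)                 # arc-like approach via the midpoint
        if person["target_limb_angle_deg"] < HORIZONTAL_LIMIT_ANGLE:
            hand.rotate_roll_horizontal()             # approach from the top...
            hand.rotate_yaw_parallel_to(target)       # ...with fingers along the limb
        start = midpoint
    # Move to the target, then up to EXTRA_DISTANCE beyond it, stopping on contact.
    for goal in (target, beyond(start, target, EXTRA_DISTANCE)):
        hand.move_to(goal, speed)
        if hand.touch_detected():
            break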
The Experiment
A. Participants
The participants were chosen from other members of our lab for their expertise and familiarity with robots. We expected that their familiarity with robots would allow them to give specific, technical feedback about the system to help us improve it for future, science-focused studies, and that the reduced novelty of the robot would reduce some biases. There were 25 participants (8 female) with ages ranging from 19 to 47 and a median age of 30; their mean self-reported robot experience was 4.28 on a 7-point scale. Ten of the participants were Japanese, and the next most common nationalities were Chinese, Italian, and Mexican, with two participants each. Of the remaining participants, one was from Africa, two were from Asia, three were from Europe, one was from North America, and two were from South America.
The participants were invited individually and in person. Upon completing their trials, they were rewarded with small bags of cookies, which had not been announced ahead of the experiment.
B. Procedure
Participants were brought into a room with the robot seated and its hand in the ready position (right hand raised as if to wave as in Fig. 3(b)) and given a seat directly in front of the android. They were required to sit with their knees touching the android's knees so that the android could use the full range of motion of its arm to accommodate all touch target locations and orientations. The purpose of the study, the expected duration (15-20 minutes), and their roles were explained. Then the participants filled out a brief demographic survey and gave their consent to the pose and video recording. Next, the experimenter started the control program and calibrated the touch sensor.
Participants were asked to control the robot by selecting touch locations on a laptop placed on the android's lap. After 30 touches chosen at the participant's discretion, the participants were given the final paper survey. When it was completed, they were given the cookies and thanked for their time.
All the procedures were approved by the Advanced Telecommunications Research Institute International Review Board Ethics Committee (501-4).
C. Measurements
All trials were video recorded with a phone camera placed on a tripod at 1080p, 30 fps. Additionally, the experimenter's screen was recorded. The 3D pose keyframes that informed the touch control were saved for each participant. Logs were generated for each participant for cross referencing with the video and pose data.
Subjective data were collected in two surveys, the first digital and the second on paper. The first survey collected demographic information and written consent to the pose and video recording. The second survey, given at the end of the experiment, asked participants to score Naturalness, Accuracy, and Friendliness of the robot and its touches, as well as Fear about the robot and its motions, on a 1-7 scale. Naturalness and Accuracy were each measured with four questions, and Friendliness and Fear with a single question each. A blank section for written responses and feedback was provided at the bottom, and participants were strongly encouraged to fill it out. Please see Fig. 4 for the complete list of questions.
After the experiments had concluded, two human coders watched videos of all touch events alongside a human figure diagram showing where the participants had clicked when selecting where to be touched. The coders assigned text labels to the body locations that the participants requested to be touched and to the locations where the android actually touched. Additionally, for arm touches, they labeled the exact location touched on the participant by placing another marker on the diagram.
Results
A. Summary of Subjective Responses
The average numeric scores submitted by participants, on the 1-7 scale, are reported together with their standard deviations (SD) and Cronbach's α values for each subjective measure.
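For reference, Cronbach's α for a multi-item scale such as Naturalness or Accuracy can be computed as in the following sketch (our own minimal implementation of the standard formula, not the analysis script used in this study).

def cronbach_alpha(items):
    """items: list of k lists, each holding one question's scores per participant."""
    k = len(items)
    n = len(items[0])
    totals = [sum(q[i] for q in items) for i in range(n)]   # per-participant sums

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_var = sum(variance(q) for q in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))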
B. Labeled Body Region Accuracy
We collected 681 touch events over the course of the experiment. The discrepancy between the intended 750 events (25 participants x 30 touches each) and the number actually collected has two origins. First, for the first 10 participants, only events in which the touch sensor activated were logged; this overreliance on the touch sensor was corrected for the subsequent 15 participants. Second, one participant experienced two program crashes that interrupted the experiment, leading us to request two fewer touches than intended. The two coders then labeled only these collected touch events.
As there was high reliability between the two coders, the labels from one coder were picked at random to represent the results. The label discrepancies between the coders can be seen in Table II: the target region labels disagreed for 3.2% of touches and the actually-touched region labels for 8.2%. The corresponding Cronbach's α values confirmed this high inter-coder reliability.
The qualitative accuracy of touches was derived from the labels of the targeted and actually touched body locations. If the target and actual labels were the same, the touch was marked as “correct.” If the actual label was only one body part away from the target, it was marked “partial.” The order of body parts is as follows: head, face, neck, shoulder, upper arm, elbow, forearm, wrist, and hand. The two end locations, head and hand, could only receive partial touches if the face or wrist, respectively, was actually touched. For any other difference between the target and actual labels, the touch was marked as “incorrect.” For example, a touch targeting the face that was labeled as touching the shoulder would be marked “incorrect,” since it neither “correctly” touched the face nor “partly” correctly touched the head or neck.
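A minimal Python sketch of this labeling rule, for concreteness (the label strings and adjacency ordering simply follow the list above):

BODY_ORDER = ["head", "face", "neck", "shoulder", "upper arm",
              "elbow", "forearm", "wrist", "hand"]

def classify_touch(target_label, actual_label):
    diff = abs(BODY_ORDER.index(target_label) - BODY_ORDER.index(actual_label))
    if diff == 0:
        return "correct"
    if diff == 1:
        return "partial"
    return "incorrect"

# Example from the text: a touch that targeted the face but landed on the shoulder.
assert classify_touch("face", "shoulder") == "incorrect"
assert classify_touch("face", "neck") == "partial"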
The full results of this body-location accuracy, divided into body regions, are shown in Table III. Overall, the system touched participants incorrectly in only 3.8% of touches. Of special note are the face results, where 98% of touches were correct and none were incorrect; that is, no touch that targeted the face landed below the neck. The upper and lower arm regions saw similar results, with 3% and 8% incorrect touches, respectively.
C. Quantitative Accuracy
The quantitative accuracy came in the form of human labeled distances and automatically logged distances. These results are summarized in Table IV.
For the labeled results, the physical distance error between the target location that each participant selected and the mark labeled by the coders was determined using the participant's arm length, estimated by averaging 10 frames of the 3D keypoint data in which the arm was well detected. Comparing this method against tape measurements of two former participants' arms yielded a difference of approximately 1 cm between the two measuring methods. Only the errors of arm touches were considered, since arms can be approximated as straight lines. The mean labeled errors were 3.8 cm (SD = 4.5) for the upper arm and 6.3 cm (SD = 5.7) for the lower arm.
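The sketch below illustrates how such a labeled error could be computed, under the assumption that positions on the human-figure diagram are stored as fractions along the shoulder-to-wrist line; this parameterization and the keypoint names are ours, not necessarily those of the actual analysis.

import math

def arm_length_from_keypoints(frames):
    """Average shoulder-elbow-wrist length over ~10 well-detected 3D frames."""
    lengths = [math.dist(f["shoulder"], f["elbow"]) + math.dist(f["elbow"], f["wrist"])
               for f in frames]
    return sum(lengths) / len(lengths)

def labeled_error_cm(target_fraction, labeled_fraction, arm_length_m):
    """Distance between the selected target and the coder-labeled contact mark."""
    return abs(target_fraction - labeled_fraction) * arm_length_m * 100.0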
The logged error is the distance between the detected wrist keypoint of the android's right hand, offset by 10 cm toward the knuckles to remove the bias towards palm touches, and the target location in 3D space at the time of a touch event. A touch event was logged when the touch sensor activated, or when the stop button was pressed after the hand had passed the initial target location without the touch sensor activating. Eleven points with errors above 40 cm were removed as outliers, likely caused by the android's wrist location not being detected properly. The overall mean error was 10.2 cm (SD = 5.1) across all body regions; the upper arm, lower arm, and face touches had mean errors of 10.2 cm (SD = 4.3), 9.6 cm (SD = 6.6), and 11.0 cm (SD = 8.2), respectively. The sources of these larger-than-desired absolute errors are explained in the discussion section below.
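A sketch of this logged-error computation is shown below; the hand-direction vector used for the 10 cm knuckle offset, the data layout, and the outlier handling are simplified assumptions.

import math

KNUCKLE_OFFSET = 0.10   # m from the wrist keypoint toward the knuckles
OUTLIER_LIMIT = 0.40    # m; larger errors are attributed to wrist mis-detection

def logged_error(wrist_pos, hand_direction, target_pos):
    """Distance between the offset hand position and the fixed target at touch time."""
    knuckles = tuple(w + KNUCKLE_OFFSET * d for w, d in zip(wrist_pos, hand_direction))
    return math.dist(knuckles, target_pos)

def summarize(errors):
    """Mean and SD of the logged errors after dropping outliers above 40 cm."""
    kept = [e for e in errors if e <= OUTLIER_LIMIT]
    mean = sum(kept) / len(kept)
    sd = math.sqrt(sum((e - mean) ** 2 for e in kept) / len(kept))
    return mean, sd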
Discussion
A. RQ1: Accuracy
To answer RQ1, we look at both the qualitative and quantitative accuracy results. The labeled qualitative accuracy is promising: the system touched the correct or partly correct body location 96.2% of the time, with even better results for the face. The lower arm saw a larger proportion of partially correct and incorrect touches than the upper arm, likely because the shallow angle of the participants' limbs made the distinction between touches on the hand, wrist, or forearm ambiguous for coders. The participants' accuracy scores suggest that the touches were perceived as more accurate than inaccurate. Although we do not have other works to compare this score to, a trend in the accurate direction is a good starting point against which future works can be compared.
Quantitatively, we do have a reference point for comparison. By comparing our logged and labeled contact-to-target distances for arm touches with the results of two-point discrimination tests, we can assess whether our system touches accurately enough with respect to human perception. The mean error the coders found (4.9 cm) is comparable to the “touch resolution” of the human arm: in three studies on young adults' abilities to discern whether one or two points were being probed on their arms, the minimum distinguishable probe distances ranged from 2.2 cm to 4.5 cm depending on the part of the arm probed [36], [37], [38]. This means that some of our touch errors were smaller than these perceptible distances. It should be noted that the hand of the android is not a single point, so the labelers had to assume the perspective of the participant to select a contact point.
The automatically logged error was larger, at 10.2 cm on average. The discrepancy between the two measurement methods is likely because the logged error cannot account for the actual contact point, only the position of the hand with respect to the target location. For example, if the android touched the participant with its fingertips directly on the target, the logged error would be the distance from the knuckles at the center of the hand to the target; the log would record approximately 10-12 cm of error, while a coder assuming the role of the participant would record 0-1 cm. Since we are primarily concerned with the subjective evaluation of accuracy, the logged errors should be considered secondary to the labeled ones.
The answer to RQ1 is therefore that our accuracy, while slightly worse than the limits of human perception, is perceived as more accurate than inaccurate. We also found that our system can direct touches to the correct body part with a low failure rate. There is room to improve participant scores and to make measurements more precise without relying on human coders, but with this system evaluated, we now have a baseline against which to compare future iterations of touch control systems.
B. RQ2: Naturalness
The second goal of our study was to ensure that the functionality we developed would not compromise the naturalness of the perceived touch. If our adaptive system were less natural, it might not be considered viable for future scientific studies compared with more traditional, preprogrammed touch behaviors. The most similar study to ours that measured naturalness was [21]. Using ERICA [26], the female android on which our robot is based, the researchers studied how different touch behaviors and locations affected feelings of intimacy. That study had a gender-balanced group of 22 participants who sat beside the android with their hands on a mark on a table in front of them; the android then performed various preprogrammed touch behaviors on the participants.
One of the touch types in that study was the same as ours: a simple contact and retraction, described as a “touch” as opposed to a “pat,” “grip,” or “stroke.” Participants in that study rated the Naturalness of those touches as 2.96 for the hand and 3.00 for the forearm on a 1-7 scale. Our average Naturalness score was 3.86 on the same scale. Two factors might have confounded this result: participant selection and the difference in touch locations between the studies.
Although our participants had the potential to be biased in favor of our system, our analysis of the choice to use lab members, who are likely to be very experienced with robots, found a slightly negative correlation between Robot Experience and both Accuracy and Naturalness. This suggests that they gave slightly harsher feedback than an unbiased population might. Despite these negative correlations, the average scores given by those with Robot Experience of 5, 6, or 7 (40% of participants) were less than 0.7 points lower than the overall responses for both subjective categories.
The other consideration, touch location, may confound the results, since our study has one score spanning many touch locations while the reference study was limited to the hand and forearm and reported separate Naturalness scores for each. That being said, the locations we added beyond those of the past study are considered more intimate places to be touched (see again [11]), and our Naturalness scores split by location are nearly identical, so these factors are unlikely to have inflated our score.
Therefore, after considering the major confounding factors, and acknowledging that others likely remain unaccounted for, we cannot claim that our system is more natural than those in past studies. However, it is likely to be at least as natural, demonstrating that our adaptive system is suitable for future scientific studies.
C. Limitations
As also described in [39], OpenPose's robust ability to detect people allows it to detect them even when they are obscured. This prevented us from using our original control method, which continuously updated the target: as the hand neared the target, the measured depth at the target would correspond to the back of the android's own hand, prompting the hand to pull away and oscillate. For the final experiment, we instead fixed the target at the time of selection, which increased participants' comfort.
Another set of limitations came from the touch sensors. High sensitivity caused occasional early touch detections; to compensate, touch thresholds were set high, which sometimes caused uncomfortable touch pressure and removed the ability to know which sensor pad made initial contact. Further development of the sensor will improve the comfort and ease of future experiments and enable higher quality measurements.
Finally, neither our robot nor our participant pool was optimally gender balanced. Since we only used a male-appearing android, we do not yet know how the system would be received if it looked female. Additionally, although our participants' sexes were balanced enough for this system evaluation, a more even distribution would be desirable to illuminate subtler effects.
D. Future of This System
By considering the feedback given directly by participants, together with our accuracy results, we will be able to improve this system for future studies. Such studies could include changing the android's gender to understand how that affects impressions of the robot, and examining how the intimacy of a touched location, such as the face compared to the wrist, might shape the perceived relationship with the robot or its operator. These studies would help us understand whether a robot-to-human version of [11] or [14] would also show a preference for touch to and from females who are relationally close. There are also open questions about direct comparisons between touch from androids and other humanoid robots, mutual touch, or even robots touching other robots. Exploring these topics with unbiased and balanced participant groups and standard surveys will make the results of our future studies more broadly insightful and comparable for the HRI field at large.
Conclusion
We identified the lack of touch HRI studies that use an adaptive system for studying robot-to-human touch with a humanoid android, as well as the lack of research on face touch. To demonstrate the feasibility of such an adaptive system, and to inspire others to create similar systems, we developed and evaluated a system that allowed humans to direct an android to reach out and touch them along their left arms and faces. The robot was able to do this accurately and safely by observing humans' real positions using a color and depth sensing camera and capacitive touch sensors within its hand. Our goal of creating a system that could touch multiple parts of a person was achieved, as demonstrated by coders finding that 96.2% of touches landed at least partly where participants requested and that the mean accuracy of those touches was only 4 mm beyond the upper range of human two-point touch discrimination. Our second goal was also achieved, since we were able to show that adding adaptive touch does not reduce perceptions of naturalness and may even increase them. With this system developed and demonstrated to be satisfactory, we hope to have expanded expectations of what is possible in touch HRI so that we can all more quickly reach truly human levels of communication with robots.
ACKNOWLEDGMENT
The authors would like to thank the participants and the label coders for making this research possible with their precious time and feedback. We would also like to thank Dr. Dario Alfonso Cuello Mejia for his invaluable advice and support during the writing process and Sayuri Yamauchi for her help during the execution of our experiments.