Flexible Gesture Recognition for Immersive Virtual Environments

Matthias Deller, Achim Ebert, Michael Bender, and Hans Hagen
German Research Center for Artificial Intelligence, Kaiserslautern, Germany

Abstract

With powerful graphics hardware becoming affordable for everyone, there is an increasing tendency towards a new generation of user interfaces, with the focus shifting from traditional two-dimensional desktops to three-dimensional virtual environments. Therefore, there is a growing need for applicable immersive interaction metaphors to manipulate these environments. In this paper we propose a gesture recognition engine using an inexpensive data glove with integrated 6 DOF tracking. Despite noisy input data from the glove, we achieve reliable and flexible gesture recognition. New gestures can be trained easily, and existing gestures can be individually adapted for different users.

1 Introduction

In recent years, a new direction has been emerging in the development of computer applications. With powerful graphics hardware becoming available to everyone at reasonable prices, there is an increasing tendency to enhance traditional desktops and applications by making use of all three dimensions, thereby replacing the common desktop metaphor with a virtual environment. One of the main advantages of these virtual environments is described by the term immersion, meaning the lowering of barriers between human and computer. The user gets the impression of being part of the virtual scene and, ideally, is able to manipulate it as he would his real surroundings, without devoting conscious attention to the usage of an interface. One reason why
virtual environments are not yet as common as one would expect, given the advantages they offer, might be the lack of adequate interfaces for interacting with immersive environments, as well as of methods and paradigms for intuitively manipulating three-dimensional settings. More recently, some devices have been designed especially for controlling three-dimensional environments, but for the most part they are not very intuitive and in most cases still demand additional steering [1]. The most natural way for humans to manipulate their surroundings is, of course, by simply using their hands. Hands are used to grab and move objects, or to manipulate them in other ways. They are used to point at, indicate, or mark objects of interest. Finally, hands can be used to communicate with others and to state intentions by making postures or gestures. In most cases, this is done without conscious thought, and thus without interrupting other tasks the person may be involved with at the same time. Therefore, the most promising approach to minimizing the cognitive load required for learning and using a user interface in a virtual environment is to employ a gesture recognition engine that lets the user interact with the application in a natural way, simply by using his hands in ways he is already accustomed to.

2 Related Work

At the moment, research on gesture recognition is mainly focused on the visual capturing and interpretation of gestures. The user or his hands are captured by cameras so that the position of the user or the posture of the hands can be determined with appropriate methods. There are several different strategies for achieving this goal. In some cases, these techniques are non-invasive, so the user is not
required to wear any special equipment or clothing. Of course, this makes it hard to determine which parts of the picture belong to the background and which parts are the hands. Some approaches aim to solve this segmentation problem by imposing requirements on the user's surroundings, such as a special uniform background against which the user's hand can be distinguished [2][3]. Others do not need a specially prepared, but a static, background [4]. Another possibility is the use of the infrared spectrum to better distinguish the hand from its surroundings [5]. Newer approaches use a combination of these methods to enhance the segmentation process and find the user's fingers in front of varying backgrounds [6]. Still other authors aim to simplify the segmentation process by introducing restrictions, often by requiring the user to wear marked gloves [7][8], or by restricting the capturing process to a single, accordingly prepared setting [9]. Although promising, all of these approaches share the drawback that they place special demands on the environment in which they are used. They require uniform, steady lighting conditions and high contrast in the captured pictures, and they have difficulties when the user's motions are so fast that his hands appear blurred in the captured frames. Apart from that, these procedures demand a lot of computing power, as well as special and often costly hardware. In addition, the cameras capturing the user have to be firmly installed and adjusted, so these devices are bound to one place and the user has to stay within a predefined area to allow reliable gesture recognition. Often, a separate room has to be used to enable the recognition of the user's gestures. Another possibility for capturing gestures is the use of special interface devices called data gloves [10][11]. A drawback of professional data gloves, however, is that they are not per se equipped with positioning sensors. This limits the range of detectable gestures to static postures, unless further hardware is applied. The user has to wear additional gear to enable the determination of the position and orientation of his hand, often electromagnetic tracking devices like the Ascension Flock of Birds [12]. These devices allow a relatively exact determination of the hand's position as well as its orientation, if mounted in an appropriate location. The problem with electromagnetic tracking, however, is that it requires the user to wear at least
one extra sensor attached to the system by cable. Additionally, electromagnetic tracking devices have to be firmly installed and calibrated, and they are very prone to errors if there are metallic objects in the vicinity of the tracking system. So, although there are several promising approaches to using gestures to enhance interaction with (mobile) computers, these possibilities are not yet serviceable for real-time gesture interaction. They demand very specialized and therefore expensive hardware, require the user to wear special clothing or stand in front of a fixed background, and use a lot of computing power to determine the performed gestures. Furthermore, almost all of these techniques are restricted to one specially prepared setting, because the setup has to be installed in and calibrated to a designated surrounding. Thus, these solutions are not feasible for use in a normal working environment, especially if they are to be integrated into more complex applications to allow real-time interaction on the spot. Consequently, there is a need for a gesture recognition system that is flexible enough to be adapted to various conditions, such as alternating users, different hardware, or even transportable devices, yet fast and powerful enough to enable reliable recognition of a variety of gestures without hampering the performance of the actual application. Similar to the introduction of the mouse as an adequate interaction device for graphical user interfaces, gesture recognition interfaces should be easy to define and integrate, either for interaction in three-dimensional settings or as a means to interact with the computer in a more natural way, without having to use an abstract interface.

3 Applied hardware

The glove hardware we used to realize and test our gesture recognition engine was a P5 Glove from Essential Reality [13]. The P5 is a consumer data glove originally designed as a game controller. It features five bend sensors to track the flexion of the wearer's fingers, as well as an infrared-based optical tracking system that allows computation of the glove's position and orientation without the need for additional hardware. The system consists of a stationary base station housing the infrared receptors that enable the spatial tracking. The glove itself is
connected to the base station with a cable and consists of a plastic housing that is strapped to the back of the user's hand, with five bendable strips attached to his fingers to determine the bend of each individual finger. In addition, on top of the housing are four buttons that can be used to provide additional functionality. Position and orientation data are obtained with the help of reflectors mounted at prominent positions on the glove housing. Depending on how many of these reflectors are visible to the base station and at which positions they are registered, the glove's driver is able to calculate the orientation and position of the glove. Tracking is interrupted if the glove is aligned so that the back of the user's hand is angled away from the receptor and too many of the reflectors are concealed. Yet, since the P5 is intended to be used while sitting in front of a desktop computer, most gestures can be adequately recognized with this hardware.

Figure 1: Our demonstration setup: 2D display, stereoscopic display and data glove.

During our work with the P5, we learned that the calculated values for the flexion of the fingers were quite accurate, while the spatial tracking data was much less reliable. The position information was fairly dependable, whereas the orientation values of the glove were, depending on lighting conditions, sometimes very unstable. Because of this, additional filtering mechanisms had to be applied to obtain sufficiently reliable values. The low price of about 50 Euros was one reason we chose the P5 for our gesture recognition, because it shows that serviceable interaction hardware for virtual environments can be realized at a cost that makes it an option for the normal consumer market. The other reason for our choice was to show that our recognition engine is powerful and flexible enough to enable reliable gesture
recognition even when used with inexpensive gamer hardware.

4 Posture and gesture recognition

A major problem for the recognition of gestures, especially when using visual tracking, is the high amount of computational power required to determine the most likely match to the gesture carried out by the user. Especially when gesture recognition is to be integrated into running applications that at the same time have to render a virtual environment and manipulate it according to the recognized gestures, this is a task that cannot be accomplished on a single average consumer PC. We aim to achieve reliable real-time recognition that is capable of running on any fairly up-to-date workplace PC and can easily be integrated into normal applications. Like Bimber's fuzzy logic approach [14], we use a set of gestures that have been learned by performing them to determine the most likely match. However, unlike the aforementioned method, our system does not define gestures as motion over a certain period of time, but as a sequence of postures made at specific positions and with specific orientations of the user's hand. Thus, the relevant data for each posture is mainly given by the flexions of the individual fingers. For some postures, however, the orientation of the hand may be more or less significant. While some gestures mean the same regardless of the hand's orientation, for others the orientation data is much more relevant; for example, the meaning of a fist with an outstretched thumb can differ significantly depending on whether the thumb points upward or downward. Because of this, the postures for our recognition engine are composed of the flexion values of the fingers, the orientation data of the hand, and an additional value indicating the relevance of the orientation for the posture. As mentioned before, the required postures are taught to the system by simply performing them and then associating an identifier with the posture. This approach makes it extremely easy to teach the system new postures that may be required for specific applications. Alternatively, existing postures can be adapted for specific users. To do so, the posture in question is selected and performed several times by the user. The system
captures the different variations of the posture and determines the resulting averaged posture definition. In this manner, it is possible to create a flexible collection of different postures, termed a posture library, with little expenditure of time. This library can be saved and loaded in the form of a gesture definition file, making it possible for the same application to have different posture definitions for different users and allowing an on-line change of the user context.
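To make this representation concrete, the following minimal sketch stores a posture as finger flexion values, hand orientation, and an orientation relevance value, and trains it by averaging several performances. The data layout, value ranges, and names are our illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: field names, value ranges, and the simple averaging
# are assumptions based on the description above, not the actual engine's code.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Posture:
    name: str
    finger_flexion: List[float]    # five bend values, one per finger
    orientation: List[float]       # hand orientation, e.g. yaw, pitch, roll
    orientation_weight: float      # 0 = orientation irrelevant, 1 = fully relevant


def train_posture(name: str,
                  flexion_samples: Sequence[Sequence[float]],
                  orientation_samples: Sequence[Sequence[float]],
                  orientation_weight: float) -> Posture:
    """Average several performances of the same posture into one definition."""
    n = len(flexion_samples)
    flexion = [sum(s[i] for s in flexion_samples) / n for i in range(5)]
    orientation = [sum(s[i] for s in orientation_samples) / n for i in range(3)]
    return Posture(name, flexion, orientation, orientation_weight)
```

A posture library would then simply be a collection of such definitions that can be serialized to and loaded from a gesture definition file.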

4.1 Recognition process

Our recognition engine is subdivided into two components: the data acquisition and the gesture manager. The data acquisition runs as a separate thread and constantly checks the data received from the glove for possible matches in the gesture manager. As mentioned before, position and especially orientation data received from the P5 can be very noisy, so they have to be appropriately filtered and smoothed to enable sufficiently reliable matching against the known postures. First, the tracking data is piped through a deadband filter to reduce the chance of jumping error values in the tracked data. In addition, changes in the position or orientation data that exceed a given limit are discarded as improbable and replaced with their previous values. The resulting data is then smoothed by a dynamically adjusting average filter, and is accurate enough to provide a good basis for the matching process of the gesture manager. If the gesture manager finds a likely match to the provided data in its posture library, this posture is marked as a candidate. To lower the possibility of misrecognition and false positives, a posture is only accepted as recognized when held for an adjustable minimum time span. Our tests showed that values between 300 and 600 milliseconds allow reliable recognition without forcing the user to hold the posture for too long. Once a posture is recognized, a PostureChanged event is sent to the application that started the acquisition thread. To enable the application to use the recognized posture for further processing, the identifier of the posture as well as the identifier of the previous posture is provided, facilitating the sequencing of postures into a more complex gesture.
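As a rough illustration of this filtering stage, the sketch below combines a deadband filter, rejection of implausible jumps, and a moving-average smoother. The thresholds, the fixed window size, and the class name are illustrative assumptions; the engine's actual average filter adjusts dynamically.

```python
# Hedged sketch of the filtering described above; thresholds and the fixed
# averaging window are assumptions, not the parameters actually used.
from collections import deque


class TrackingFilter:
    def __init__(self, deadband=0.5, max_jump=20.0, window=8):
        self.deadband = deadband        # ignore changes smaller than this (jitter)
        self.max_jump = max_jump        # discard larger changes as improbable errors
        self.history = deque(maxlen=window)
        self.last = None

    def update(self, value: float) -> float:
        if self.last is not None:
            delta = abs(value - self.last)
            if delta < self.deadband or delta > self.max_jump:
                value = self.last       # keep the previous value instead
        self.last = value
        self.history.append(value)
        return sum(self.history) / len(self.history)   # smoothed output
```

One such filter would be applied per tracked coordinate before the values are handed to the gesture manager.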

In addition to the posture identifiers, the position and orientation of the glove are provided. The acquisition thread also keeps track of the glove's movement. If the changes in the position or orientation data of the glove exceed an adjustable threshold, a GloveMove event is fired. This event is similar to common MouseMove events, providing both the start and end values of the position and orientation data of the movement. Finally, to take into account hardware that possesses additional buttons, like the P5, the data acquisition thread also monitors the state of these buttons and generates corresponding ButtonPressed and ButtonReleased events, providing the number of the button. It is important to note that although the data acquisition we implemented was fitted to the Essential Reality P5, it can easily be adapted to any other data glove, either for mere posture recognition or in combination with an additional 6 degrees of freedom tracking device such as the Ascension Flock of Birds [12] to achieve full gestural interaction. To test this, we adapted our gesture recognition to a professional data glove from Fifth Dimension Technologies [15], although without any tracking, so only static postures were supported. Nevertheless, the recognition of these postures was fast and, thanks to the more sophisticated sensors of the 5DT product, very reliable.
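The following sketch illustrates how the acquisition thread could turn matched frames into these events: a posture candidate is only reported as a PostureChanged event after being held for the minimum time, and movements beyond the threshold fire a GloveMove event. Only the event names come from the paper; the callback mechanism, timing source, and parameter values are our assumptions.

```python
# Illustrative event generation for the acquisition thread; everything except
# the event names (PostureChanged, GloveMove) is an assumption for this sketch.
import time


class AcquisitionLoop:
    def __init__(self, hold_time=0.4, move_threshold=5.0, on_event=print):
        self.hold_time = hold_time            # 300-600 ms worked well in the tests
        self.move_threshold = move_threshold
        self.on_event = on_event
        self.candidate = None                 # posture currently being held
        self.candidate_since = 0.0
        self.current = None                   # last confirmed posture
        self.last_position = None

    def process_frame(self, matched_posture, position, orientation):
        now = time.monotonic()
        # Confirm a posture only after it has been held long enough.
        if matched_posture != self.candidate:
            self.candidate, self.candidate_since = matched_posture, now
        elif (matched_posture is not None and matched_posture != self.current
              and now - self.candidate_since >= self.hold_time):
            self.on_event(("PostureChanged", self.current, matched_posture,
                           position, orientation))
            self.current = matched_posture
        # Fire a GloveMove event when the hand has moved far enough.
        if self.last_position is not None and any(
                abs(a - b) > self.move_threshold
                for a, b in zip(position, self.last_position)):
            self.on_event(("GloveMove", self.last_position, position))
        self.last_position = position
```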

4.2 The Gesture Manager

The gesture manager is the principal part of the recognition engine, maintaining the list of known postures and providing multiple functions to manage the posture library. As soon as the first posture is added to the library or an existing library is loaded, the gesture manager begins matching the data received from the data acquisition thread against the stored datasets. This is done by first looking for the best-matching finger constellation: the bend values of the fingers are interpreted as five-dimensional vectors, and for each posture definition the distance to the current data is calculated. If this distance is not within an adjustable recognition threshold, the posture is discarded. If a posture matches the data to a relevant degree, the orientation data is compared to the current values in the same manner. Depending on whether this distance exceeds another
adjustable limit, the likelihood of a match is lowered or raised according to the orientation quota associated with the corresponding posture dataset. The gesture manager also provides several means to adjust parameters at run time: the recognition sensitivity can be changed, new postures can be added, existing ones can be adapted, and new posture libraries can be loaded.
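A condensed sketch of this two-stage matching might look as follows, reusing the hypothetical Posture structure from the earlier sketch. The threshold values and the exact way the orientation quota raises or lowers the score are assumptions, not the engine's actual parameters.

```python
# Sketch of the two-stage matching: finger flexion is compared as a 5D vector
# first, then the orientation distance adjusts the score via the posture's
# orientation relevance. Thresholds and scoring details are assumptions.
import math


def match_posture(flexion, orientation, library,
                  flex_threshold=0.3, orient_threshold=45.0):
    best, best_score = None, float("inf")
    for posture in library:
        flex_dist = math.dist(flexion, posture.finger_flexion)
        if flex_dist > flex_threshold:
            continue                          # finger constellation too different
        orient_dist = math.dist(orientation, posture.orientation)
        score = flex_dist
        if orient_dist > orient_threshold:
            # Orientation does not match: penalize in proportion to its relevance.
            score += posture.orientation_weight * (orient_dist / orient_threshold)
        else:
            # Orientation matches as well: strengthen the candidate.
            score -= posture.orientation_weight * 0.1
        if score < best_score:
            best, best_score = posture, score
    return best                               # None if no posture is close enough
```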

4.3 Recognition of gestures

As mentioned before, we see gestures as a sequence of successive postures. With the help of the PostureChanged events, our recognition engine provides an extremely flexible way to track gestures performed by the user. The recognition of single postures, such as letters of the American Sign Language, is as easily possible as the recognition of more complex, dynamic gestures. This can be done by tracking the sequence of performed postures. For example, consider the detection of a "click" gesture. Tests with different users indicated that an intuitive gesture for this task is pointing at the object and then "tapping" at it with the index finger. To detect this gesture, one would define a pointing posture with outstretched index finger and thumb and the other fingers flexed, and a tapping posture with a half-bent index finger. All that remains to do in the application is to check for two consecutive PostureChanged events indicating a change from the pointing to the tapping posture, then back to pointing. In this manner, almost any desired gesture can quickly be implemented and recognized.
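As a concrete illustration, the following sketch assembles the "click" gesture from PostureChanged events; the posture names and the detector class are illustrative assumptions rather than part of the engine.

```python
# Illustrative detector for the "click" gesture (point, tap, point) built from
# PostureChanged events; posture names and class structure are assumptions.
class ClickDetector:
    SEQUENCE = ["point", "tap", "point"]

    def __init__(self, on_click):
        self.on_click = on_click
        self.progress = 0                 # how many steps of the sequence are done

    def on_posture_changed(self, old_posture, new_posture):
        if new_posture == self.SEQUENCE[self.progress]:
            self.progress += 1
            if self.progress == len(self.SEQUENCE):
                self.on_click()
                self.progress = 1         # the final "point" can start a new click
        else:
            # Restart; a stray "point" still counts as the first step.
            self.progress = 1 if new_posture == self.SEQUENCE[0] else 0
```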

5 Implementation and results

We evaluated our gesture recognition engine by enhancing a demo application representing a virtual document space with gesture interaction. In the implemented virtual environment, the user can manipulate various objects representing documents and trigger specific actions by performing a corresponding gesture. In order to enhance the degree of immersion, we used the demonstration setup shown in Figure 1. To give the user a stereoscopic view of the scene, we used a special 3D display device, the SeeReal C-I [16]. To compensate for the loss in resolution on the stereoscopic monitor, we used an additional TFT display to also show a
higher resolution view of the scene. A testament to the speed of our recognition engine is the fact that we were able to run the application logic, including the rendering of three different perspectives (one for each eye and one for the non-stereoscopic display) and the tracking and recognition of gestures, on a normal consumer-grade computer in real time. Our demo scene, shown in Figure 3, consists of a virtual desk on which different documents are arranged randomly. In the background of the scene, a wall containing a pin board and a calendar can be seen. Additionally, the user's hand is represented by a hand avatar, showing its location in the scene as well as the hand's orientation and the flexion of the fingers.

Figure 3: Our demonstration application: a virtual desktop with several gestural interaction possibilities.

The user was given multiple means to interact with this environment. First, he could rearrange the documents on the table by simply moving his hand avatar over a chosen document and grabbing it by making a fist. He could then move the selected document around and drop it in the desired location by opening his fist, releasing his grip on the document. Another possibility was to have a closer look at the calendar or the pin board by moving his hand in front of the object and pointing at it. Additionally, there were several possibilities to interact with specific documents. To select a document, the user had to move his hand over it and then tap on it in the way described earlier. Once a document was selected, it moved to the front of the scene, allowing a closer look at the cover page. The user then had the choice of putting the document back in its location by performing a dropping gesture, closing and then opening his hand, or of opening the document. To open it, he had to "grab" it in the same way
(by making a fist), then turn his hand around and open it, spreading his fingers with his palm facing upward. The user was then able to browse through the document by making a pointing posture and tilting his hand to the right or left to browse forward or backward. We had several users test the demonstration environment, moving documents and browsing through them. Apart from initial difficulties due to unfamiliarity with the glove hardware, after a short while most users were able to use the different gestures in a natural way, with only a few adaptations of the posture definitions to individual users.
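These interactions can be expressed compactly as posture transitions. The sketch below shows one hypothetical way to map such transitions onto document actions in a demo of this kind; all posture and action names are ours for illustration and do not come from the paper's implementation.

```python
# Hypothetical mapping of posture transitions to the demo's document actions;
# every name here is illustrative, not taken from the actual application.
DOCUMENT_ACTIONS = {
    ("open_hand", "fist"): "grab_document",         # make a fist over a document
    ("fist", "open_hand"): "drop_document",         # open the fist to release it
    ("point", "tap"): "select_document",            # tap with the index finger
    ("fist", "open_palm_up"): "open_document",      # turn the hand over and open it
    ("point", "point_tilt_right"): "next_page",     # tilt the pointing hand right
    ("point", "point_tilt_left"): "previous_page",  # tilt the pointing hand left
}


def handle_posture_changed(old_posture, new_posture, scene):
    action = DOCUMENT_ACTIONS.get((old_posture, new_posture))
    if action is not None:
        scene.perform(action)    # dispatch the action to the application logic
```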

6 Future work

One of our next endeavours will be to integrate artificial intelligence methods to allow automatic adaptation of the generic posture libraries to individual users, enabling a smoother recognition of their gestures while they interact with the demonstration environment. Furthermore, we plan to evaluate our gesture recognition engine with different types of hardware, for instance a professional data glove with additional finger sensors combined with position and orientation data acquired from an electromagnetic tracking system. Concerning the system itself, we plan to add the possibility of using individual recognition boundaries for each posture definition, as well as an automatic adjustment of these boundaries during the training of a posture, depending on the accuracy with which the posture is repeated.

7 Conclusions

In this paper we presented our prototype of a flexible and powerful gesture recognition engine, allowing gesture interaction for a variety of possible hardware devices and combinations thereof. Gestures can be defined rapidly and easily as sequences of successive postures. These postures are taught to the system by simply performing them while wearing the designated glove hardware. Our engine can easily be integrated into any desired application and is capable of providing a fast and reliable gesture recognition interface on standard consumer computers, with the possibility of an on-line change of user contexts and gesture collections.

References

[1] CIGER J., GUTIERREZ M., VEXO F., THALMANN D.: The Magic Wand, Proceedings of the 19th Spring Conference on Computer Graphics, 2003.
[2] QUEK F., MYSLIWIEC T., ZHAO M.: Finger mouse: A freehand pointing interface, Proceedings of the International Conference on Automatic Face and Gesture Recognition, Zürich, 1995.
[3] LIEN C., HUANG C.: Model-Based Articulated Hand Motion Tracking for Gesture Recognition, Image and Vision Computing, vol. 16, February 1998.
[4] APPENZELLER G., LEE J., HASHIMOTO H.: Building topological maps by looking at people: An example of cooperation between intelligent spaces and robots, Proceedings of the IEEE-RSJ International Conference on Intelligent Robots and Systems, 1997.
[5] REHG J., KANADE T.: DigitEyes: Vision-based human hand tracking, Technical Report CMU-CS-93-220, School of Computer Science, Carnegie Mellon University, 1993.
[6] VON HARDENBERG C., BÉRARD F.: Bare-Hand Human-Computer Interaction, Proceedings of the ACM Workshop on Perceptive User Interfaces, Orlando, 2001.
[7] STARNER T., WEAVER J., PENTLAND A.: A wearable computer based American Sign Language recognizer, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[8] HIENZ H., GROEBEL K., OFFNER G.: Real-time hand-arm motion analysis using a single video camera, Proceedings of the International Conference on Automatic Face and Gesture Recognition, Killington, 1996.
[9] CROWLEY J., BÉRARD F., COUTAZ J.: Finger tracking as an input device for augmented reality, Proceedings of the International Conference on Automatic Face and Gesture Recognition, Zürich, 1995.
[10] TAKAHASHI T., KISHINO F.: Hand gesture coding based on experiments using a hand gesture interface device, ACM SIGCHI Bulletin, April 1991.
[11] HUANG T.S., PAVLOVIC V.I.: Hand Gesture Modeling, Analysis, and Synthesis, Proceedings of the International Conference on Automatic Face and Gesture Recognition, Zürich, 1995.
[12] ASCENSION PRODUCTS – FLOCK OF BIRDS, URL: http://www.ascension-tech.com/products/flockofbirds.php
[13] THE P5 GLOVE HOMEPAGE, URL: http://www.videogamealliance.com/VGA/video_game/P5.php
[14] BIMBER O.: Continuous 6DOF Gesture Recognition: A Fuzzy-Logic Approach, Proceedings of the 7th International Conference in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (WSCG'99), 1999.
[15] FIFTH DIMENSION TECHNOLOGIES HOMEPAGE, URL: http://www.5dt.com/index.html
[16] SEEREAL TECHNOLOGIES HOMEPAGE, URL: http://www.seereal.de