Monday, September 1, 2008

"Specifying Gestures by Example"

by Dean Rubine

Summary

The author first lists some examples of existing gesture-based applications; in these applications, the gesture recognizer is hand-coded, which is a complicated job. GRANDMA (Gesture Recognizers Automated in a Novel Direct Manipulation Architecture), by contrast, enables an implementor to create gestural interfaces with simple click-and-drag. GDP is such an application, built using GRANDMA.

GDP is then shown as an example gesture-based application. It recognizes gestures that create rectangles, ellipses, lines, and packs, as well as gestures that copy, rotate-scale, and delete objects. The end of a gesture is detected either when the mouse button is released or when the mouse stops moving for a given amount of time (GDP uses the latter). Gesturing and direct manipulation are combined in a two-phase interaction technique: (1) gesture collection, and (2) gesture classification and manipulation. Due to limitations of GRANDMA, all gestures in GDP are single strokes, which avoids the segmentation problem and allows shorter timeouts.
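To make the two-phase interaction concrete, here is a minimal event-loop sketch in Python. This is not GDP's actual code: the event interface, the callback names, and the timeout value are all my own assumptions. Gesture collection ends on a timeout or a button release, after which the recognized class drives manipulation until release.

import time

TIMEOUT = 0.2  # seconds of mouse inactivity that ends collection (an assumed value)

def run_two_phase(get_mouse_event, classify, manipulate, done):
    # Phase 1: collect the stroke until the mouse stops moving or the button is released.
    points = []
    released = False
    last_move = time.time()
    while True:
        event = get_mouse_event()            # hypothetical: returns an event or None
        if event is None:
            if time.time() - last_move > TIMEOUT:
                break                        # timeout ends collection (GDP's choice)
            continue
        if event.kind == "up":
            released = True                  # releasing the button also ends collection
            break
        points.append((event.x, event.y))
        last_move = time.time()

    # Phase 2: classify the stroke, then feed further mouse points to manipulation.
    gesture_class = classify(points)         # the "recog" semantics run here
    while not released:
        event = get_mouse_event()
        if event is None:
            continue
        if event.kind == "up":
            released = True
        else:
            manipulate(gesture_class, (event.x, event.y))   # the "manip" semantics
    done(gesture_class)                      # the "done" semantics run on release
    return gesture_class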

GRANDMA is an MVC system that associates input event handlers with view classes. In GDP, GdpTopView is the window in which GDP runs, and GraphicObjectView is associated with the line, rectangle, ellipse, and text gesture handlers. Gestures are added through the following steps: (1) create a new gesture handler (the "new class" button) and associate it with GraphicObjectView; (2) enter training examples for the gesture (15 examples per gesture is adequate); (3) edit the semantics of each gesture in the handler's set -- "recog" is evaluated when the gesture is recognized, "manip" is evaluated on subsequent mouse points, and "done" is evaluated when the mouse button is released. Note that attributes of the gesture may be used in the gesture semantics (e.g., the start and end points of a line).
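As a rough illustration of what a gesture handler carries, here is a small Python sketch. GRANDMA configures all of this interactively with click-and-drag, and every name below (GestureHandler, add_gesture_class, the lambdas) is invented for illustration rather than taken from GRANDMA's actual API.

class GestureHandler:
    """Holds gesture classes, their training examples, and their three-part semantics."""

    def __init__(self):
        self.classes = {}  # gesture name -> {examples, recog, manip, done}

    def add_gesture_class(self, name, examples, recog, manip, done):
        self.classes[name] = {
            "examples": examples,  # roughly 15 training strokes per gesture is adequate
            "recog": recog,        # evaluated once, when the gesture is recognized
            "manip": manip,        # evaluated on each subsequent mouse point
            "done": done,          # evaluated when the mouse button is released
        }

# Example: a "line" gesture whose semantics use gesture attributes
# (here, the start point of the stroke and the current mouse point).
handler = GestureHandler()
handler.add_gesture_class(
    "line",
    examples=[],  # training strokes would be entered interactively
    recog=lambda start, end: print("create line from", start, "to", end),
    manip=lambda start, current: print("rubber-band line endpoint to", current),
    done=lambda start, end: print("finalize line at", end),
)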

Next, the author discusses how gestures are classified. Preprocessing eliminates jiggle: an input point within 3 pixels of the previous input point is discarded. Geometric and algebraic features are carefully chosen to be: (1) incrementally computable in constant time per input point; (2) meaningful, so that they can be used in gesture semantics as well as for recognition; and (3) numerous enough to differentiate all gestures that might reasonably be expected (GDP computes 13 features). Classification evaluates a linear function V_c for each class c over the features of the input gesture; the gesture is assigned to the class c that maximizes V_c. The weights in each class's linear evaluation function are determined from the training examples of that class. In addition, the probability that a gesture is classified correctly and its deviation from the mean of the chosen class (the Mahalanobis distance) are computed, so that ambiguous gestures and outliers can be rejected; rejection should be disabled if the application provides "undo". Overall, the classification method achieves a high correct-classification rate and fast computation, given around 15 training examples per gesture.
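The classification rule itself is compact enough to sketch in a few lines of Python. This is only an illustration of the linear evaluation and the Mahalanobis-distance rejection described above; the weight-training step (which uses the covariance matrix of the training examples) is omitted, and all function names are my own.

def classify(features, weights):
    # features: [f_1, ..., f_F] computed from the input gesture (13 features in GDP).
    # weights:  dict mapping class name c -> [w_c0, w_c1, ..., w_cF], learned from examples.
    # Returns the class c maximizing V_c = w_c0 + sum_i w_ci * f_i, plus all scores.
    scores = {}
    for name, w in weights.items():
        scores[name] = w[0] + sum(wi * fi for wi, fi in zip(w[1:], features))
    best = max(scores, key=scores.get)
    return best, scores

def mahalanobis_squared(features, class_mean, inv_covariance):
    # Squared Mahalanobis distance (f - mean)^T * inv_cov * (f - mean); a gesture far
    # from the mean of its chosen class can be rejected as an outlier.
    d = [f - m for f, m in zip(features, class_mean)]
    n = len(d)
    return sum(d[i] * inv_covariance[i][j] * d[j] for i in range(n) for j in range(n))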

Finally, some extensions are introduced: versions of GDP that use eager recognition (classifying a gesture as soon as enough of it has been seen to do so unambiguously) and multi-finger gesture recognition (combining the classification of individual paths with a decision tree and a set of global features).


Discussion

GRANDMA definitely simplifies the work involved in building gesture-based applications. The MVC architecture enables a straightforward mapping between gesture handlers and views, though I don't see the use of "done" in the gesture semantics, since gestures end when the user stops moving the mouse for a while.

Training by example is a great idea; however, the author doesn't give much information on how the set of training examples should be composed. For example, if all of the training examples are gathered from 5-year-old kids, the features of a class may be biased and may not classify input from typical users well.

I still need time to figure out the meaning of the covariance matrix. In the meantime, I came up with another idea for gesture classification, which I haven't tested: compute the mean feature values of the training examples for a given class (perhaps normalized by, say, dividing by their standard deviations). A vector of N mean feature values (assuming N features are chosen) is then a point in N-dimensional feature space, so M classes can be represented as M points in that space, one point per class. A given input gesture can likewise be represented as a point in N-dimensional space based on its feature values; the distances D_i (1 <= i <= M) from that point to the M class points are computed, and the input gesture is assigned to the class i that minimizes D_i.
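For what it's worth, here is a rough (and, as said, untested) Python sketch of that nearest-mean idea; the normalization step is left out for brevity, and the function names are just placeholders.

import math

def train_means(examples_by_class):
    # examples_by_class: dict mapping class name -> list of N-dimensional feature vectors.
    means = {}
    for name, vectors in examples_by_class.items():
        n_features = len(vectors[0])
        means[name] = [sum(v[i] for v in vectors) / len(vectors) for i in range(n_features)]
    return means

def classify_nearest_mean(features, means):
    # Assign the input gesture to the class whose mean point is closest in feature space.
    def distance(mean):
        return math.sqrt(sum((f - m) ** 2 for f, m in zip(features, mean)))
    return min(means, key=lambda name: distance(means[name]))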
