Wednesday, October 15, 2008

"Distinguishing Text from Graphics in On-line Handwritten Ink"

by Christopher M. Bishop, Markus Svensen, Geoffrey E. Hinton

Summary

This paper introduces an approach to classifying strokes in on-line handwritten ink as text or graphics (non-text). It combines stroke-based features with temporal context when computing class probabilities.

For each stroke, 11 features are extracted. A total least squares (TLS) model is fitted to the stroke; the stroke is then divided into fragments at points of local curvature maxima, and TLS is applied again to the largest resulting fragment. The features include:

1) stroke arc length
2) total absolute curvature
3) main direction given by the TLS fit
4) length-to-width ratio of the TLS fit
5) total number of fragments found in the stroke
6) arc length of the largest fragment
7) total absolute curvature of the largest fragment
8) main direction of the largest fragment
9) length of the long side of the bounding rectangle of the largest fragment

Features 6-9 rest on the assumption that a long largest fragment (which may be the entire stroke) with a high TLS length-to-width ratio indicates a graphics stroke. A multilayer perceptron (MLP) is trained on these features to produce the posterior probability p(tn|xn). Since the class distribution is biased towards text, the objective function used for fitting the model is adjusted to compensate.
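The TLS fit amounts to a principal direction of the centered stroke points, which an SVD gives directly. The sketch below (a hypothetical helper, not the paper's code) computes four of the listed per-stroke features under the assumption that a stroke is an (N, 2) array of pen samples:

```python
import numpy as np

def tls_features(points):
    """Sketch: arc length, total absolute curvature, TLS main direction,
    and TLS length-to-width ratio for one stroke (an (N, 2) array)."""
    diffs = np.diff(points, axis=0)
    seg_len = np.linalg.norm(diffs, axis=1)
    arc_length = seg_len.sum()
    # total absolute curvature: summed turning angles between segments
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])
    total_abs_curvature = np.abs(np.diff(np.unwrap(angles))).sum()
    # total least squares line: SVD of the centered point cloud
    centered = points - points.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    main_direction = np.arctan2(vt[0, 1], vt[0, 0])
    length_width_ratio = s[0] / max(s[1], 1e-9)  # guard near-degenerate fits
    return arc_length, total_abs_curvature, main_direction, length_width_ratio
```

A nearly straight stroke yields a near-zero curvature and a very large length-to-width ratio, which is exactly the signature features 6-9 look for in the largest fragment.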

Next, temporal context is used to improve performance. The identities of successive strokes tend to be correlated (a text stroke is typically followed by another text stroke, and a graphics stroke by another graphics stroke). The prior p(t1=1) = 0.5467 and a hidden Markov model (uni-partite HMM) with transition probabilities p(tn|tn-1) are estimated from the training data. The discriminative model's p(tn|xn) is then combined with p(tn|tn-1).
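One standard way to combine a discriminative posterior with HMM transitions is the scaled-likelihood forward recursion, where p(x|t) is approximated as proportional to p(t|x)/p(t). The exact combination rule below is my assumption for illustration, not necessarily the paper's:

```python
import numpy as np

def fuse_with_hmm(posteriors, class_prior, trans):
    """Sketch of fusing MLP outputs with a first-order transition model.
    posteriors: (N, 2) array of p(t_n | x_n) from the MLP
    class_prior: (2,) marginal class probabilities p(t)
    trans: (2, 2) with trans[i, j] = p(t_n = j | t_{n-1} = i)
    Returns filtered probabilities p(t_n | x_1..x_n) via the forward pass."""
    lik = posteriors / class_prior        # scaled likelihoods p(x|t) up to a constant
    alpha = lik[0] * class_prior          # first stroke: just the MLP posterior
    alpha = alpha / alpha.sum()
    out = [alpha]
    for n in range(1, len(lik)):
        alpha = lik[n] * (alpha @ trans)  # predict with transitions, update with MLP
        alpha = alpha / alpha.sum()
        out.append(alpha)
    return np.array(out)
```

With uniform transitions this reduces to the MLP posteriors alone; with "sticky" transitions an ambiguous stroke inherits the label of a confident predecessor, which is the correlation the HMM exploits.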

Finally, gap information is incorporated: the gap between two text strokes has different characteristics from the gap between a text stroke and a graphics stroke. The gap features are: 1) the logarithm of the difference of the pen-down times of the surrounding strokes, 2) the x- and y-differences of the pen-down locations of the surrounding strokes, and 3) the x- and y-differences between the pen-up location of the preceding stroke and the pen-down location of the following stroke. These allow the transitions (tn = tn+1) and (tn != tn+1) to be modelled (bi-partite HMM).
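The gap features are simple differences of pen events; a minimal sketch, assuming each stroke is a dict with hypothetical keys 'down_t', 'down_xy', and 'up_xy' (these names are mine, not the paper's):

```python
import math

def gap_features(prev_stroke, next_stroke):
    """Sketch of the five gap features between two consecutive strokes."""
    # 1) log of the pen-down time difference of the surrounding strokes
    log_dt = math.log(next_stroke['down_t'] - prev_stroke['down_t'])
    # 2) x/y differences of the two pen-down locations
    dx_down = next_stroke['down_xy'][0] - prev_stroke['down_xy'][0]
    dy_down = next_stroke['down_xy'][1] - prev_stroke['down_xy'][1]
    # 3) x/y differences of previous pen-up vs. next pen-down
    dx_gap = next_stroke['down_xy'][0] - prev_stroke['up_xy'][0]
    dy_gap = next_stroke['down_xy'][1] - prev_stroke['up_xy'][1]
    return [log_dt, dx_down, dy_down, dx_gap, dy_gap]
```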

Accuracy was 91.49% on the Cambridge data and 96.71% on the Redmond data (22.64% of graphics strokes were still misclassified as text strokes).


Discussion

This is a text vs. shape paper: it classifies strokes as text or graphics based on features of the strokes as well as on their context. The results show that the rate of misclassifying graphics strokes as text strokes is relatively high, and the HMM may not be helpful in sketches with frequent text-graphics switches.

1 comment:

Daniel said...

Yeah, even though they adjusted for the bias of text over graphics, it still seems that a few text-recognition features need to be brought in at this stage instead of just trying to guess the user's intent from timing and gaps.