by Ming Ye, Herry Sutanto, Sashi Raghupathy, Chengyang Li and Michael Shilman
Summary
This paper presents an optimization method for text line grouping.
Line grouping problem: find the partitioning of the stroke set that optimizes the likelihood of the resulting lines and the consistency and simplicity of their configuration.
Three features measure the likelihood of a line l: the linear regression error eLR(l), and the maximum inter-stroke distances Dxmax(l) and Dymax(l), measured along the fitting line and along its orthogonal direction respectively.
Configuration consistency: use a neighborhood graph; a pair of lines are neighbors if the minimum distance between their fitting line segments falls below a threshold and no other objects (lines or drawings) lie between them. Each line's configuration consistency θ(l) is computed as the neighbor-length-weighted sum of orientation angle differences; the weights encourage lines to be consistent with longer (usually more reliable) neighbors.
Model complexity: measured by number of lines (π).
Cost function: a weighted sum of eLR(l), Dxmax(l), Dymax(l), θ(l), and π. The optimal line grouping is the stroke partitioning that minimizes this cost.
Gradient-descent optimization: starting from an initial solution given by temporal grouping (a partitioning exploiting the fact that most text lines are composed of temporally adjacent strokes), iteratively generate local alternative hypotheses and accept the one yielding the largest global cost decrease. Two types of alternative hypotheses address the two major error categories of temporal grouping: merging a pair of line segments, and correcting high-configuration-energy errors. A merging hypothesis is generated for each neighboring pair. High-configuration-energy errors are caused by temporally adjacent strokes belonging to different lines; any line segment whose maximum angle difference from its neighbors exceeds a threshold is an initial candidate. The algorithm iteratively updates the line configuration and the hypothesis queue, taking the hypothesis that yields the largest global cost reduction at each step, until the queue is empty. Incremental mode (processing strokes as they arrive in an online sketch) yields better results than batch mode (processing a full page at once).
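To make the optimization concrete, here is a minimal Python sketch of the cost and the greedy merge loop. The weights, the TLS fit standing in for the regression, the centroid-gap proxies for Dxmax/Dymax, and the brute-force pair enumeration (the paper restricts merges to neighboring pairs and also generates correction hypotheses) are my assumptions, not the authors' code:

```python
import numpy as np

# Hypothetical weights for the cost terms; the paper tunes these, and it
# also includes the configuration-consistency term θ(l), omitted here.
W_FIT, W_DX, W_DY, W_COUNT = 1.0, 0.5, 2.0, 5.0

def line_cost(strokes):
    """Cost of one candidate line (a list of (n_i, 2) point arrays):
    TLS fitting error plus rough proxies for the maximum inter-stroke
    distances along (Dxmax) and across (Dymax) the fitted direction."""
    pts = np.vstack(strokes)
    mean = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
    along, across = vt[0], vt[1]
    e_lr = np.abs((pts - mean) @ across).mean()           # ~eLR(l)
    cx = np.sort([(s.mean(axis=0) - mean) @ along for s in strokes])
    cy = np.sort([(s.mean(axis=0) - mean) @ across for s in strokes])
    d_x = np.diff(cx).max() if len(strokes) > 1 else 0.0  # ~Dxmax(l)
    d_y = np.diff(cy).max() if len(strokes) > 1 else 0.0  # ~Dymax(l)
    return W_FIT * e_lr + W_DX * d_x + W_DY * d_y

def total_cost(lines):
    return sum(line_cost(l) for l in lines) + W_COUNT * len(lines)

def greedy_group(lines):
    """Repeatedly accept the merge hypothesis with the largest global
    cost decrease until no merge helps."""
    improved = True
    while improved and len(lines) > 1:
        improved, base, best = False, total_cost(lines), None
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                cand = lines[:i] + lines[i+1:j] + lines[j+1:] + [lines[i] + lines[j]]
                c = total_cost(cand)
                if c < base and (best is None or c < best[0]):
                    best = (c, cand)
        if best:
            lines, improved = best[1], True
    return lines
```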
Accuracy is measured by the recall metric, which is defined as the number of correct lines divided by the number of labeled lines in each page. The average recall rates for PerfectWD and CrudeWD are 0.93 and 0.87 respectively.
Discussion
The features used in grouping (calculating the global cost) are useful, and the iterative gradient-descent algorithm seems to work quite well. But I wonder whether it works equally well on sketches with text and shapes mixed. For example, people may write text on a line, or a table may not be drawn big enough, so they have to write through its edges (text cut by shape).
Tuesday, October 28, 2008
Monday, October 27, 2008
"PerceptuallySupported Image Editing of Text and Graphics"
by Eric Saund, David Fleet, Daniel Larner, James Mahoney
Summary
This paper presents ScanScribe, an image editing program that supports easy selection and manipulation of material in images of sketches.
Image material can be selected by rectangle, lasso, polygon, or by clicking on an object. Clicking on an object is available once image objects have been established within the program. Whether a drag is interpreted as a rectangle or a lasso depends on whether the selection path closes on itself. Polygon selection is activated by a mouse double-click.
When image objects are selected, a new bitmap layer is created with the object as foreground and a transparent background. ScanScribe uses a flat lattice for primitive object grouping, so an object may belong to any number of groups (e.g. both a row group and a column group in a table). Repeatedly clicking on an object cycles through its groups, selecting all the other objects in the same group. Groups can be created manually, or formed automatically after operations on a set of objects; an object is expelled from a group any time it is moved far from that group's spatial locus. A set of objects can also be merged into a single object.
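A minimal sketch of what such a flat grouping lattice might look like (my own illustration; names like Lattice and cycle_groups are hypothetical, not ScanScribe's API):

```python
from collections import defaultdict

class Lattice:
    """Flat grouping lattice: any object may belong to many groups,
    with no hierarchy among groups (unlike a strict tree of groups)."""
    def __init__(self):
        self.groups_of = defaultdict(set)   # object id -> group ids
        self.members_of = defaultdict(set)  # group id  -> object ids

    def add(self, obj, group):
        self.groups_of[obj].add(group)
        self.members_of[group].add(obj)

    def cycle_groups(self, obj, click_count):
        """Nth click on an object selects the Nth group it belongs to,
        returning all members of that group."""
        groups = sorted(self.groups_of[obj])
        if not groups:
            return {obj}
        group = groups[click_count % len(groups)]
        return set(self.members_of[group])

# A table cell can sit in both its row group and its column group.
lat = Lattice()
lat.add("cell_11", "row_1"); lat.add("cell_11", "col_1")
lat.add("cell_12", "row_1"); lat.add("cell_21", "col_1")
print(lat.cycle_groups("cell_11", 0))  # the column group: {'cell_11', 'cell_21'}
```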
Foreground/background separation is done through an iterative technique (high-pass filtering and distance measure on pixels' HSB values) to estimate and interpolate the color distribution of light background across the scene.
Handwritten text can be replaced by typed text: the user selects the bitmap region containing the handwritten text and types into a text entry box, and the typed text is then rendered as a bitmap.
Image recognition occurs in two stages. First, the original undifferentiated bitmap is segmented into primitive image objects by an automatic mechanism. Second, sensible groupings of these primitives are automatically formed and established as groups, or composite objects, in the lattice grouping structure. Stroke primitives are grouped according to rules following the principles of curvilinear smoothness and closure. Blob primitives (text) are grouped into composite objects reflecting words and lines of text based on spatial proximity and curvilinear alignment. Structure recognition is invoked in ScanScribe through an Edit menu item, and can be applied to only a selected region of the image.
Discussion
ScanScribe is an image editing tool that supports offline sketch recognition to some extent (basically grouping and structure), though the text "recognition" is done by manually typing into a text entry. The lattice used for grouping is a good idea, as it supports multiple group labels for each object. It would be a promising application if it could incorporate more sketch recognition techniques.
Tuesday, October 21, 2008
"Template-based Online Character Recognition"
by Scott D. Connell, Anil K. Jain
Summary
To perform classification of handwritten characters, a recognition system must leverage between-class variations while accommodating potentially large within-class variations, which are composed of two parts: between and within writing styles. The goal is to automatically learn these writing styles so that adequate character models can be built.
Background discusses five categories of online handwriting recognition techniques: (1) primitive decomposition, (2) motor models, (3) elastic matching, (4) stochastic models, (5) neural networks.
Preprocessing uses equidistant resampling followed by a Gaussian filter. The locations of endpoints and high-curvature points are preserved, and characters are normalized to the same height.
A stroke is represented as a string of, on average, 62 events. The distance between two strings is computed from distances between corresponding pairs of events, which requires that the events first be aligned. Each event is represented as (x, y) with respect to the first point of the first stroke of the digit, plus the angle of curvature (theta) at the sample point. The distance between two events is a weighted sum of their x, y, and theta distances; the distance between characters is the sum of distances between aligned events, minimized over all permissible alignments. Neighboring strokes in multi-stroke characters are connected so the character can be represented as a single string. Alignments are restricted to prevent 1-to-N or N-to-1 matching of events; when a point is skipped, a penalty (spurious or missing penalty) is added to the distance. A stroke-count-difference penalty is added to the final distance.
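A minimal sketch of this kind of string alignment distance, written as a small dynamic program (the weights and penalties here are placeholders; the paper's exact alignment restrictions and penalty values differ):

```python
import math

WX, WY, WTH = 1.0, 1.0, 0.5      # hypothetical feature weights
SKIP = 2.0                        # hypothetical spurious/missing penalty

def event_dist(a, b):
    """Weighted distance between two (x, y, theta) events."""
    return (WX * abs(a[0] - b[0]) + WY * abs(a[1] - b[1])
            + WTH * abs(a[2] - b[2]))

def string_dist(s, t):
    """Minimum-cost alignment of two event strings: match an event pair,
    or skip an event in either string at penalty SKIP (edit-distance DP)."""
    n, m = len(s), len(t)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:   # match s[i] with t[j]
                d[i+1][j+1] = min(d[i+1][j+1], d[i][j] + event_dist(s[i], t[j]))
            if i < n:             # skip s[i] (spurious event)
                d[i+1][j] = min(d[i+1][j], d[i][j] + SKIP)
            if j < m:             # skip t[j] (missing event)
                d[i][j+1] = min(d[i][j+1], d[i][j] + SKIP)
    return d[n][m]

a = [(0, 0, 0.0), (1, 0, 0.1), (2, 1, 0.5)]
b = [(0, 0, 0.0), (2, 1, 0.4)]
print(string_dist(a, b))   # 2.05: one skip (2.0) plus small match costs
```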
Two methods, cluster centers and training set editing, are used to select representative prototypes. Cluster centers: cluster the data and retain the prototype closest to each center (this can be seen as finding representative examples of writing styles). Training set editing: select only those templates that fall along class boundaries.
Next, the paper introduces nearest-neighbor and decision-tree classification. In constructing a decision tree, a "difference feature" is used, asking "Is character c more similar to reference i or to reference j?".
Accuracy was 86.9% for a 36-class set of alphanumeric characters.
Discussion
To me, "different writing styles" within one class seem essentially the same as different classes. The ability to distinguish between writing styles (which then serve as different classes?) comes from the fact that the training examples are automatically clustered. There are unsupervised clustering algorithms where the number of clusters K is decided automatically from the data set.
"Sketch Recognition for Computer-Aided Design"
by Christopher F. Herot
Summary
This paper talks about free-hand sketching where inferences can be drawn from user actions.
In the first step, raw data is converted into a list of lines and curves, using speed information to interpret the user's intent. Corners are the intersections of two lines. Curves are fit with B-splines, whose smoothness is a function of speed. The output of the line-curve finder is then passed to a program that processes overtracing, adding a thickness parameter. The next level of interpretation is performed by the latching program, which eliminates excess points. The latching radius is determined by speed, line length, and the density around each point.
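A small, hypothetical sketch of such an adaptive latching rule (the paper does not give this functional form; this only illustrates the idea that the snap radius grows with speed and length and shrinks with local density):

```python
import math

def latch_radius(speed, length, density, base=5.0):
    """Hypothetical adaptive snap radius: fast, long strokes latch more
    aggressively; dense regions latch less, to avoid joining distinct points."""
    return base * (1.0 + 0.1 * speed) * math.log(2.0 + length) / (1.0 + density)

def maybe_latch(p, q, speed, length, density):
    """Snap endpoint p onto q if it falls inside the adaptive radius."""
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    return q if dist <= latch_radius(speed, length, density) else p

print(maybe_latch((0, 0), (3, 4), speed=2.0, length=50.0, density=0.5))  # (3, 4)
```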
Drawing style is important, and suitable parameters can be determined for describing drawing styles. The sequence of drawing is also important. Parameters can be adjusted through editing operations (if the user repeatedly separates points joined by the latcher, the program can reduce the radius).
Discussion
For a paper from the 1970s, it is quite advanced. The adjustment of the latching radius based on speed, length, density, and editing operations is interesting.
Thursday, October 16, 2008
"Ink Features for Diagram Recognition"
by Rachel Patel, Beryl Plimmer, John Grundy, Ross Ihaka
Summary
Previous work uses top-down, bottom-up, and combined approaches to sketch recognition, but few systems support mixed text/shape sketches. This paper uses formal statistical analysis to identify key ink features and tries to tell text and shapes apart.
46 features are selected and grouped into 7 categories: size, time, intersections, curvature, pressure, OS recognition values, and inter-stroke gaps. 1519 strokes are collected, and statistical analysis determines whether each feature is significant in distinguishing shapes from text. The significant features are then arranged into a decision tree, with the most significant features near the root; the leaves of the tree are either TEXT or SHAPE. The rpart function (in R) is applied to the training data to find, for each feature, the optimal split position that yields the minimal number of misclassifications; the features that most accurately split the data into text and shape are deemed significant.
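A toy Python version of the split search performed at each tree node (an illustration of the idea only, not the actual rpart algorithm, which uses impurity measures and recursive partitioning):

```python
def best_split(values, labels):
    """Find the threshold on one feature that minimizes misclassifications
    when strokes are split into TEXT (below) vs SHAPE (above), trying the
    midpoints between consecutive sorted values as candidate thresholds."""
    pairs = sorted(zip(values, labels))
    best_t, best_err = None, len(pairs) + 1
    for k in range(1, len(pairs)):
        t = (pairs[k-1][0] + pairs[k][0]) / 2.0
        left  = [lab for v, lab in pairs if v <  t]
        right = [lab for v, lab in pairs if v >= t]
        # Each side predicts its majority class; count the mistakes.
        err = (len(left)  - max(left.count("TEXT"),  left.count("SHAPE"))
             + len(right) - max(right.count("TEXT"), right.count("SHAPE")))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# e.g. "time till next stroke": short gaps tend to be text, long gaps shape.
gaps   = [0.1, 0.2, 0.15, 1.2, 0.9, 1.5]
labels = ["TEXT", "TEXT", "TEXT", "SHAPE", "SHAPE", "SHAPE"]
print(best_split(gaps, labels))   # threshold 0.55, 0 misclassifications
```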
This divider outperforms the InkKit and Microsoft dividers in overall accuracy, though its misclassification rate for text on a new diagram set is a little higher than InkKit's (music notes were a confounding factor).
Significant features are:
1. time till next stroke (inter-stroke gaps)
2. speed till next stroke (inter-stroke gaps)
3. distance from last stroke (inter-stroke gaps)
4. distance to next stroke (inter-stroke gaps)
5. bounding box width (size)
6. perimeter to area (size)
7. amount of ink inside (size)
8. total angle (curvature)
Discussion
The idea of using rpart to select significant features and organize them into a decision tree is great. The misclassification rate on shapes is twice as high as that on text, meaning that many shapes are classified as text, mainly due to strokes such as music notes.
One thing to note: for English and some other Latin character sets, capitalized and non-capitalized characters may differ on some features. For non-capitalized characters, the ratio of curvature to bounding box width might be a useful feature (I don't know whether it correlates with curvature and bounding box width taken separately), as text tends to have high curvature relative to bounding box width. However, this feature may not be able to distinguish shapes from most capitalized characters, such as 'H', 'F', 'X', etc., especially on a per-stroke basis.
Wednesday, October 15, 2008
"Distinguishing Text from Graphics in On-line Handwritten Ink"
by Christopher M. Bishop, Markus Svensen, Geoffrey E. Hinton
Summary
This paper introduces an approach to classify strokes into text or graphics (non-text). It combines stroke-based features with temporal information in computing probabilities.
For each stroke, 11 features are extracted. A total least squares (TLS) model is fitted to the stroke; the stroke is then divided into fragments at points of local curvature maxima, and TLS is applied again to the largest resulting fragment. The features include: 1) stroke arc length, 2) total absolute curvature, 3) main direction given by TLS, 4) length-width ratio of the TLS fit, 5) total number of fragments found in the stroke, 6) arc length of the largest fragment, 7) total absolute curvature of the largest fragment, 8) main direction of the largest fragment, 9) length of the long side of the bounding rectangle of the largest fragment. Features 6-9 are based on the assumption that a large largest fragment (possibly the entire stroke) with a high length-to-width TLS ratio indicates a graphics stroke. A multilayer perceptron (MLP) is trained on these features to compute the probability p(tn|xn). Since the data distribution is biased towards text, the objective function used for fitting the model is adjusted to compensate.
Next, temporal context is used to achieve better performance. The identities of successive strokes tend to be correlated (a text stroke is usually followed by a text stroke, a graphics stroke by a graphics stroke). p(t1=1) = 0.5467, and a hidden Markov model (uni-partite HMM) of p(tn|tn-1) is built from the training data. The p(tn|xn) of the discriminative model and p(tn|tn-1) are then combined.
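A minimal sketch of combining per-stroke classifier posteriors with a two-state transition model via Viterbi decoding (the transition values below are placeholders, and the paper's exact combination of the discriminative and generative terms may differ):

```python
import numpy as np

# Placeholder transitions p(t_n | t_{n-1}); states: 0 = text, 1 = graphics.
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
prior = np.array([0.5467, 0.4533])          # p(t_1) as reported in the paper

def viterbi(posteriors):
    """Most likely stroke-label sequence given per-stroke MLP posteriors
    p(t_n | x_n), combined with the transition model in log space."""
    logp = np.log(np.asarray(posteriors))
    logA = np.log(A)
    n = len(logp)
    score = np.log(prior) + logp[0]
    back = np.zeros((n, 2), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + logA        # cand[s, s'] = best path into s'
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logp[i]
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# Four strokes: a lone weak "graphics" vote gets smoothed to text by context.
print(viterbi([[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.85, 0.15]]))  # [0, 0, 0, 0]
```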
Finally, gap information is incorporated (the characteristics of the gap between two text strokes differ from those of the gap between a text stroke and a graphics stroke). The gap features are: 1) the logarithm of the difference of the pen-down times of the surrounding strokes, 2) the x- and y-differences of the pen-down locations of the surrounding strokes, and 3) the x- and y-differences between the pen-up location of the preceding stroke and the pen-down location of the following stroke. Labels (tn = tn+1) and (tn != tn+1) can thus be assigned (bi-partite HMM).
Accuracy was 91.49% on the Cambridge data and 96.71% on the Redmond data (though 22.64% of graphics strokes were classified as text).
Discussion
This is a text vs. shape paper. It classifies strokes into text or graphics based on features of the strokes as well as on context. The results show that the rate of misclassifying graphics strokes as text strokes is relatively high. The HMM may not be helpful in sketches with frequent text-graphics switches.
Tuesday, October 14, 2008
"MathBrush: A Case Study for Pen-based Interactive Mathematics"
by George Labahn, Edward Lank, Mirette Marzouk, Andrea Bunt, Scott MacLean, and David Tausky
Summary
This paper introduces MathBrush, a pen-based interface supporting mathematical problem-solving tasks that are currently done on computer algebra systems (CAS). It recognizes and transforms two-dimensional handwritten mathematical expressions into one-dimensional text strings, which can then be passed to the back-end CAS.
MathBrush supports large and complex input and output expressions. It incorporates a context-sensitive pop-up menu based on a cursory analysis of the mathematical expression and the manipulations most likely to be performed on it (e.g. inverse and rank for matrices). It has an editing-in-place mechanism that allows interaction with CAS output, enabling separate operations on subexpressions. It supports two- and three-dimensional plotting, and the angles and limits of the plots can be adjusted. Users can select a different CAS to check for alternative answers to their problems.
Evaluation methods included think-alouds and semi-structured interviews. Equation entry is a three-step process: users first draw the mathematical equation, then interact with it in an input validation panel (IVP) to correct any character recognition errors, and finally render the equation to validate the structural analysis. Extraneous ink, especially dots, resulted in parsing errors. Features like scratch-out and translation (circling and moving terms around) were not used. The correction workflow (a separate input validation panel for character recognition followed by a rendering step presenting the structural analysis) caused confusion, since users had little insight into how structural analysis is performed. The mathematical interactions were a strength of the system.
Users tended to modify their writing to fit the tolerances of the recognizer (slowing down, taking care with symbol spacing, etc.).
Discussion
This paper focuses on introducing the UI; details of the recognition are not mentioned. I wonder whether ambiguity-resolution techniques are used in MathBrush, by which I mean that possible interpretations of the expression could be sent to the CAS, and interpretations that yield errors are unlikely to be what the user meant, so they could be eliminated at that point.
Thursday, October 9, 2008
"Sketch-Based Educational Games: 'Drawing' Kids Away from Traditional Interfaces"
by Brandon Paulson, Brian Eoff, Aaron Wolin, Joshua Johnston, Tracy Hammond
Summary
This paper introduces games developed using LADDER. These games help promote kinesthetic and tactile learning in children. The games include memory games, simple physics simulation games, games to learn shapes and geography, as well as a system for providing feedback about sentence diagramming.
User tests on graduate students gave positive feedback. User tests on children were delayed by IRB constraints.
Discussion
Interesting! I hope to see more complicated games in the future (like a real-time strategy game where players can draw their armies and give them instructions).
"Recognizing Free-form Hand-sketched Constraint Network Diagrams by Combining Geometry and Context"
by Tracy Hammond, Barry O’Sullivan
Summary
Constraint programming can be divided very crudely into modeling and solving. Modeling defines the problem: which variables are involved, and what constraints hold on them. Solving finds values for all the variables that simultaneously satisfy all the constraints. While solving can rely on embedded algorithms, libraries, and programming languages, a bottleneck exists in the modeling phase. This paper addresses that bottleneck by providing a free-hand sketching system.
The system uses a geometric approach to allow free-form drawing. It is built using LADDER, and the user interface is generated by GUILD. In the graph model of a constraint problem, nodes are CSP variables and edges between variables represent constraints. The set of possible values (the domain) each variable can take is specified by writing a set adjacent to a node. A constraint between two variables is specified by giving a direction on an edge and writing a relation adjacent to the edge. Text can be entered by handwriting or typing, with variable names (t, v, w, x, y, z) handwritten directly onto the diagram. The corresponding constraint satisfaction model is then generated automatically.
Nodes are recognized as ellipses that contain a variable name. An undirected link is recognized geometrically as a line; directed links are recognized as arrows extending between two nodes. Other shapes are the variable names T, V, W, X, Y, Z and the constraints less-than, equal, and not-equal.
Shapes that appear similar (less-than, V, W, arrow head) are distinguished by context. Two lines in each direction ("partners") are added in the system to allow flexible drawing order; for example, an arrow head coinciding with either end of the shaft will be recognized as an arrow. To keep recognition response times short, as soon as shapes are recognized their properties are computed and placed into a hash table. Thresholds are determined by combining absolute (within 10 pixels), relative (5% of stroke length), and perceptual ("coincidence": closer to the endpoint than to the midpoint) rules to give a more accurate representation of the error.
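A small sketch of how such a combined coincidence test might look (my reconstruction; the paper does not specify exactly how the three rules are combined, so the OR-combination here is an assumption):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def coincident(pt, endpoint, midpoint, stroke_len):
    """Combine absolute, relative, and perceptual rules: pt 'coincides'
    with the endpoint if it is within 10 px, or within 5% of the stroke
    length, or closer to the endpoint than to the stroke's midpoint."""
    d = dist(pt, endpoint)
    return (d <= 10.0                      # absolute rule
            or d <= 0.05 * stroke_len      # relative rule
            or d < dist(pt, midpoint))     # perceptual rule

# An arrow-head tip near the shaft's end counts as coincident.
print(coincident((102, 98), (100, 100), (50, 50), stroke_len=140.0))  # True
```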
Accuracy was 100%, but (1) the drawing UI gives feedback when shapes are drawn wrong, and (2) the test graphs were drawn by the designers of the system.
Discussion
This is an application of LADDER in a specific domain. The benefit of sketch recognition system over paper-based drawing is evident in that the sketch can be automatically converted to the model.
Tuesday, October 7, 2008
"Ambiguous Intentions: a Paper-like Interface for Creative Design"
by Mark D Gross, Ellen Yi-Luen Do
Summary
Designers prefer to use paper and pencil because it supports ambiguity, imprecision, and incremental formalization of ideas as well as rapid exploration of alternatives. Yet computers offer advantages: editing, 3D modeling, and rendering, as well as simulation, critiquing, case bases, and distance collaboration. The Electronic Cocktail Napkin introduced in this paper supports abstraction through end user programming of graphical rewrite rules, ambiguity by carrying alternative interpretations using context to resolve unintended ambiguities, and imprecision with constraint based interactive behavior.
Configurations are high level structures made up of primitive elements and some constraints (four boxes around a larger box recognized as a dining table). The program replaces low-level elements with an abstract configuration. Defining a configuration is a special event, so the Napkin provides a special pattern definition window where it identifies the element types and spatial relations both graphically and as symbolic expressions. Recognizing configurations depends on context.
Alternative interpretations of a drawing element or configuration are carried until the ambiguity can be resolved by additional information found elsewhere in the drawing or later in time, or until it is explicitly resolved by the user. The ambiguity can be resolved later in one of two ways: the user can identify the element by drawing over it, or the program may find that the glyph is part of a known configuration, in a position that resolves the ambiguity. Contexts map primitive elements to different interpretations; conversely, a single unique element or configuration can often determine the context.
Representing imprecision with constraints: for example, after recognizing a diagram as a graph of nodes and arcs, the Napkin establishes connection constraints, so the user can drag the nodes and the graph connectivity remains intact. The Napkin identifies spatial relations among drawing elements, but its schemes must determine which relationships are incidental and which are intended. Only context-relevant constraints are asserted, but users can also add, delete, and modify constraints. It is important to support incremental refinement and formalization of a design.
The low-level recognizer identifies the best match(es) in a given context for an input glyph, along with a certainty value (1-5, with 1 the most certain). A glyph is time-delimited: it ends when the user lifts the pen for longer than a timeout period, so a glyph can include more than one stroke. A 3x3 grid, numbered 1-9, is inscribed within the bounding box of a glyph, and the stroke path is described as a sequence of grid cells; for example, [1 2 3 6 9 8 7 4] represents a clockwise circle or box, and a corner count helps to distinguish shapes with the same pen path. Corners are identified when the change in heading between segments of raw points exceeds 45 degrees. Pen path, the aspect ratio and size of the bounding box, and the stroke and corner counts are the features of a glyph. A glyph template records constraints on acceptable values (e.g. a set of possible corner counts {3 4 5}) and contains a slot indicating which transformations of the glyph are permitted. Matching proceeds as follows: (1) templates matching the input glyph's pen-path sequence and all other features are selected; if only one candidate is found it gets certainty 1, and if there are multiple candidates each gets certainty 2. (2) The path sequence is permuted to match rotated, backwards, or reflected cases, again assigning certainty 1 or 2. (3) Approximate matches whose sequences differ by one or two squares are assigned certainty 3 or 4 (if multiple). (4) If no match is found, certainty 5 is assigned.
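A tiny sketch of the 3x3 pen-path encoding (an illustration of the encoding only; the template permutation and approximate-matching steps are omitted):

```python
def grid_sequence(points, bbox):
    """Encode a stroke as the sequence of 3x3 grid cells (numbered 1-9,
    row-major) that its points visit, dropping consecutive repeats."""
    (x0, y0, x1, y1) = bbox
    seq = []
    for x, y in points:
        col = min(2, int(3 * (x - x0) / max(x1 - x0, 1e-9)))
        row = min(2, int(3 * (y - y0) / max(y1 - y0, 1e-9)))
        cell = 3 * row + col + 1
        if not seq or seq[-1] != cell:
            seq.append(cell)
    return seq

# A rough clockwise box: top edge, right edge, bottom edge, left edge.
box = [(0, 0), (5, 0), (9, 0), (9, 5), (9, 9), (5, 9), (0, 9), (0, 5)]
print(grid_sequence(box, (0, 0, 9, 9)))   # [1, 2, 3, 6, 9, 8, 7, 4]
```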
To identify configurations, the program runs recognizer functions over all pairs of appropriately typed elements in the diagram, looking for these patterns. Elements are replaced by the configuration, and the configuration may then be parsed as an element of a larger 'graph' configuration. Each configuration recognizer is a Lisp function expressing constraints on the types of elements and their spatial relations. The Napkin also provides text and graphic views for adjusting the configuration recognizers.
The Napkin maintains a list of all known contexts, a 'current context', and a 'current context chain' (the search path for contextual recognition). A context has four components: glyph templates, configuration recognizers, spatial relations, and mappings of glyphs from other contexts. The first recognition attempt is made in the current context, but all contexts in the current context chain are consulted. If there are several matches, they are returned as a list of alternatives in the order their contexts appear in the chain. The Napkin's initial context is the most general one, which includes only shapes, alphanumeric characters, and lines. As soon as the user draws a glyph or makes a configuration that the Napkin can identify as belonging to a more specific context, the Napkin adjusts the current context and the current context chain.
Discussion
The low-level recognition is feature-based, while the high-level configurations are defined in a way somewhat like that used in LADDER. The combination of context with recognition of the current context is the highlight of this system. The way ambiguity and imprecision are resolved reminds me of Sudoku. Although this system only supports a small number of primitive shapes, the concept of supporting abstraction, ambiguity, and imprecision is good.
"LADDER, a sketching language for user interface developers"
by Tracy Hammond, Randall Davis
Summary
Sketch recognition systems can be quite time consuming to build if they are to handle the intricacies of each domain. With LADDER, a developer need only write a domain description which describes what the domain shapes look like and how they should be displayed and edited after they are recognized. The language consists of predefined shapes, constraints, editing behaviors, and display methods, as well as a syntax for specifying a domain description. LADDER allows the developer to specify both hard and soft constraints. It is the first language that can define not only how shapes are to be recognized, but also how they are displayed and edited; triggers and actions allow user-defined editing.
Description limitations: (1) LADDER can only describe shapes with a fixed graphical grammar (drawn with the same graphical components each time). (2) The shapes must be composed solely of the primitive shapes and constraints available in LADDER. (3) LADDER can only describe domains that have few curves, or where curve details are not important for distinguishing between shapes. (4) LADDER can only describe shapes that have a lot of regularity and not too much detail.
Shapes are defined by a list of components, their aliases, constraints that define the relationships of the components, editing behaviors (triggers and actions), and display methods (original, cleaned-up, ideal shape, custom shape). Shapes can be defined hierarchically, and a shape can extend an abstract shape with an is-a section (like extends in Java). Shape groups (and abstract shape groups) can be used by the recognition system to provide top-down recognition and "chain reaction" editing behaviors.
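To give a feel for the structure, here is a hypothetical, Python-ified rendition of what an arrow description might carry (the real language uses its own textual syntax, not Python; only the constraint, trigger, and action names are taken from the paper's lists):

```python
# Hypothetical data-structure sketch of a LADDER-style shape description.
arrow = {
    "name": "Arrow",
    "components": {            # sub-shapes the arrow is built from
        "shaft": "Line",
        "head1": "Line",
        "head2": "Line",
    },
    "constraints": [           # predefined LADDER constraint names
        ("coincident", "shaft.p2", "head1.p1"),
        ("coincident", "shaft.p2", "head2.p1"),
        ("equalLength", "head1", "head2"),
        ("acuteMeet", "head1", "shaft"),
    ],
    "editing": [               # trigger -> action pairs
        ("holdDrag", "rubberBand"),
        ("doubleClick", "showHandle"),
    ],
    "display": "clean-up",     # draw the beautified version when recognized
}
```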
The language includes the primitive shapes SHAPE, POINT, PATH, LINE, BEZIERCURVE, CURVE, ARC, ELLIPSE, and SPIRAL, also includes predefined shapes built from these primitives including RECTANGLE, DIAMOND, etc. SHAPE (properties: boundingbox, centerpoint, width, and height) is an abstract shape which all other shapes extend.
A number of predefined constraints are included in the language, including perpendicular, parallel, collinear, sameSide, oppositeSide, coincident, connected, meet, intersect, tangent, contains, concentric, larger, near, drawOrder, equalLength, equal, lessThan, lessThanEqual, angle, angleDir, acute, obtuse, acuteMeet, and obtuseMeet. There are also constraints that are valid only in a particular orientation, including horizontal, vertical, posSlope, negSlope, leftOf, rightOf, above, below, sameXPos, sameYPos, aboveLeft, aboveRight, belowLeft, belowRight, centeredBelow, centeredAbove, centeredLeft, centeredRight, and angleL. isRotatable implies the shape can be found in any orientation.
There are predefined editing behaviors like dragInside. The possible editing actions include wait, select, deselect, color, delete, translate, rotate, scale, resize, rubberBand, showHandle, and setCursor. The possible triggers include click, doubleClick, hold, holdDrag, draw, drawOver, scribbleOver, and encircle.
The keyword vector can be used to specify a variable number of components; the minimum and maximum number of components must be specified.
The domain-independent modules determine whether a stroke can be classified as an ELLIPSE, LINE, CURVE, ARC, POINT, POLYLINE, or some combination. Interpretations from the low-level recognizers are produced and sent to the higher-level recognizer. To ensure that only one interpretation is chosen, each shape has an ID and keeps a list of its subshapes, including its strokes; at any particular time, each subshape is allowed to belong to only one final recognized domain shape. However, if the primitive shape recognizer does not produce the correct interpretation of a stroke, the domain shape recognizer will never be able to correctly recognize a higher-level shape using that stroke.
Domain shape recognition is performed by the Jess rule-based system. To improve efficiency, a greedy algorithm is used that removes subshapes from the rule-based system once a final higher level shape is chosen.
If an editing gesture such as click-and-hold or double-click occurs, the system checks (1) whether an editing gesture with that trigger is defined for any shape, and (2) whether the mouse is over the shape the gesture is defined for. If so, the drawing gesture is short-circuited and the editing gesture takes over.
To display a shape's ideal strokes, the system uses a shape constraint solver (built on optimization functions from Mathematica) which takes a shape description and initial locations for all of the subshapes and outputs the shape with all of the constraints satisfied, moving the points as little as possible.
Domain shape recognizers, exhibitors, and editors are automatically generated during the translation process.
Discussion
The system is very powerful. Shapes are defined hierarchically, much like in OO languages. User-defined editing behaviors and display modes are also great.
Summary
Sketch recognition systems can be quite time consuming to build if they are to handle the intricacies of each domain. With LADDER, a developer need only write a domain description which describes what the domain shapes looklik e, and how they should be displayed and edited after they are recognized. The language consists of predefined shapes, constraints, editing behaviors, and display methods, as well as a syntax for specifying a domain description. LADDER allows the developer to specify both hard and soft constraints. LADDER is the first language that not only can define how shapes are to be recognized, but also can define how shapes are displayed and edited. Trigger and action allows user defined editing.
(drawn usingDescription limitations: (1) LADDER can only describe shapes with a fixed graphical grammar (drawn with the same graphical components each time). (2) The shapes must be composed solely of the primitive constraints contained in LADDER and using only the constraints available in LADDER. (3) LADDER can only describe domains that have few curves or where the curve details are not important for distinguishing between different shapes. (4) LADDER can only describe shapes that have a lot of regularity and not too much detail.
Shapes are defined by a list of components, their aliases, constraints that define relationship of the components, editing behaviors (triggers and actions), and display methods (original, clean-up, ideal shape, custom shape). Shapes can be defined hierarchically. Also can extend abstract shape with is-a section (like extends in java). Shape groups (abstract shape groups) can be used by the recognition system to provide top-down recognition, and ‘‘chain reaction’’ editing behaviors.
The language includes the primitive shapes SHAPE, POINT, PATH, LINE, BEZIERCURVE, CURVE, ARC, ELLIPSE, and SPIRAL, also includes predefined shapes built from these primitives including RECTANGLE, DIAMOND, etc. SHAPE (properties: boundingbox, centerpoint, width, and height) is an abstract shape which all other shapes extend.
A number of predefined constraints are included in the language, including perpendicular, parallel, collinear, sameSide, oppositeSide, coincident, connected, meet, intersect, tangent, contains, concentric, larger, near, drawOrder, equalLength, equal, lessThan, lessThanEqual, angle, angleDir, acute, obtuse, acuteMeet, and obtuseMeet. constraints that are valid only in a particular orientation, including horizontal, vertical, posSlope, negSlope, leftOf, rightOf, above, below, sameXPos, sameYPos, aboveLeft, aboveRight, belowLeft, belowRight, centeredBelow, centeredAbove, centeredLeft, centeredRight, and angleL. isRotatable implies the shape can be found in any orientation.
There are predefined editing behaviors like dragInside. The possible editing actions include wait, select, deselect, color, delete, translate, rotate, scale, resize, rubberBand, showHandle, and setCursor. The possible triggers include click, doubleClick, hold, holdDrag, draw, drawOver, scribbleOver, and encircle.
Key word vector can be used to specify variable number of components, and must specify the minimum and maximum number of components.
The domain independent modules determine if the stroke can be classified as an ELLIPSE, LINE, CURVE, ARC, POINT, POLYLINE or some combination. Interpretations from low-level recognizers are produced and sent off to the higher level recognizer. In order to ensure that only one interpretation is chosen, each shape has an ID, and each shape keeps a list of its subshapes, including its strokes. At any particular time, each subshape is allowed to belong to only one final recognized domain shape. But if the primitive shape recognizer does not provide the correct interpretation of a stroke, the domain shape recognizer will never be able to correctly recognize a higher level shape using this stroke.
Domain shape recognition is performed by the Jess rule-based system. To improve efficiency, a greedy algorithm is used that removes subshapes from the rule-based system once a final higher level shape is chosen.
If an editing gesture such as click-and-hold or double-click occurs, the system checks to see (1) if an editing gesture for that trigger is defined for any shape, and (2) if the mouse is over the shape the gesture is defined for. If so then the drawing gesture is short-circuited and the editing gesture takes over.
To display a shape’s ideal strokes, the system uses a shape constraint solver (using optimization functions from Mathematica) which takes in a shape description and initial locations for all of the subshapes and outputs the shape with all of the constraints satisfied while moving the points as little as possible.
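The paper uses optimization functions from Mathematica; as a hedged stand-in, here is a minimal Python sketch of the same idea using scipy, where the constraints are residual functions that are zero when satisfied. The penalty weighting and the perpendicularity example are my own simplifications, not the paper's solver.

import numpy as np
from scipy.optimize import minimize

def beautify(points, residuals, penalty=100.0):
    # Move the points as little as possible while (softly) satisfying
    # the shape's constraints -- a penalty-method stand-in for the
    # Mathematica-based solver described above.
    init = np.asarray(points, dtype=float)

    def cost(x):
        pts = x.reshape(-1, 2)
        movement = np.sum((pts - init) ** 2)          # minimal displacement
        violation = sum(r(pts) ** 2 for r in residuals)
        return movement + penalty * violation

    return minimize(cost, init.ravel()).x.reshape(-1, 2)

# Example residual: points 0-1 and 2-3 should form perpendicular lines.
# Usage: cleaned = beautify(raw_points, [perpendicular])
def perpendicular(pts):
    v1, v2 = pts[1] - pts[0], pts[3] - pts[2]
    return float(np.dot(v1, v2))                      # 0 when perpendicular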
Domain shape recognizers, exhibitors, and editors are automatically generated during the translation process.
Discussion
The system is quite powerful. Shapes are defined hierarchically, much like classes in OO languages, and the user-defined editing behaviors and display modes are also great.
Thursday, October 2, 2008
"What!?! No Rubine Features?: Using Geometric-based Features to Produce Normalized Confidence Values for Sketch Recognition"
by Brandon Paulson, Pankaj Rajan, Pedro Davalos, Ricardo Gutierrez-Osuna, Tracy Hammond
Summary
Gesture-based sketch recognition has the benefit of using mathematically sound classifiers which produce fast classifications along with normalized confidence values, but has the disadvantage of using feature sets which are user-dependent and require individual training by each user to give good recognition results. Geometric-based techniques classify sketches based on what they really look like, but inferences about generalization are hard to make because classification is not statistical, and ranking alternative interpretations is difficult. This paper tries to combine the two approaches.
The 31 features used in PaleoSketch and the 13 features used in Rubine are fed to a statistical quadratic classifier, as listed below. Sequential forward selection (SFS) is then used to find an optimal feature subset: 10 folds of SFS are performed, each using the sketches of a randomly chosen half of the users as training data and the rest for testing, producing 10 candidate optimal subsets. Additional candidate subsets are then built from the features that appear in 100%, 100%+90%, 100%+90%+80%, ... of those 10 subsets. The overall optimal subset contains the features that appear at least 50% of the time in this multi-fold selection process; as a result, only one Rubine feature makes the optimal subset. (A sketch of the SFS loop follows the feature list below.)
=========FROM PALEOSKETCH===========
1. Endpoint to stroke length ratio (100%)
2. NDDE (90%)
3. DCR (90%)
4. Slope of the direction graph (20%)
5. Maximum curvature (40%)
6. Average curvature (30%)
7. # of corners (30%)
8. Line least squares error (0%)
9. Line feature area error (40%)
10. Arc fit: radius estimate (0%)
11. Arc feature area error (20%)
12. Curve least squares error (90%)
13. Polyline fit: # of sub-strokes (70%)
14. Polyline fit: percent of sub-strokes pass line test (50%)
15. Polyline feature area error (80%)
16. Polyline least squares error (30%)
17. Ellipse fit: major axis length estimate (20%)
18. Ellipse fit: minor axis length estimate (30%)
19. Ellipse feature area error (10%)
20. Circle fit: radius estimate (30%)
21. Circle fit: major axis to minor axis ratio (80%)
22. Circle feature area error (0%)
23. Spiral fit: avg. radius/bounding box radius ratio (60%)
24. Spiral fit: center closeness error (70%)
25. Spiral fit: max distance between consecutive centers (20%)
26. Spiral fit: average radius estimate (10%)
27. Spiral fit: radius test passed (1.0 or 0.0) (40%)
28. Complex fit: # of sub-fits (60%)
29. Complex fit: # of non-polyline primitives (50%)
30. Complex fit: percent of sub-fits that are lines (90%)
31. Complex score / rank (50%)
=========FROM RUBINE===========
32. Cosine of the starting angle (30%)
33. Sine of the starting angle (10%)
34. Length of bounding box diagonal (20%)
35. Angle of the bounding box diagonal (40%)
36. Distance between endpoints (10%)
37. Cosine of angle between endpoints (0%)
38. Sine of angle between endpoints (10%)
39. Total stroke length (20%)
40. Total rotation (100%)
41. Absolute rotation (10%)
42. Rotation squared (10%)
43. Maximum speed (20%)
44. Total time (30%)
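As promised above, here is a minimal Python sketch of the SFS loop, using scikit-learn's quadratic discriminant classifier as a stand-in for the paper's statistical quadratic classifier; the simple 5-fold cross-validation here replaces the authors' per-user splits, so this is an illustration of the technique rather than their exact setup.

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, max_features=None):
    # Greedily add the single feature that most improves classifier
    # accuracy; stop when no remaining candidate improves the score.
    n_features = X.shape[1]
    selected, last_score = [], -np.inf
    remaining = set(range(n_features))
    while remaining and (max_features is None or len(selected) < max_features):
        score, feat = max(
            (cross_val_score(QuadraticDiscriminantAnalysis(),
                             X[:, selected + [f]], y, cv=5).mean(), f)
            for f in remaining)
        if score <= last_score:
            break                     # no further improvement
        selected.append(feat)
        remaining.remove(feat)
        last_score = score
    return selected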
Discussion
This paper shows a principled way to select features, and the experimental method is well designed. It is somewhat like a brute-force search for the combination that yields the highest accuracy.
Wednesday, October 1, 2008
"Backpropagation Applied to Handwritten Zip Code Recognition"
by Y. LeCun, B. Boser, et al
Summary
This paper introduces a neural network based approach to recognizing handwritten zip codes.
Training and testing data were collected from a post office in Buffalo: 9298 segmented numerals altogether, with 7291 used for training and the remaining 2007 saved for testing. All zip codes are preprocessed by separating digits from their neighbors and applying a linear transformation to convert them into 16x16 pixel images, with grey levels scaled to the range [-1, 1].
The input to the network is a 16x16 image and the output is 10 units, one per class. The precise location of a feature is not relevant to the classification, but its approximate location is. To detect a particular feature at any location, the network uses the same set of weights everywhere ("weight sharing"). The network is composed of three hidden layers:
H1: 12 feature maps of 8x8 units; each unit takes a 5x5 field of the 16x16 image as input. Units in H1 that are one unit apart have receptive fields 2 pixels apart, so the image is undersampled and some position information is discarded. Connections extending past the boundaries take their input from a virtual background plane whose value is -1. All units in a map use the same set of 25 weights on their 5x5 input fields, but each unit in H1 has its own bias. (19,968 connections and 1,068 free parameters.)
H2: 12 feature maps of 4x4 units, built on H1 in the same way H1 is built on the input, except that each unit in H2 combines 8 of the 12 feature maps in H1. The receptive fields are again 5x5, with identical weight vectors within a map and a different bias for each unit. (38,592 connections and 2,592 free parameters.)
H3: 30 units fully connected to the units in H2. (192 x 30 weights + 30 biases = 5,790 connections.)
Output: 10 units fully connected to units in H3, adding another 310 weights.
The nonlinear function used at each node was a scaled hyperbolic tangent. The target values for the output units were chosen within the quasilinear range of the sigmoid. The weights were initialized with random values ranging between -2.4/Fi and 2.4/Fi, where Fi is the fan-in (number of inputs) of the unit.
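To make the weight sharing concrete, here is a numpy sketch of one H1 feature map as described above. This is my own illustration, not the original code; the tanh scaling constants are the ones commonly quoted for this network and are an assumption on my part.

import numpy as np

def h1_feature_map(image, kernel, biases, stride=2):
    # One of H1's 12 feature maps: the same 5x5 kernel slides over the
    # 16x16 input with stride 2 (units one apart -> fields 2 pixels
    # apart), while every one of the 8x8 units keeps its own bias.
    padded = np.full((20, 20), -1.0)        # virtual background plane = -1
    padded[2:18, 2:18] = image
    out = np.empty((8, 8))
    for i in range(8):
        for j in range(8):
            field = padded[i * stride:i * stride + 5,
                           j * stride:j * stride + 5]
            out[i, j] = np.sum(field * kernel) + biases[i, j]
    # Scaled hyperbolic tangent squashing (assumed constants).
    return 1.7159 * np.tanh(2.0 / 3.0 * out)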
The network was trained for 23 passes through the training set (167,693 pattern presentations) using backpropagation. The error rate was 0.14% on the training data and 5% on the test data; rejecting 12.1% of the test patterns brought the error rate on the remainder down to 1%. Overall, it outperforms previous work in both accuracy and speed.
Discussion
To be honest, I know little about neural networks. But the system seems to use a hierarchy of feature maps to undersample sketches, and training adjusts the weights so that an input follows the path that maximizes activation toward the corresponding output. However, this may not work equally well on non-digit sketches, and it is also worth noting that the system uses grey-level images in [-1, 1], while the sketches we are considering are binary images most of the time. In any case, this is a different solution from the ones we've covered so far.
"Constellation Models for Sketch Recognition"
by D. Sharon and M. van de Panne
Summary
The authors introduce a vision-based sketch recognition approach. It uses a pictorial structure, or constellation, model to recognize strokes in sketches of particular classes. The system makes two assumptions: (1) similar parts are drawn with similar strokes; (2) mandatory parts must have exactly one instance in the sketch, while optional parts may have multiple instances.
The constellation model consists of features of individual object parts as well as features of pairs of parts. Individual features capture shape and global position, whereas pairwise features (computed only when at least one part is mandatory) capture relative position. The sketch recognition process has two phases: (1) search the space of possible mandatory label assignments; (2) search for optional labels for the remaining unlabelled strokes. Both are scored with probability distributions over the features.
The labels are searched using a branch-and-bound tree for maximum likelihood (ML) search, with mandatory parts labelled first. A node at depth i of the tree is extended by evaluating all possible assignments of mandatory label i+1 to unlabelled strokes. To speed up the search, multipass thresholding and hard constraints are introduced. If a node's likelihood, computed from its current partial set of label assignments, falls below a threshold, that search branch is terminated. Hard constraints (e.g. the nose must be above the mouth) are relationships that must be met for certain pairs, since all training examples meet these relationships between their corresponding parts.
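Here is a simplified Python sketch of that search, assuming a log_likelihood function that scores a partial assignment is given; the multipass thresholding and hard constraints are collapsed into a single pruning threshold, so this is an illustration of the idea rather than the paper's exact algorithm.

import heapq
from itertools import count

def search_mandatory(strokes, labels, log_likelihood, threshold):
    # Best-first branch-and-bound over assignments of mandatory labels
    # to strokes; depth i fixes labels[0..i-1], and branches whose
    # partial likelihood drops below `threshold` are pruned.
    tie = count()                      # heap tiebreaker for equal scores
    heap = [(0.0, next(tie), {})]
    best_score, best_assign = float("-inf"), None
    while heap:
        neg, _, assign = heapq.heappop(heap)
        if len(assign) == len(labels):
            if -neg > best_score:
                best_score, best_assign = -neg, assign
            continue
        label = labels[len(assign)]
        used = set(assign.values())
        for s in strokes:
            if s in used:
                continue               # each stroke gets at most one label
            new = {**assign, label: s}
            score = log_likelihood(new)
            if score < threshold:
                continue               # prune weak partial assignments
            heapq.heappush(heap, (-score, next(tie), new))
    return best_assign, best_score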
Discussion
This approach recognizes objects based on the relative locations of the strokes in a sketch. However, the feature vector only uses stroke bounding boxes, which means that, on the one hand, any strokes that fit the corresponding bounding boxes and maintain the specified relationships will be recognized as the object no matter how each stroke is drawn; on the other hand, even if the sketch looks the same, if a part is not drawn in the expected way (e.g. a nose drawn in one stroke versus two strokes), the whole sketch may not be successfully recognized.
"Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches"
by Michael Oltmans
Summary
Chapter1 Introduction
The approach is based on a representation of a sketched shape in terms of the visual parts it is made of. It is developed to recognize shapes in free sketching, which differs from sketching with feedback. When the goal is to jot something down quickly or explore early design possibilities, a less intrusive style of interaction is preferable. Feedback lets the user repair recognition errors early, lets touch-up strokes be cleaned up before the next shape is drawn, and can teach the user what the recognition system is capable of recognizing; freely drawn sketches offer none of these advantages.
The principal challenges are:
(1) Shapes vary on both a signal and a conceptual level. (noises and conceptual variation, same thing drawn differently)
(2) Users draw strokes that overtrace previous strokes, and that may not be necessary to depict the shape being drawn. (doesn't change appearance, so visual approach is well suited)
(3) Segmenting strokes into groups corresponding to individual shapes is a complex task. (grouping according to the order they were drawn is not sufficient)
Chapter2 Representation of Visual Parts
The part-based representation is used to train a classifier to distinguish between shapes based on their parts. A visual part is an abstraction of the appearance of a region of the shape (e.g. the top right-hand corner of a rectangle). Each visual part is a circular region (a "bullseye") subdivided into rings and wedges, like a dartboard. The rings are separated equally in log-space, so the inner bins are smaller and capture more detailed information about the appearance, while the outer bins capture more general information. A shape context is calculated every 5 pixels along the strokes of the shape.
The histogram boundaries are rotated relative to the interest point's orientation to make the bullseyes rotationally invariant; the bins are rotated by an additional half bin-width to prevent the stroke from lying exactly along a boundary. The stroke direction relative to the orientation of the bullseye (the tangent computed over a window of 19 points, quantized into four orientations: 0-45, 45-90, 90-135, 135-180 degrees) is added as a third dimension of the histogram, giving a count of how many points fall into each spatial bin at each direction (so a single bin can distinguish a vertical line from a horizontal one).
Stroke preprocessing: first scale the shape so that the maximum distance between any two points is 75 pixels, then resample the points to have approximately constant separation along the stroke, and finally interpolate additional points so that no two consecutive points are more than 1 pixel apart.
The distance between two bullseyes is calculated by comparing their vectors of bin counts with a chi-square-style measure: delta(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i). Since the outer bins are less important and usually contain more points, the total weight of each histogram is normalized to 1 to avoid differences arising purely from the number of points in a region.
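Here is a hedged numpy sketch of the spatial part of the bullseye and the distance above; the direction dimension from the previous paragraph is omitted, and the ring radii are an assumed parameter.

import numpy as np

def bullseye_histogram(points, center, ring_radii, n_wedges=12):
    # Log-polar 'dartboard' histogram around one interest point; points
    # beyond the outermost radius fall into an extra overflow ring.
    d = np.asarray(points, float) - np.asarray(center, float)
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    ring = np.searchsorted(ring_radii, r)
    wedge = ((theta / (2 * np.pi)) * n_wedges).astype(int) % n_wedges
    hist = np.zeros((len(ring_radii) + 1, n_wedges))
    np.add.at(hist, (ring, wedge), 1)
    hist = hist.ravel()
    return hist / hist.sum()           # total weight normalized to 1

def bullseye_distance(p, q, eps=1e-12):
    # The chi-square-style measure from the text:
    # delta(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i)
    return float(np.sum((p - q) ** 2 / (p + q + eps)))

# Ring boundaries equally spaced in log-space, e.g.:
# ring_radii = np.logspace(np.log10(2.0), np.log10(40.0), num=4)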
Chapter3 Representation of Shapes
Codebook: calculate the collection of bullseye parts for each shape in the training set, cluster them into groups of similar parts, and use a representative from each cluster center as a codebook entry. A quality-threshold (QT) clusterer is used, which forms clusters such that the distance between any two members of a cluster is under a threshold. To improve efficiency, instead of trying every point as a cluster seed, the authors randomly select 20 points as seeds and keep the largest cluster formed from those 20 seeds; they also randomly select 1000 bullseye parts from the training set instead of using all of the calculated parts. Clustering terminates after finding the 200 largest clusters.
To recognize a sketched shape, it is represented by the parts it contains, and a classifier is trained to distinguish shapes based on those parts. First, visual parts are calculated at a sampling of locations covering the sketched symbol (typically about 50 parts per shape). Next, the symbol is represented in terms of the standard vocabulary of parts: each codebook part is compared to each part calculated from the sketched symbol, yielding a distance matrix. A match vector is then formed by taking the minimum distance in each row of the matrix (a small match value means the codebook part is present in the input shape). Using this match-vector representation, a classifier is trained to learn the differences between shape classes and the allowable variation within each class.
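A compact Python sketch of both steps follows, under stated assumptions: qt_codebook simplifies the thesis's QT criterion to seed-centered clusters, and distance is any part distance such as the bullseye distance from before.

import random

def qt_codebook(parts, distance, diameter, n_clusters=200, n_seeds=20):
    # Simplified quality-threshold clustering: grow candidate clusters
    # around a few random seeds, keep the largest, and repeat.
    remaining = list(parts)
    codebook = []
    while remaining and len(codebook) < n_clusters:
        best_cluster, best_seed = [], None
        for seed in random.sample(remaining, min(n_seeds, len(remaining))):
            cluster = [p for p in remaining if distance(seed, p) <= diameter]
            if len(cluster) > len(best_cluster):
                best_cluster, best_seed = cluster, seed
        codebook.append(best_seed)
        dropped = {id(p) for p in best_cluster}
        remaining = [p for p in remaining if id(p) not in dropped]
    return codebook

def match_vector(codebook, shape_parts, distance):
    # One entry per codebook part: its minimum distance to any part of
    # the input shape (the row minima of the distance matrix).
    return [min(distance(c, p) for p in shape_parts) for c in codebook]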
The visual parts insulate the shape descriptions from signal noise in two ways: the binning is deliberately fuzzy, and parts are matched by a distance measure that is low for similar, not necessarily identical, regions.
Chapter4 Recognition
With a representation of fixed cardinality and ordering corresponding to the codebook entries, a support vector machine (SVM) can be trained to learn a decision boundary separating typical match vectors of one shape class from those of another. To assign each input shape a single label from the set of possible classes, a one-vs-one strategy is used: one classifier is trained to distinguish each pair of shape classes, giving n(n-1)/2 binary classifiers for n classes. For an input shape, each classifier's result counts as a vote for the class it predicts, and the shape is assigned the class with the most votes (see the sketch below).
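sklearn's SVC already does one-vs-one internally, but spelling the voting out makes the scheme explicit; this sketch assumes match vectors as rows of X and class labels in y.

from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    # One binary SVM per pair of shape classes: n(n-1)/2 classifiers.
    X, y = np.asarray(X), np.asarray(y)
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        classifiers[(a, b)] = SVC().fit(X[mask], y[mask])
    return classifiers

def predict_by_vote(classifiers, x):
    # Each pairwise classifier votes; the most-voted class wins.
    votes = {}
    for clf in classifiers.values():
        c = clf.predict(np.asarray(x).reshape(1, -1))[0]
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)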
Shape localization involves several steps: selecting candidate regions, classifying the candidate regions, and combining the classified regions into final predictions. Candidate regions are selected by scanning rectangular windows of several sizes over the input sketch, every 10 pixels along the stroke. The classifier is trained with an additional WIRE class to identify wire segments. The candidate finder generates many overlapping candidates, so each shape in the image is partially contained in many candidate regions. The algorithm for combining the initial candidates into a final set of predictions repeats two steps for each shape class: (1) forming initial clusters of candidates and (2) splitting clusters that are too large into multiple predictions. The final predictions follow a greedy strategy based on the total weight of each cluster, which is computed by EM (expectation maximization) from the weights of the individual candidates. Highly weighted clusters are generally tightly grouped with many high-scoring candidates, so clusters with high weights are usually correct, whereas clusters with low weights are less likely to be.
Chapter5 Evaluation
The isolated shape recognizer and the full sketch processor are evaluated separately. On analog circuit drawings the recognition rate is 89.5%; on the HHreco data set of PowerPoint-style shapes it is 94.4%. One limitation is difficulty distinguishing shapes that contain the same low-level features in different combinations. The full sketch processor correctly identified 92.3% of the regions.
Discussion
This paper uses a vision-based approach that successfully tackles the noise and conceptual variation of shapes in free sketching, and the full sketch processor that identifies the locations and scales of components also achieves quite high accuracy. But the fuzziness also makes it unable to distinguish shapes built from the same set of parts connected differently. Also, this is effectively an instance-based recognizer: the input must be compared with all the codebook templates, so it slows down quickly as new classes are added, and each time a new class is added all the classifiers must be retrained. For sketches unlike analog circuits, where small components are connected by wires, this recognizer may not work as well (for example, a sketch with a circle inside a rectangle).
"A Domain-Independent System for Sketch Recognition"
by Bo Yu, Shijie Cai
Summary
This paper utilizes low-level geometric features to handle smooth curves and hybrid shapes as well as polylines. Previous sketch recognition systems mainly have two deficiencies: (1) strict restrictions on the drawing manner, and (2) inability to analyze hybrid or smooth curves and decompose them into primitive shapes.
Their system is divided into two stages: imprecise stroke approximation and post-processing. Approximation classifies strokes into lines, arcs, circles, ellipses, helixes, or a combination of these primitive shapes, while post-processing is applied after the user finishes drawing.
The imprecise stroke approximation phase uses a top-down approach. A stroke is segmented at its highest-curvature point into two parts; feature-area verification against the primitive shapes is then run on each segment to decide whether it can be classified as one of them, and if approximation fails, the segment is recursively decomposed at the point with the next-highest curvature. Curvature is defined as the sum of direction changes from P(n-k) to P(n+k) divided by the path distance between P(n-k) and P(n+k), with k set to 2 (see the sketch below).
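Here is a minimal Python sketch of the curvature measure and the recursive split, assuming a fits_primitive test (standing in for the paper's feature-area verification) is given; the control flow is my own simplification of the loop described above.

import numpy as np

def curvature(points, n, k=2):
    # Sum of direction changes from P[n-k] to P[n+k], divided by the
    # path length over the same window (the definition quoted above).
    seg = np.asarray(points[n - k: n + k + 1], float)
    d = np.diff(seg, axis=0)
    angles = np.arctan2(d[:, 1], d[:, 0])
    turn = np.abs(np.diff(angles))
    turn = np.minimum(turn, 2 * np.pi - turn)    # wrap to [0, pi]
    return turn.sum() / np.hypot(d[:, 0], d[:, 1]).sum()

def approximate(points, fits_primitive, k=2):
    # Top-down segmentation: if the segment passes the primitive test,
    # keep it; otherwise split at the highest-curvature interior point
    # and recurse on both halves.
    if fits_primitive(points) or len(points) <= 2 * k + 1:
        return [points]
    cs = [curvature(points, n, k) for n in range(k, len(points) - k)]
    split = int(np.argmax(cs)) + k
    return (approximate(points[:split + 1], fits_primitive, k)
            + approximate(points[split:], fits_primitive, k))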
If the direction graph fits a horizontal line and the number of points deviating from the least-squares best-fit line (of the direction graph or of the stroke's actual coordinates) falls below a loose threshold, the segment could be a line. If the ratio of the feature area to the candidate line's standard area is less than a strict threshold of 1, it is immediately approximated as a line segment; otherwise the decision is delayed until after arc recognition. Circles are arcs with a direction change of 2*pi. If the direction change of the stroke is greater than 2*pi, it is segmented into parts with direction change 2*pi each, and the system decides whether it is an overtraced circle or a helix (descending or ascending radii). To deal with self-intersecting strokes, two copies of the stroke are made: one segmented using the original procedure (at high-curvature points), the other segmented at self-intersection points, and the better result wins ((1) simpler is better: fewer primitive parts; (2) more specific is better: circles, ellipses, and helixes beat lines and arcs).
Post-processing does simple relation retrieval (parallelism, perpendicularity, etc.), cleanup (tail removal, merging of primitive elements), and basic object recognition (boundaries, rectangles, triangles, arrows, dashed lines, etc.).
The overall accuracy is 98%, while the accuracy for arcs alone is 94%. Hybrid shapes of arcs connected to lines give only 70% accuracy because of the low curvature at smooth connection points.
Discussion
The top-down approach tackles the problem of noise in the curvature and direction graphs. The rules used in the cleanup phase are quite useful: 60%-80% of meaningless elements are removed while meaningful parts are preserved. As for the self-intersection handling, would a spiral be interpreted as circles attached to arcs and lines? I think this may change the topological relations and the meaning of the sketch.
"GLADDER: Combining Gesture and Geometric Sketch Recognition"
by Paul Corey and Tracy Hammond
Summary
This paper discusses an approach to combine Rubine features with the geometric classifier of LADDER.
Modified Rubine Recognizer: In addition to the average feature vector, a covariance matrix is maintained for each class. Input strokes are assigned to a class based on Mahalanobis distance.
LADDER Recognizer: Primitives are recognized by low-level shape tests and are ordered by the hierarchical classifier. The top three are chosen as possible interpretations, with less complex interpretations placed before more complex ones.
Combination: The minimal Mahalanobis distance of a stroke to any Rubine class is computed. If it exceeds a threshold of 35, the stroke is likely a LADDER primitive (average Mahalanobis distance 100) rather than one of the Rubine glyphs (average Mahalanobis distance 24). If it is below the threshold, the Rubine interpretation is put at the top of the fit list; otherwise it is put at the bottom. Context can later be used to rectify an incorrect ordering of fits (a sketch of the decision rule follows below).
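Here is a small Python sketch of that decision rule; the class means, inverse covariance matrices, and the LADDER fit list are assumed inputs, not the paper's actual interfaces.

import numpy as np

def mahalanobis(x, mean, cov_inv):
    # Distance of feature vector x from one Rubine class.
    d = np.asarray(x) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def combined_fit_list(x, rubine_classes, ladder_fits, threshold=35.0):
    # Below the threshold the stroke looks like a Rubine glyph, so the
    # Rubine interpretation heads the fit list; otherwise it trails it.
    dist, name = min((mahalanobis(x, m, ci), cls)
                     for cls, (m, ci) in rubine_classes.items())
    return [name] + ladder_fits if dist < threshold else ladder_fits + [name]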
The accuracies are: modified Rubine 61.9%, LADDER 75.2%, integrated 79.9%. Some errors are due to overlapping classes between Rubine and LADDER (e.g. zero vs. circle). As future work the authors suggest a tiered thresholding system that places Rubine glyphs after the less complex LADDER primitives but before the more complex ones.
Discussion
The combination approach partly solves the problem of mixing characters and shapes in a sketch; it's a step toward "free" or "natural" sketch recognition. The tiered thresholding system might be implemented by dividing LADDER primitives and Rubine classes into groups whose average Mahalanobis distances differ substantially, while the deviation of each class within a group stays small.