Tuesday, November 25, 2008
"Multimodal Collaborative Handwriting Training for Visually-Impaired People"
by Beryl Plimmer, Andrew Crossan, Stephen A. Brewster, Rachel Blagojevic
Summary
This paper presents McSig, a multimodal collaborative handwriting training system for visually impaired people.
The teacher can teach visually impaired students how to write in three ways: 1) verbal communication, to give guidance and explain concepts; 2) Playback mode, in which the teacher draws on a Tablet PC and her movement is echoed to the PHANTOM haptic device on the student's side; and 3) Stencil mode, which lets students write shapes themselves while constraining forces guide their movements. Multi-stroke characters and repetitions of single-stroke characters are discerned using a timeout: strokes drawn within a 1-second interval belong to the same character.
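A minimal sketch of that timeout rule, assuming strokes arrive as (start_time, end_time, points) tuples sorted by start time (the representation is my assumption, not McSig's):

```python
def group_strokes_into_characters(strokes, timeout=1.0):
    """Group strokes into characters: strokes whose inter-stroke gap
    is at most `timeout` seconds belong to the same character."""
    characters, current, last_end = [], [], None
    for start, end, points in strokes:
        if last_end is not None and start - last_end > timeout:
            characters.append(current)  # gap too long: start a new character
            current = []
        current.append(points)
        last_end = end
    if current:
        characters.append(current)
    return characters
```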
The system is also audio-aided: the pitch of a sinusoidal tone is mapped to vertical movement and audio pan to horizontal movement. A high pitch indicates that the cursor is near the top of the drawing area, a low pitch that it is near the bottom. Distinct sounds are played at the start and end of the teacher's trajectory.
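The mapping could look roughly like this (the linear mappings and frequency range are illustrative assumptions, not McSig's actual values):

```python
def audio_params(x_frac, y_frac, f_low=220.0, f_high=880.0):
    """x_frac, y_frac in [0, 1], with y_frac = 1 at the top of the area.
    Returns (pitch in Hz, stereo pan in [-1, 1])."""
    pitch = f_low + y_frac * (f_high - f_low)  # near the top -> higher pitch
    pan = 2.0 * x_frac - 1.0                   # -1 = full left, +1 = full right
    return pitch, pan
```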
A PID controller is used to minimise the error between the current value of the system and a target value, so the user is dragged through a close approximation of the trajectory in a smooth and stable manner.
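A generic discrete PID update looks like the following; the gains and the axis-by-axis control loop are assumptions, since the paper does not spell them out:

```python
class PID:
    """Textbook discrete PID controller for one axis of the haptic device."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, target, current, dt):
        error = target - current
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # Output force pulling the end-effector toward the target position.
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```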
The user study consists of a pre-test phase, a training phase, and a post-test phase. Both the partially sighted group and the totally blind group showed progress after using the system. The virtual stencil may not have worked because visually impaired people write with two hands, one holding the pen and the other maintaining spatial orientation on the paper. The system could also be extended to surgery training.
Discussion
The idea of using multiple modalities to aid learning is good, but the sound may be annoying and may distract the user.
Monday, November 17, 2008
"Fluid Sketches: Continuous Recognition and Morphing of Simple HandDrawn Shapes"
by James Arvo, Kevin Novins
Summary
This paper presents an approach to tightly couple gesture recognition with rapid morphing to provide a new and subjectively different form of feedback. The feedback is provided as each pen stroke is drawn, so that immediate adjustments can be made if the result is not what was intended.
A family of ordinary differential equations (ODEs) determines how a user-drawn shape changes with time as a result of continuous recognition and morphing; this parametric ODE is referred to as the fluid sketching equation. Each point is continuously moved toward a corresponding ideal position that reflects the curve the user is attempting to draw. q_s(t) is the new position of p_s at time t, and it is this feedback of q influencing its own subsequent time evolution that leads to modeling q with a differential equation.
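A forward-Euler sketch of this idea, under the simplifying assumption that each point moves straight toward its current ideal position at a rate set by the viscosity (the paper's actual fluid sketching equation is more general):

```python
import numpy as np

def euler_morph_step(points, ideal_points, viscosity, dt):
    """One Euler step of dq/dt = (ideal - q) / viscosity.

    points, ideal_points: (N, 2) arrays of current and target positions.
    Small viscosity snaps quickly to the ideal shape; large viscosity
    leaves the user-drawn stroke essentially unmorphed.
    """
    if viscosity <= 0:
        return ideal_points.copy()  # instantaneous snapping
    return points + (ideal_points - points) * (dt / viscosity)
```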
The recognition stage attempts to classify the user-drawn stroke as a fragment of a known shape (circle, box, line segment). The best fit of the stroke to each shape is calculated; the metric is the sum of squared distances from each sampled point on the curve to the closest-matching shape. The least-squares fit for a circle is found by solving a linear equation, while for a box, relaxation is used: each point exerts a "spring" force on the nearest face or vertex, and gradient descent finds the configuration in which the total net force is nearly zero (the most balanced state).
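One standard linear formulation of the circle fit (the algebraic Kasa fit; whether this is exactly the authors' formulation is my assumption):

```python
import numpy as np

def fit_circle(pts):
    """Least-squares circle through pts (an (N, 2) array).

    Solves the linear system for a, b, c in x^2 + y^2 + a*x + b*y + c = 0,
    then recovers the center (-a/2, -b/2) and the radius.
    """
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x**2 + y**2)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    cx, cy = -a / 2.0, -b / 2.0
    radius = np.sqrt(cx**2 + cy**2 - c)
    return (cx, cy), radius
```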
Matching is based on the entire stroke, but the recent portion is weighted more heavily. The simplest strategy for f is to move each point in the direction of the closest point on the ideal shape. Viscosity v describes how fast a point on the stroke moves toward the ideal shape S: v=0 means instantaneous snapping to the ideal shape, while v=infinity means the user-drawn stroke is retained unmorphed.
In situations where accuracy of placement is less important, the majority of users in this study preferred fluid sketching.
Discussion
The approach introduced in the paper lets the user see the cleaned-up portion of the shape before even finishing the stroke. However, this may distract the user's attention, and distorting the stroke at an early stage may yield a misrecognition.
"Sketch Recognition User Interfaces: Guidelines for Design and Development"
by Christine Alvarado
Summary
This paper addresses both HCI and sketch recognition by introducing a recognition-based diagram creation tool that robustly recognizes naturally drawn diagrams and automatically imports them into PowerPoint. Recognition is done by a multi-domain sketch engine called SketchREAD. Users create diagrams consisting of shapes (ellipses and quadrilaterals) and connectors (lines and arrows) on Tablet PCs using this tool.
To enable seamless interaction, the diagram creation tool and PowerPoint are automatically synchronized to contain the same information. The tool also provides editing capabilities, including move and delete, via purpose-designed pen-based editing commands. Users switch between edit mode and sketch mode by pressing buttons on the window. An online edit mode is introduced to let users sketch while in edit mode: the user holds down the pen in sketch mode to enter online edit mode and selects items; when she lifts the pen she can draw new items, but the selected items remain highlighted, indicating that she can move or delete them with the same gestures as in edit mode. Both recognized (cleaned) strokes and unrecognized (original) strokes appear on the slides.
Design guidelines: (1) Display recognition results only when the user is done sketching; completion is indicated by a change of window focus or explicitly by the user pressing a "show" button. (2) Provide obvious indications distinguishing free sketching from recognition. (3) Restrict recognition to a single domain until automatic domain detection becomes feasible. (4) Incorporate pen-based editing as much as possible (copy, paste, alignment, resizing). (5) Sketching and editing should use distinct pen motions. (6) SkRUIs require large buttons.
Discussion
Without touching the recognition techniques in detail, the paper mainly discusses the UI design and evaluation of the sketch recognition tool. The design guidelines are useful.
The online edit mode may still confuse users; a more straightforward way to switch between editing and sketching, like the trigger-action used in LADDER, would be better.
"Magic Paper: Sketch-Understanding Research"
by Randall Davis
Summary
Two things motivate the desire to enable sketching. First, using a pen is more natural than a keyboard and mouse, and free-form sketching is more intuitive than the shapes CAD programs generate. Second, CAD forces sketchers to make commitments they may not want to make at the early conceptual design stage.
The difficulty of sketch understanding is proportional to the degree of freedom allowed to the user. The difficulties include the following. First, the task is incremental; this lets the system provide continuous feedback about its understanding of the sketch, so the user can make corrections when a misunderstanding arises. In addition, as in any signal interpretation problem, there is noise. Next, the drawing conventions in many domains permit variations, and individual styles also vary, across users and even within a sketch. Another issue is the difficulty of segmentation. Next, there are issues involving overtraced or filled-in shapes. Finally, and perhaps most interesting, the signal is both 2D and nonchronological, in the sense that each object need not be finished before the next is started.
Two basic assumptions ground most work in sketch understanding. First, the work is done in domains where there's a reasonably well-established graphical lexicon and grammar. But this is clearly not universal. Second, much like work in speech understanding, sketch-understanding systems are built for a specific domain.
Finding primitives uses the data's temporal character; by combining curvature and speed, corner finding can be made more precise. For recognizing shapes there are three correspondingly different approaches: (1) how the shape is defined, using LADDER; (2) how the shape is drawn, using Dynamic Bayesian Networks (DBNs); and (3) what the shape looks like, using vision approaches (the bull's-eye descriptor). Sketch understanding can then be connected to various back-end programs (a physics simulator, RationalRose).
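A toy version of the curvature-plus-speed corner-finding idea (the thresholds and the lack of smoothing are assumptions; real segmenters add smoothing and candidate merging):

```python
import numpy as np

def corner_candidates(points, times, curv_thresh=1.0, speed_thresh=0.3):
    """Flag samples that are simultaneously high-curvature and low-speed.

    points: (N, 2) array of pen positions; times: (N,) timestamps.
    Returns indices of candidate corner points.
    """
    pts = np.asarray(points, dtype=float)
    t = np.asarray(times, dtype=float)
    vel = np.gradient(pts, t, axis=0)            # velocity at each sample
    speed = np.linalg.norm(vel, axis=1)
    heading = np.unwrap(np.arctan2(vel[:, 1], vel[:, 0]))
    curvature = np.abs(np.gradient(heading, t))  # turn rate in rad/s
    return np.where((curvature > curv_thresh) & (speed < speed_thresh))[0]
```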
The author also discusses creating a sketch understander for a new domain, the issue of underconstrained and overconstrained definitions, and the description-refinement process using near-miss examples.
Discussion
This paper is an overview of sketch recognition, including the difficulties, common approaches, issues, applications, etc.
"Interactive Learning of Structural Shape Descriptions from Automatically Generated Nearmiss Examples"
by Tracy Hammond, Randall Davis
Summary
This paper describes a visual debugger of shape descriptions using active learning that automatically generates its own suspected near-miss examples.
A near-miss shape is a shape that differs from the initial hand-drawn positive example in only one aspect. Debugging proceeds by classifying near-miss shapes as positive or negative to help correct overconstrained (constraints too tight) or underconstrained (constraints too loose) descriptions.
The approach needs a positive hand-drawn example and a user-typed description that should correctly recognize that shape. A recognizer is then generated from the description. If the known-correct example is not recognized, the description must be overconstrained. It is corrected by first finding the match between the typed description and the drawn shape that has the fewest failed constraints. The system shows those failed constraints in red and asks the developer whether they can be removed to correct the description. If the description was automatically generated, the list of constraints can be quite long, so the authors apply a set of heuristics to prune it.
The initial hand-drawn example and the initial description may both be overconstrained even though the example is recognized by that description, so the system keeps a constraint candidate list: each time a positive example is encountered, any constraint not true of the new example is removed from the list. A list of constraints known to be part of the correct description is also built by examining negative examples: each time a negative example is encountered, the system constructs the list of constraints that are in the candidate list but not true of that negative example. At least one constraint in this list must be part of the correct description, because it is what correctly classifies the new example as negative. For scaling, the developer is asked to indicate the status of each example individually, while for rotation he or she may only say whether or not all of the examples are positive. A constraint is tested by creating a description in which the constraint is replaced by its negation.
An underconstrained description is checked by building a list of possibly missing constraints. As the list can be very long, the authors use a filtering process. Whether a constraint is missing from the description is tested by adding its negation to the description: if the shape in which the constraint holds is a positive example, and the shape in which it does not hold is a negative example, then the constraint is necessary to the shape's concept.
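A self-contained toy of that negation test; constraints here are plain predicates over a shape, `generate_example` stands in for the paper's near-miss generator, and `label` for the developer's positive/negative answer (all names are illustrative, not the paper's):

```python
def constraint_is_necessary(constraint, generate_example, label):
    """constraint: predicate shape -> bool.
    generate_example(pred): returns a shape satisfying pred.
    label(shape): True if the developer calls the shape a positive example.
    """
    shape_with = generate_example(constraint)                      # constraint met
    shape_without = generate_example(lambda s: not constraint(s))  # negated
    # Necessary iff meeting it gives a positive example and
    # violating it gives a negative one.
    return label(shape_with) and not label(shape_without)
```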
Discussion
This paper describes an active-learning approach to debugging LADDER shape descriptions interactively. It is both efficient and necessary, because human testers may never draw the shapes that expose a bug in the description.
Tuesday, October 28, 2008
"Grouping Text Lines in Freeform Handwritten Notes"
by Ming Ye, Herry Sutanto, Sashi Raghupathy, Chengyang Li and Michael Shilman
Summary
This paper presents an optimization method for text line grouping.
Line grouping problem: find the partitioning of the stroke set that optimizes the likelihood of the resulting lines and the consistency and simplicity of their configuration.
Three features measure the likelihood of a line: the linear regression error (eLR) and the maximum inter-stroke distances (Dxmax(l) and Dymax(l)) projected onto the fitting line and its orthogonal direction, respectively.
Configuration consistency: use neighborhood graph - a pair of lines are neighbors if the minimum distance between their fitting line segments falls below a threshold and there are no other objects (lines or drawings) lying between them. Compute each line’s configuration consistency θ(l) as the neighbor-length-weighted sum of the orientation angle difference. Use the weights to encourage lines to be more consistent with longer (usually more reliable) neighbors.
Model complexity: measured by number of lines (π).
Cost function: weighted sum of eLR, Dxmax(l), Dymax(l), θ(l), π. The optimal line grouping is the stroke partitioning that minimizes this cost.
Gradient-descent optimization: starting from an initial solution given by temporal grouping, iteratively generate local alternative hypotheses and accept the one leading to the largest global cost decrease. Temporal grouping partitions strokes based on the fact that most text lines are composed of temporally adjacent strokes. There are two types of alternative hypotheses, merging a pair of line segments and correcting high-configuration-energy errors, corresponding to the two major error categories of the temporal grouping results. A merging hypothesis is generated for each neighboring pair. High-configuration-energy errors are caused by temporally adjacent strokes that belong to different lines; any line segment l whose maximum angle difference from its neighbors exceeds a threshold is an initial candidate. The line configuration and the hypothesis queue are updated iteratively, taking the hypothesis that yields the largest global cost reduction at each step, until the queue is empty. Incremental mode, used for online sketching, yields better results than batch mode (feeding in a full page of strokes at once).
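A sketch of the global cost and the greedy acceptance loop; the weights, the per-line feature dictionary, and the `apply_hypothesis` callable are my assumptions about the data layout, not the paper's:

```python
def global_cost(lines, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of e_LR, Dx_max, Dy_max, theta, and model complexity pi.
    Each line is a dict holding its precomputed feature values."""
    w_fit, w_dx, w_dy, w_theta, w_model = w
    cost = w_model * len(lines)  # complexity term: number of lines
    for l in lines:
        cost += (w_fit * l["e_lr"] + w_dx * l["dx_max"]
                 + w_dy * l["dy_max"] + w_theta * l["theta"])
    return cost

def greedy_optimize(lines, hypotheses, apply_hypothesis):
    """Accept the hypothesis with the largest cost reduction until none helps.
    apply_hypothesis(lines, h) returns the modified line configuration."""
    while hypotheses:
        best = min(hypotheses,
                   key=lambda h: global_cost(apply_hypothesis(lines, h)))
        if global_cost(apply_hypothesis(lines, best)) >= global_cost(lines):
            break  # no remaining hypothesis reduces the global cost
        lines = apply_hypothesis(lines, best)
        hypotheses.remove(best)
    return lines
```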
Accuracy is measured by the recall metric, which is defined as the number of correct lines divided by the number of labeled lines in each page. The average recall rates for PerfectWD and CrudeWD are 0.93 and 0.87 respectively.
Discussion
The features used in grouping (calculating the global cost) are useful, and the iterative gradient-descent algorithm seems to work quite well. But I wonder whether it works equally well on sketches with mixed text and shapes. For example, people may write text on a line, or a table may not have been drawn big enough, so they write through its edges (text cut by shape).
Monday, October 27, 2008
"PerceptuallySupported Image Editing of Text and Graphics"
by Eric Saund, David Fleet, Daniel Larner, James Mahoney
Summary
This paper presents ScanScribe, an image editing program that supports easy selection and manipulation of material in images of sketches.
Image material can be selected by rectangle, lasso, polygon, or clicking on an object. Clicking on an object becomes available once image objects have been established within the program. Whether a rectangle or a lasso is used depends on whether the selection path closes on itself. Polygon selection is activated by a double click.
When image objects are selected, a new bitmap layer is created with the object as foreground over a transparent background. ScanScribe uses a flat lattice for grouping primitive objects, so an object may belong to any number of groups (e.g., both a row group and a column group in a table). Clicking on an object allows switching between groups, with all the other objects in the same group selected. Groups can be created manually or formed automatically after operations on a set of objects, and an object is expelled from a group whenever it is moved far from that group's spatial locus. A set of objects can also be merged into a single object.
Foreground/background separation is done with an iterative technique (high-pass filtering and a distance measure on pixels' HSB values) that estimates and interpolates the color distribution of the light background across the scene.
Handwritten text is replaced by typing into the text entry when the bitmap region containing the handwritten text is selected; the typed text is rendered as a bitmap.
Image recognition occurs in two stages. First, the original undifferentiated bitmap is segmented into primitive image objects by an automatic mechanism. Second, sensible groupings of these primitives are automatically formed and established as groups, or composite objects, in the lattice grouping structure. Stroke primitives are grouped according to rules following the principles of curvilinear smoothness and closure. Blob primitives (text) are grouped into composite objects reflecting words and lines of text based on spatial proximity and curvilinear alignment. Structure recognition is invoked in ScanScribe through an Edit menu item, and can be applied to only a selected region of the image.
Discussion
ScanScribe is an image editing tool that supports offline sketch recognition to some extent, basically grouping and structure, though the text "recognition" is done by manually typing into the text entry. The lattice used for grouping is a good idea, as it supports multiple group labels per object. It would be a promising application if it incorporated more sketch recognition techniques.
Tuesday, October 21, 2008
"Template-based Online Character Recognition"
by Scott D. Connell, Anil K. Jain
Summary
To classify handwritten characters, a recognition system must leverage between-class variations while accommodating potentially large within-class variations, which are composed of two parts: between and within writing styles. The goal is to learn these writing styles automatically so that adequate character models can be built.
Background discusses five categories of online handwriting recognition techniques: (1) primitive decomposition, (2) motor models, (3) elastic matching, (4) stochastic models, (5) neural networks.
Preprocessing uses equi-distant resampling followed by a Gaussian filter. Locations of endpoints and high curvature points are preserved. And characters are normalized to the same height.
A stroke is represented as a string of, on average, 62 events. The distance between two strings involves computing distances between corresponding pairs of events, which requires that the events first be aligned. Each event is represented by x and y relative to the first point of the digit's first stroke, plus the angle of curvature (theta) at the sample point. The distance between two events is a weighted sum of their x, y, and theta distances, and the distance between characters is the sum of distances between aligned events. Neighboring strokes in multi-stroke characters are connected so that a character can be represented as a single string. Skipping points is restricted to prevent 1-to-N or N-to-1 alignments; when a point is skipped, a penalty (spurious or missing) is added to the distance. The distance is the minimum over all possible alignments of the events in the two strings, with a stroke-count difference penalty added to the final distance.
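A dynamic-programming sketch of that alignment distance (a single skip penalty is used here for both spurious and missing events, which is a simplification of the paper's scheme):

```python
import numpy as np

def event_distance(e1, e2, w=(1.0, 1.0, 1.0)):
    """Weighted distance between two (x, y, theta) events."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, e1, e2))

def string_distance(s1, s2, skip_penalty=1.0):
    """Minimal cost over all alignments of two event strings."""
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:  # align event i with event j
                D[i + 1, j + 1] = min(D[i + 1, j + 1],
                                      D[i, j] + event_distance(s1[i], s2[j]))
            if i < n:            # skip an event in s1 ("spurious")
                D[i + 1, j] = min(D[i + 1, j], D[i, j] + skip_penalty)
            if j < m:            # skip an event in s2 ("missing")
                D[i, j + 1] = min(D[i, j + 1], D[i, j] + skip_penalty)
    return D[n, m]
```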
Two methods, cluster centers and training set editing, are used to select representative prototypes. Cluster centers: cluster the data and retain the prototype closest to each center (this can be seen as finding representative examples of writing styles). Training set editing: select only those templates that fall along class boundaries.
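Since the template distance is not Euclidean, one natural reading of "prototype closest to each center" is a medoid per cluster. A sketch given a precomputed distance matrix (how the clusters themselves are formed is left to the clustering step):

```python
import numpy as np

def cluster_prototypes(D, labels):
    """Pick one representative template per cluster.

    D: (n, n) matrix of pairwise string distances between templates.
    labels: (n,) cluster id per template.
    Returns, per cluster, the index of the member with the smallest
    summed distance to the rest of its cluster (the medoid).
    """
    prototypes = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        within = D[np.ix_(idx, idx)].sum(axis=1)
        prototypes.append(idx[within.argmin()])
    return prototypes
```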
Next, the paper introduces nearest-neighbor and decision-tree classification. In constructing a decision tree, a "difference feature" is used (asking "Is character c more similar to reference i or to reference j?").
Accuracy was 86.9% for a 36-class set of alphanumeric characters.
Discussion
To me, "different writing styles" within one class seem to behave just like different classes. The ability to distinguish between writing styles (and treat them as separate classes?) comes from the fact that the training examples are automatically clustered. There are unsupervised clustering algorithms in which the number of clusters K is decided automatically from the data set.
"Sketch Recognition for Computer-Aided Design"
by Christopher F. Herot
Summary
This paper talks about free-hand sketching where inferences can be drawn from user actions.
In the first step, raw data is converted into a list of lines and curves, using speed information to interpret the user's intent. Corners are the intersections of two lines. Curves are fit with B-splines, with smoothness a function of speed. The output of the line-curve finder is then passed to a program that processes overtracing, adding a thickness parameter. The next level of interpretation is performed by the latching program, which eliminates excess points; the latching radius is determined by speed, line length, and the density around each point.
Drawing style is important, and suitable parameters can be determined for describing drawing styles. The sequence of drawing is also important. Parameters can be adjusted through editing operations (if the user repeatedly separates points joined by the latcher, the program reduces the radius).
Discussion
For a paper from the 1970s, it is quite advanced. The adjustment of the latching radius based on speed, length, density, and editing operations is interesting.
Thursday, October 16, 2008
"Ink Features for Diagram Recognition"
by Rachel Patel, Beryl Plimmer, John Grundy, Ross Ihaka
Summary
Previous work uses top-down, bottom-up, and combined approaches to sketch recognition, but few systems support sketches that mix text and shapes. This paper uses formal statistical analysis to identify the key ink features that tell text and shapes apart.
46 features are selected and grouped into 7 categories: size, time, intersections, curvature, pressure, OS recognition values, and inter-stroke gaps. 1519 strokes were collected for statistical analysis to determine whether each feature is significant in distinguishing shapes from text. These features can then be arranged into a decision tree, with the most significant features at the root and each leaf labeled either TEXT or SHAPE. The rpart function is applied to the training data to find the optimal position for a split, the one yielding the minimal number of misclassifications; the features that most accurately split the data into text and shape are the significant ones (a toy stand-in is sketched after the feature list below).
This divider outperforms the InkKit and Microsoft dividers in overall accuracy, though its misclassification rate for text on the new diagram set is a little higher than InkKit's (music notes were a confounding factor).
Significant features are:
1. time till next stroke (inter-stroke gaps)
2. speed till next stroke (inter-stroke gaps)
3. distance from last stroke (inter-stroke gaps)
4. distance to next stroke (inter-stroke gaps)
5. bounding box width (size)
6. perimeter to area (size)
7. amount of ink inside (size)
8. total angle (curvature)
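A Python stand-in for the R rpart analysis, using scikit-learn's CART implementation (closely related to, but not identical with, rpart). The feature names mirror the list above; the data loading is assumed:

```python
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["time_to_next_stroke", "speed_to_next_stroke",
            "dist_from_last_stroke", "dist_to_next_stroke",
            "bbox_width", "perimeter_to_area", "ink_inside", "total_angle"]

def train_divider(X, y):
    """X: (n_strokes, len(FEATURES)) array; y: 'TEXT'/'SHAPE' labels.
    Returns a fitted tree whose top splits use the most significant features."""
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5)
    return tree.fit(X, y)
```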
Discussion
The idea of using rpart to select significant features and organize them into a decision tree is great. The misclassification rate on shapes is twice as high as that on text, meaning that many shapes are classified as text, due mainly to strokes such as music notes.
One thing to note is that for English and some other Latin character sets, capital and lowercase characters may differ on some features. For lowercase characters, curvature divided by bounding box width might be a useful feature (I don't know whether it correlates with curvature plus bounding box width), as text tends to have high curvature relative to bounding box width. However, this feature may not be able to distinguish shapes from most capital characters, such as 'H', 'F', and 'X', especially on a per-stroke basis.
Wednesday, October 15, 2008
"Distinguishing Text from Graphics in On-line Handwritten Ink"
by Christopher M. Bishop, Markus Svensen, Geoffrey E. Hinton
Summary
This paper introduces an approach to classifying strokes as text or graphics (non-text). It combines stroke-based features with temporal information when computing probabilities.
For each stroke, 11 features are extracted. A total least squares (TLS) model is fitted to the stroke; the stroke is then divided into fragments at points of local curvature maxima, and TLS is applied again to the largest resulting fragment. The features include: 1) stroke arc length, 2) total absolute curvature, 3) main direction given by TLS, 4) length-width ratio of the TLS fit, 5) total number of fragments found in the stroke, 6) arc length of the largest fragment, 7) total absolute curvature of the largest fragment, 8) main direction of the largest fragment, and 9) length of the long side of the bounding rectangle of the largest fragment. Features 6-9 are based on the assumption that a large largest fragment (possibly the entire stroke) with a high length-to-width TLS ratio indicates a graphics stroke. A multilayer perceptron (MLP) is trained on these features, and the probability p(tn|xn) is calculated. Since the distribution is biased towards text, adjusting the objective function used for fitting the model addresses the problem.
Next, temporal context is used to achieve better performance. The identities of successive strokes tend to be correlated (text strokes follow text strokes, and graphics strokes follow graphics strokes). p(t1=1) = 0.5467, and a hidden Markov model (uni-partite HMM) for p(tn|tn-1) is built from the training data. Then p(tn|xn) from the discriminative model is combined with p(tn|tn-1).
Finally, gap information is also incorporated (the characteristics of the gap between two text strokes differ from those of other gaps). The gap features are: 1) the logarithm of the difference of the pen-down times of the surrounding strokes, 2) the x and y differences of the pen-down locations of the surrounding strokes, and 3) the x and y differences between the pen-up location of the preceding stroke and the pen-down location of the following stroke. Labels (tn = tn+1) and (tn != tn+1) can thus be assigned (bi-partite HMM).
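A sketch of decoding the combined model with a standard Viterbi pass over the per-stroke probabilities and the transition matrix (the paper's exact combination rule and the bi-partite gap extension are not reproduced here):

```python
import numpy as np

def viterbi_labels(emission_probs, trans, prior):
    """Most likely label sequence (0 = graphics, 1 = text).

    emission_probs: (N, 2) per-stroke p(t_n | x_n) from the MLP.
    trans: (2, 2) matrix with trans[i, j] = p(t_n = j | t_{n-1} = i).
    prior: length-2 vector p(t_1).
    """
    loge, logt = np.log(emission_probs), np.log(trans)
    n = len(loge)
    score = np.log(prior) + loge[0]
    back = np.zeros((n, 2), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + logt + loge[i]  # cand[prev, cur]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = np.zeros(n, dtype=int)
    labels[-1] = int(score.argmax())
    for i in range(n - 1, 0, -1):               # trace back the best path
        labels[i - 1] = back[i, labels[i]]
    return labels
```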
Accuracy on the Cambridge data was 91.49%, and accuracy on the Redmond data was 96.71% (though 22.64% of graphics strokes were classified as text strokes).
Discussion
This is a text vs. shape paper: it classifies strokes as text or graphics based on features of the strokes as well as their context. The results show that the rate of misclassifying graphics strokes as text is relatively high, and the HMM may not help in sketches with frequent text-graphics switches.
Tuesday, October 14, 2008
"MathBrush: A Case Study for Pen-based Interactive Mathematics"
by George Labahn, Edward Lank, Mirette Marzouk, Andrea Bunt, Scott MacLean, and David Tausky
Summary
This paper introduces MathBrush, a pen-based interface supporting mathematical problem-solving tasks that are currently done in computer algebra systems (CAS). It recognizes two-dimensional handwritten mathematical expressions and transforms them into one-dimensional text strings, which can then be passed to the back-end CAS.
MathBrush supports large and complex input and output expressions. It incorporates a context-sensitive pop-up menu based on a cursory analysis of the mathematical expression and the manipulations most likely to be performed (e.g., inverse and rank for matrices). It has an editing-in-place mechanism that allows interaction with CAS output, enabling separate operations on subexpressions. It supports two- and three-dimensional plotting, and the angles and limits of the plots can be adjusted. Users can select a different CAS to check for alternative answers to their problems.
Evaluation methods include think-alouds and semi-structured interviews. Equation entry is a three-step process: users first draw the mathematical equation; they then interact with it in an input validation panel (IVP) to correct any character recognition errors; finally, they render the equation to validate the structural analysis. Extraneous ink, especially dots, resulted in parsing errors. Features like scratch-out and translation (circling and moving terms around) were not used. The correction flow, a separate input validation panel for character recognition followed by a rendering step that presents the structural analysis, caused confusion because users could not perceive how the structural analysis was performed. Mathematical interactions were a strength of the system.
Users tend to modify their writing to fit the tolerances of the recognizer (slowing down, taking care with symbol spacing, etc.).
Discussion
This paper focuses on introducing the UI; recognition details are not mentioned. I wonder whether ambiguity-resolution techniques are used in MathBrush, by which I mean sending the possible interpretations of an expression to the CAS: interpretations that yield errors are unlikely to be what the user meant to input, so they could be eliminated at that point.
Thursday, October 9, 2008
"Sketch-Based Educational Games: 'Drawing' Kids Away from Traditional Interfaces"
by Brandon Paulson, Brian Eoff, Aaron Wolin, Joshua Johnston, Tracy Hammond
Summary
This paper introduces games developed using LADDER. These games help promote kinesthetic and tactile learning in children. The games include memory games, simple physics simulation games, games for learning shapes and geography, as well as a system for providing feedback about sentence diagramming.
User tests on graduate students gave positive feedback. User tests on children were delayed by IRB constraints.
Discussion
Interesting! I hope to see more complicated games in the future (like a real-time strategy game where players can draw their armies and give them instructions).
"Recognizing Free-form Hand-sketched Constraint Network Diagrams by Combining Geometry and Context"
by Tracy Hammond, Barry O’Sullivan
Summary
Constraint programming can be divided very crudely into modeling and solving. Modeling defines the problem: which variables are involved, and what constraints hold on them. Solving finds values for all the variables that simultaneously satisfy all the constraints. While solving can be delegated to embedded algorithms, libraries, and programming languages, a bottleneck exists in the modelling phase. This paper tries to remove this bottleneck by providing a free-hand sketching system.
The system uses a geometric-based approach to allow free-form drawing. It is built using LADDER, and the user interface is generated by GUILD. In the modelling graph of a constraint problem, nodes are CSP variables and edges between variables represent constraints. The set of possible values (the domain) each variable can take is specified by writing a set adjacent to a node. The constraint between two variables is specified by giving a direction on an edge and writing a relation adjacent to the edge. Text can be entered by handwriting, typing, or variable names (t, v, w, x, y, z) handwritten directly onto the diagram. The corresponding constraint satisfaction model is then generated automatically.
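As a rough illustration of what the generated model amounts to (my own Python sketch, not the authors' output format), a sketched graph with two nodes and one directed, labelled edge might translate to:

from itertools import product

# Hypothetical CSP generated from a diagram: nodes x and y with their
# written domains, and an edge labelled with the relation x < y.
domains = {"x": {1, 2, 3}, "y": {2, 3}}
constraints = [("x", "y", lambda a, b: a < b)]

def solve(domains, constraints):
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(rel(assignment[u], assignment[v]) for u, v, rel in constraints):
            yield assignment

print(list(solve(domains, constraints)))  # brute-force enumeration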
Nodes are recognized as ellipses that contain a variable name within them. An undirected link is recognized geometrically as a line, and directed links are recognized as arrows extending between two nodes. The other shapes are variable names (T, V, W, X, Y, Z) and constraints (less than, equal, not equal).
Shapes that appear similar (less than, V, W, arrow head) are distinguished by context. Two lines in each direction (partners) are added by the system to allow a flexible drawing order; for example, an arrow head coinciding with either end of the shaft will be recognized as an arrow. To keep recognition response times short, as soon as shapes are recognized their properties are computed and placed into a hash table. Thresholds are determined by combining absolute (within 10 pixels), relative (5% of stroke length), and perceptual ("coincidence": closer to the endpoint than to the midpoint) rules to give a more accurate representation of the error.
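The paper's exact combination formula isn't in my notes, but a test in this spirit might look like the following Python sketch (the 10-pixel, 5%, and coincidence rules come from the summary above; how they are combined is my guess):

import math

def endpoints_coincide(p, endpoint, midpoint, stroke_len):
    d = math.dist(p, endpoint)
    absolute = d <= 10                       # within 10 pixels
    relative = d <= 0.05 * stroke_len        # within 5% of stroke length
    perceptual = d < math.dist(p, midpoint)  # closer to endpoint than midpoint
    return absolute or (relative and perceptual)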
Accuracy was 100%, but note that (1) the drawing UI gives feedback when shapes are drawn incorrectly, and (2) the test graphs were drawn by the designer of the system.
Discussion
This is an application of LADDER in a specific domain. The benefit of a sketch recognition system over paper-based drawing is evident, in that the sketch can be automatically converted to the model.
Tuesday, October 7, 2008
"Ambiguous Intentions: a Paper-like Interface for Creative Design"
by Mark D Gross, Ellen Yi-Luen Do
Summary
Designers prefer to use paper and pencil because it supports ambiguity, imprecision, and incremental formalization of ideas, as well as rapid exploration of alternatives. Yet computers offer advantages: editing, 3D modeling, and rendering, as well as simulation, critiquing, case bases, and distance collaboration. The Electronic Cocktail Napkin introduced in this paper supports abstraction through end-user programming of graphical rewrite rules, ambiguity by carrying alternative interpretations and using context to resolve unintended ambiguities, and imprecision with constraint-based interactive behavior.
Configurations are high-level structures made up of primitive elements and some constraints (e.g. four boxes around a larger box recognized as a dining table). The program replaces the low-level elements with an abstract configuration. Defining a configuration is a special event, so the Napkin provides a special pattern definition window where it identifies the element types and spatial relations, both graphically and as symbolic expressions. Recognizing configurations depends on context.
Alternative interpretations of a drawing element or configuration are carried until the ambiguity can be resolved by finding additional information elsewhere in the drawing or later in time, or by being explicitly resolved by the user. The ambiguity can be resolved later in one of two ways: the user can identify the element by drawing over it, or the program may find that the glyph is part of a known configuration, in a position that resolves the ambiguity. Contexts map primitive elements to different interpretations; conversely, a single unique element or configuration can often determine the context.
Representing imprecision with constraints: for example, after recognizing a diagram as a graph of nodes and arcs, the Napkin establishes connection constraints. The user can drag the nodes and the graph connectivity remains intact. The Napkin identifies spatial relations among drawing elements, but it must determine which relationships are incidental and which are intended. Only context-relevant constraints are asserted, but users can also add, delete, and modify constraints. It is important to support incremental refinement and formalization of design.
The low-level recognizer identifies the best match(es) in a given context for an input glyph, along with a certainty value (1-5, 1 for high certainty). A glyph is time-delimited: it ends when the user lifts the pen for longer than a certain time-out period, so a glyph can include more than one stroke. A 3x3 grid is inscribed within the bounding box of a glyph, numbered 1-9. The stroke path is described as a sequence; for example, [1 2 3 6 9 8 7 4] represents a clockwise circle or box, and a corner count helps to distinguish shapes with the same pen path. Corners are identified when the change in heading of the segments between raw points exceeds 45 degrees. The pen path, the aspect ratio and size of the bounding box, and the stroke and corner counts are the features of a glyph. A glyph template records a constraint on acceptable values (e.g., a set of possible corner counts {3 4 5}). Templates also contain a slot that indicates which transformations of a glyph are permitted. Matching proceeds as follows: (1) templates that match the input glyph's pen-path sequence and all other features are selected; if only one candidate is found it is assigned certainty 1, and if there are multiple candidates each is assigned certainty 2. (2) The path sequence is permuted to match rotated, backwards, or reflected cases, again assigning 1 or 2. (3) Approximate matches, with sequences that differ by one or two squares, are assigned 3 or 4. (4) If no match can be found, a certainty value of 5 is assigned.
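A minimal Python sketch of the pen-path feature (my reconstruction from the description above, with cells numbered 1-9 left to right, top to bottom):

def pen_path(points):
    xs = [x for x, _ in points]; ys = [y for _, y in points]
    minx, miny = min(xs), min(ys)
    w = max(max(xs) - minx, 1e-9); h = max(max(ys) - miny, 1e-9)
    seq = []
    for x, y in points:
        col = min(int(3 * (x - minx) / w), 2)   # 0..2
        row = min(int(3 * (y - miny) / h), 2)   # 0..2
        cell = 3 * row + col + 1                # 1..9
        if not seq or seq[-1] != cell:          # drop consecutive repeats
            seq.append(cell)
    return seq

# a clockwise box traced from the top-left yields a sequence like
# [1, 2, 3, 6, 9, 8, 7, 4] (plus a final 1 if the stroke closes)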
To identify configurations, the program runs recognizer functions over all pairs of appropriately typed elements in the diagram to look for these patterns. Elements are replaced with a configuration, and the configuration may then be parsed as an element in a larger 'graph' configuration. Each configuration recognizer is a Lisp function that encodes constraints on the types of elements and their spatial relations. The Napkin also provides a text view and a graphic view for adjusting the configuration recognizers.
The Napkin maintains a list of all known contexts, a 'current context', and a 'current context chain' - the search path for contextual recognition. A context has four components: glyph templates, configuration recognizers, spatial relations, and mappings of glyphs from other contexts. The first recognition attempt is made in the current context, but all contexts in the current context chain are consulted. If there are several matches, they are returned as a list of alternatives, in the order their contexts appear in the context chain. The Napkin's initial context is the most general one, which includes only shapes, alphanumeric characters, and lines. As soon as the user draws a glyph or makes a configuration that the Napkin can identify as belonging to a more specific context, the Napkin adjusts the current context and the current context chain.
Discussion
The low-level recognition is feature-based, while the high-level configurations are defined in a way somewhat like that used in LADDER. The combination of context and recognition of the current context is the highlight of this system. The way ambiguity and imprecision are resolved reminds me of Sudoku. Although this system only supports a small number of primitive shapes, the concept of supporting abstraction, ambiguity, and imprecision is good.
"LADDER, a sketching language for user interface developers"
by Tracy Hammond, Randall Davis
Summary
Sketch recognition systems can be quite time-consuming to build if they are to handle the intricacies of each domain. With LADDER, a developer need only write a domain description which describes what the domain shapes look like, and how they should be displayed and edited after they are recognized. The language consists of predefined shapes, constraints, editing behaviors, and display methods, as well as a syntax for specifying a domain description. LADDER allows the developer to specify both hard and soft constraints. LADDER is the first language that can define not only how shapes are to be recognized, but also how shapes are displayed and edited. Triggers and actions allow user-defined editing.
Description limitations: (1) LADDER can only describe shapes with a fixed graphical grammar (drawn with the same graphical components each time). (2) The shapes must be composed solely of the primitive shapes and constraints available in LADDER. (3) LADDER can only describe domains that have few curves, or where the curve details are not important for distinguishing between different shapes. (4) LADDER can only describe shapes that have a lot of regularity and not too much detail.
Shapes are defined by a list of components, their aliases, constraints that define the relationships of the components, editing behaviors (triggers and actions), and display methods (original, clean-up, ideal shape, custom shape). Shapes can be defined hierarchically, and a shape can extend an abstract shape with an is-a section (like extends in Java). Shape groups (abstract shape groups) can be used by the recognition system to provide top-down recognition and ‘‘chain reaction’’ editing behaviors.
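The actual LADDER syntax isn't reproduced in my notes, but as a rough Python analogue (purely illustrative, not the paper's notation), a hierarchical shape description carries roughly this information:

from dataclasses import dataclass, field

@dataclass
class ShapeDef:
    name: str
    is_a: str = "SHAPE"                              # parent, like extends
    components: dict = field(default_factory=dict)   # alias -> shape type
    constraints: list = field(default_factory=list)  # (name, *component refs)

# Hypothetical arrow: a shaft and two head lines meeting at its tip,
# using constraint names listed later in this summary.
arrow = ShapeDef(
    name="Arrow",
    components={"shaft": "LINE", "head1": "LINE", "head2": "LINE"},
    constraints=[
        ("coincident", "shaft.p2", "head1.p1"),
        ("coincident", "shaft.p2", "head2.p1"),
        ("equalLength", "head1", "head2"),
        ("acuteMeet", "head1", "shaft"),
    ],
)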
The language includes the primitive shapes SHAPE, POINT, PATH, LINE, BEZIERCURVE, CURVE, ARC, ELLIPSE, and SPIRAL, also includes predefined shapes built from these primitives including RECTANGLE, DIAMOND, etc. SHAPE (properties: boundingbox, centerpoint, width, and height) is an abstract shape which all other shapes extend.
A number of predefined constraints are included in the language, including perpendicular, parallel, collinear, sameSide, oppositeSide, coincident, connected, meet, intersect, tangent, contains, concentric, larger, near, drawOrder, equalLength, equal, lessThan, lessThanEqual, angle, angleDir, acute, obtuse, acuteMeet, and obtuseMeet. There are also constraints that are valid only in a particular orientation, including horizontal, vertical, posSlope, negSlope, leftOf, rightOf, above, below, sameXPos, sameYPos, aboveLeft, aboveRight, belowLeft, belowRight, centeredBelow, centeredAbove, centeredLeft, centeredRight, and angleL. isRotatable implies that the shape can be found in any orientation.
There are predefined editing behaviors like dragInside. The possible editing actions include wait, select, deselect, color, delete, translate, rotate, scale, resize, rubberBand, showHandle, and setCursor. The possible triggers include click, doubleClick, hold, holdDrag, draw, drawOver, scribbleOver, and encircle.
The keyword vector can be used to specify a variable number of components; the minimum and maximum number of components must be given.
The domain-independent modules determine whether a stroke can be classified as an ELLIPSE, LINE, CURVE, ARC, POINT, POLYLINE, or some combination. Interpretations from the low-level recognizers are produced and sent off to the higher-level recognizer. In order to ensure that only one interpretation is chosen, each shape has an ID, and each shape keeps a list of its subshapes, including its strokes. At any particular time, each subshape is allowed to belong to only one final recognized domain shape. But if the primitive shape recognizer does not provide the correct interpretation of a stroke, the domain shape recognizer will never be able to correctly recognize a higher-level shape using that stroke.
Domain shape recognition is performed by the Jess rule-based system. To improve efficiency, a greedy algorithm is used that removes subshapes from the rule-based system once a final higher level shape is chosen.
If an editing gesture such as click-and-hold or double-click occurs, the system checks to see (1) if an editing gesture for that trigger is defined for any shape, and (2) if the mouse is over the shape the gesture is defined for. If so then the drawing gesture is short-circuited and the editing gesture takes over.
To display a shape’s ideal strokes, the system uses a shape constraint solver (using optimization functions from Mathematica) which takes in a shape description and initial locations for all of the subshapes and outputs the shape with all of the constraints satisfied while moving the points as little as possible.
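The "move the points as little as possible" idea can be sketched with any constrained optimizer; here is a toy Python version using scipy in place of Mathematica (my own example: straightening a nearly horizontal line):

import numpy as np
from scipy.optimize import minimize

drawn = np.array([0.0, 0.1, 5.0, -0.2])  # endpoints as (x1, y1, x2, y2)

objective = lambda p: np.sum((p - drawn) ** 2)             # minimal movement
horizontal = {"type": "eq", "fun": lambda p: p[3] - p[1]}  # require y1 == y2

ideal = minimize(objective, drawn, constraints=[horizontal]).x
print(ideal.round(3))  # the two y-coordinates meet halfway at -0.05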
Domain shape recognizers, exhibitors, and editors are automatically generated during the translation process.
Discussion
This system is quite powerful. Shapes are defined hierarchically, just like in OO languages. User-defined editing behaviors and display modes are also great.
Thursday, October 2, 2008
"What!?! No Rubine Features?: Using Geometric-based Features to Produce Normalized Confidence Values for Sketch Recognition"
by Brandon Paulson, Pankaj Rajan, Pedro Davalos, Ricardo Gutierrez-Osuna, Tracy Hammond
Summary
Gesture-based sketch recognition has the benefit of using mathematically sound classifiers which produce fast classifications along with normalized confidence values, but it has the disadvantage of using feature sets which are user-dependent and require individual training by each user to give good recognition results. Geometric-based techniques classify sketches based on what they actually look like, but inferences about generalization are hard to make because classification is not statistical, and ranking alternative interpretations is difficult. So this paper tries to combine the two approaches.
The 31 features from PaleoSketch and the 13 features from Rubine are fed to a statistical quadratic classifier; they are listed below with their selection frequencies. Sequential forward selection (SFS) is then used to find an optimal feature subset (a generic sketch of the selection loop follows the feature list). Ten folds of SFS are performed, each using the sketches of a randomly chosen half of the users for training and the rest for testing, producing ten optimal subsets. Additional candidate subsets are formed from the features that appeared in 100%, 100%+90%, 100%+90%+80%, ... of those ten subsets. The overall optimal subset contains the features that appear at least 50% of the time in the multi-fold subset selection process. As a result, only one Rubine feature is in the optimal feature subset.
=========FROM PALEOSKETCH===========
1. Endpoint to stroke length ratio (100%)
2. NDDE (90%)
3. DCR (90%)
4. Slope of the direction graph (20%)
5. Maximum curvature (40%)
6. Average curvature (30%)
7. # of corners (30%)
8. Line least squares error (0%)
9. Line feature area error (40%)
10. Arc fit: radius estimate (0%)
11. Arc feature area error (20%)
12. Curve least squares error (90%)
13. Polyline fit: # of sub-strokes (70%)
14. Polyline fit: percent of sub-strokes pass line test (50%)
15. Polyline feature area error (80%)
16. Polyline least squares error (30%)
17. Ellipse fit: major axis length estimate (20%)
18. Ellipse fit: minor axis length estimate (30%)
19. Ellipse feature area error (10%)
20. Circle fit: radius estimate (30%)
21. Circle fit: major axis to minor axis ratio (80%)
22. Circle feature area error (0%)
23. Spiral fit: avg. radius/bounding box radius ratio (60%)
24. Spiral fit: center closeness error (70%)
25. Spiral fit: max distance between consecutive centers (20%)
26. Spiral fit: average radius estimate (10%)
27. Spiral fit: radius test passed (1.0 or 0.0) (40%)
28. Complex fit: # of sub-fits (60%)
29. Complex fit: # of non-polyline primitives (50%)
30. Complex fit: percent of sub-fits that are lines (90%)
31. Complex score / rank (50%)
=========FROM RUBINE===========
32. Cosine of the starting angle (30%)
33. Sine of the starting angle (10%)
34. Length of bounding box diagonal (20%)
35. Angle of the bounding box diagonal (40%)
36. Distance between endpoints (10%)
37. Cosine of angle between endpoints (0%)
38. Sine of angle between endpoints (10%)
39. Total stroke length (20%)
40. Total rotation (100%)
41. Absolute rotation (10%)
42. Rotation squared (10%)
43. Maximum speed (20%)
44. Total time (30%)
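As promised above, here is a generic Python sketch of the SFS loop (not the authors' code; score would be something like the cross-validated accuracy of the quadratic classifier on a candidate subset):

def sfs(features, score, max_size=None):
    # Greedily add the feature that most improves the score, stopping
    # when no remaining feature helps (or the size cap is reached).
    selected, remaining, best = [], list(features), float("-inf")
    while remaining and (max_size is None or len(selected) < max_size):
        s, f = max(((score(selected + [f]), f) for f in remaining),
                   key=lambda t: t[0])
        if s <= best:
            break
        selected.append(f); remaining.remove(f); best = s
    return selected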
Discussion
This paper shows a way to select features, and the experimental method is great. It is kind of like brute force, searching for the combination that yields the highest accuracy.
Wednesday, October 1, 2008
"Backpropagation Applied to Handwritten Zip Code Recognition"
by Y. LeCun, B. Boser, et al
Summary
This paper introduces a neural-network-based approach to recognizing handwritten zip codes.
Training and testing data were collected from a post office in Buffalo, altogether 9298 segmented numerals; 7291 are used for training and the remaining 2007 are saved for testing. All the zip codes are preprocessed by separating digits from their neighbors and then applying a linear transformation to convert them into 16x16 pixel images, with grey levels scaled to the range -1 to 1.
The input of the network is a 16x16 image, and the output is 10 units, one per class. The precise location of a feature is not relevant to the classification, while its approximate location is. In order to detect a particular feature at any location, the same set of weights is used everywhere - "weight sharing". The network is composed of three hidden layers plus an output layer:
H1: 12 feature maps of 8x8 units; each unit takes a 5x5 field of the 16x16 image as input. Units in H1 that are one unit apart have receptive fields that are two pixels apart, so the image is undersampled and some position information is discarded. Connections extending past the boundaries take their input from a virtual background plane whose value is chosen to be -1. Each map uses the same set of 25 weights on all of its 5x5 input fields, but each unit in H1 has its own bias. (19,968 connections and 1068 free parameters)
H2: 12 feature maps of 4x4 units, built on H1 just as H1 is built on the input, except that each unit in H2 combines 8 of the 12 feature maps in H1. The receptive fields are also 5x5, again with identical weight vectors within a map and a separate bias for each unit. (38,592 connections and 2592 free parameters)
H3: 30 units fully connected to the units in H2. (192x30 weights + 30 biases = 5790 connections)
Output: 10 units fully connected to units in H3, adding another 310 weights.
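The connection and parameter counts quoted above check out; here is the arithmetic as a small Python sanity check (my own, mirroring the numbers in the paper):

h1_units = 12 * 8 * 8                        # 12 maps of 8x8
h1_connections = h1_units * (5 * 5 + 1)      # 19,968 incl. bias connections
h1_params = 12 * 25 + h1_units               # 300 shared weights + 768 biases = 1068

h2_units = 12 * 4 * 4                        # 12 maps of 4x4
h2_connections = h2_units * (8 * 25 + 1)     # 38,592 (each unit sees 8 maps)
h2_params = 12 * 8 * 25 + h2_units           # 2400 shared weights + 192 biases = 2592

h3_connections = 30 * h2_units + 30          # 5790 (fully connected + biases)
out_connections = 10 * 30 + 10               # 310

print(h1_connections, h1_params, h2_connections, h2_params,
      h3_connections, out_connections)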
The nonlinear function used at each node was a scaled hyperbolic tangent. The target values for the output units were chosen within the quasilinear range of the sigmoid. The weights were initialized with random values ranging between -2.4/Fi and 2.4/Fi, where Fi is the number of inputs.
The network was trained with backpropagation for 23 passes (167,693 pattern presentations). The error rate was 0.14% on the training data and 5% on the test data; rejecting 12.1% of the test patterns brought the error rate down to 1%. Overall, it outperforms previous work in both accuracy and speed.
Discussion
To be honest, I know little about neural networks. But this one seems to use a stack of functions to undersample sketches by their features, and training makes certain weights bigger, so that an input can follow the path that maximizes the weights to reach the corresponding output. However, this may not work equally well on non-digit sketches. Also worth noting is that the system uses images with grey levels from -1 to 1, while the sketches we are considering are binary images most of the time. Anyway, this is a different solution from the ones we have learned so far.
"Constellation Models for Sketch Recognition"
by D. Sharon and M. van de Panne
Summary
The authors introduce a vision-based sketch recognition approach. It uses a pictorial structure, or constellation, to recognize strokes in sketches of particular classes. The system makes two assumptions: (1) similar parts are drawn with similar strokes; (2) mandatory parts must have exactly one instance in the sketch, while optional parts may have multiple instances.
The constellation model consists of features of individual object parts as well as features of pairs of parts. Individual features capture shape and global position, whereas pairwise features (only computed when at least one part is mandatory) capture relative positions. The sketch recognition process has two phases: (1) search the space of possible mandatory label assignments; (2) search for optional labels for the remaining unlabelled strokes. A probability distribution over these features is used.
The labels are searched using a branch-and-bound tree for a maximum likelihood (ML) search. Mandatory parts are labelled first. A node at depth i of the tree is extended by evaluating all possible assignments of mandatory label i+1 to unlabelled strokes. To speed up the search, multipass thresholding and hard constraints are introduced. If a node's likelihood, as computed from its current partial set of label assignments, is lower than a threshold, that search branch is terminated. Hard constraints (e.g. the nose should be above the mouth) are relationships that must be met for some pairs, because all training examples meet these relationships between their corresponding parts.
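A stripped-down Python sketch of this search (my own reconstruction; it assumes the partial log-likelihood never increases as labels are added, which holds for products of probabilities):

def search(labels, strokes, partial_ll, threshold):
    best = {"ll": float("-inf"), "asg": None}

    def expand(i, asg, used):
        ll = partial_ll(asg)
        if ll < threshold or ll <= best["ll"]:
            return                            # prune this branch
        if i == len(labels):
            best["ll"], best["asg"] = ll, dict(asg)
            return
        for s in strokes:                     # try label i on each stroke
            if s not in used:
                asg[labels[i]] = s
                expand(i + 1, asg, used | {s})
                del asg[labels[i]]

    expand(0, {}, set())
    return best["asg"], best["ll"]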
Discussion
This approach recognizes objects based on the relative locations of the strokes of a sketch. However, the feature vector just uses the bounding boxes of the strokes. This means that, on the one hand, as long as the strokes fit into the corresponding bounding boxes and maintain the specified relationships, the sketch will be recognized as the object no matter how each stroke is drawn; on the other hand, even if the sketch looks the same, if each part is not drawn in the specified way (e.g. a nose drawn in one stroke and a nose drawn in two strokes are not the same), the whole sketch may not be successfully recognized.
"Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches"
by Michael Oltmans
Summary
Chapter 1: Introduction
The approach is based on a representation of a sketched shape in terms of the visual parts it is made of. It is developed to recognize shapes in free sketching, which is different from sketching with feedback. When the goal is to jot something down quickly or explore early design possibilities, a less intrusive style of interaction is preferable. Feedback allows the user to repair recognition errors early on, lets touch-up strokes be cleaned up before the next shape is drawn, and can teach the user what the recognition system is capable of recognizing; freely drawn sketches have none of these advantages.
The principal challenges are:
(1) Shapes vary on both a signal and a conceptual level. (noises and conceptual variation, same thing drawn differently)
(2) Users draw strokes that overtrace previous strokes, and strokes that may not be necessary to depict the shape being drawn. (Overtracing doesn't change the appearance, so a visual approach is well suited.)
(3) Segmenting strokes into groups corresponding to individual shapes is a complex task. (grouping according to the order they were drawn is not sufficient)
Chapter 2: Representation of Visual Parts
The part-based representation is used to train a classifier to distinguish between shapes based on their parts. A visual part is an abstraction of the appearance of a region of the shape (e.g. the top right-hand corner of a rectangle). Each visual part is a circular region (a bullseye) that is subdivided into rings and wedges, like a dartboard. The rings are separated equally in log-space, so that the inner bins are smaller and represent more detailed information about the appearance, while the outer bins contain more general information. A shape context is calculated every 5 pixels along the strokes in the shape.
The histogram boundaries are rotated relative to the interest point's orientation, making the bullseyes rotationally invariant; the bins are rotated by an additional half bin width to prevent the stroke from lying along a boundary. The stroke direction relative to the orientation of the bullseye feature (calculated as the tangent over a window of 19 points, in four orientation bins: 0-45, 45-90, 90-135, 135-180 degrees) is added as a third dimension of the histogram, providing a count of how many points fall into each spatial bin at each direction (so a single bin can distinguish between a vertical and a horizontal line).
Stroke preprocessing: first scale the shape so that the maximum distance between any two points in the shape is 75 pixels. Then resample the points along the strokes to have an approximately constant separation, and interpolate additional points so that no two consecutive points are more than 1 pixel apart.
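A quick Python sketch of the scaling and interpolation steps (my own; the constant-separation resampling step is elided):

import math

def preprocess(points, target=75.0, step=1.0):
    diam = max(math.dist(p, q) for p in points for q in points)
    pts = [(x * target / diam, y * target / diam) for x, y in points]
    out = [pts[0]]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        d = math.dist((x0, y0), (x1, y1))
        for k in range(1, int(d / step) + 1):  # insert points <= step apart
            t = k * step / d
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
        if out[-1] != (x1, y1):
            out.append((x1, y1))
    return out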
The distance between two bullseyes is calculated by comparing their vectors of bin counts: δ(p, q) = Σ_i (p_i − q_i)² / (p_i + q_i). Since the outer bins are less important and usually contain more points, the total weight of each histogram is normalized to 1 to avoid differences arising from the total number of points in a region.
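In code, the normalized histogram distance is a chi-squared-style comparison (the small epsilon to dodge empty bins is my own tweak):

import numpy as np

def bullseye_distance(p, q, eps=1e-12):
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()          # normalize total weight to 1
    return np.sum((p - q) ** 2 / (p + q + eps))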
Chapter 3: Representation of Shapes
Codebook: calculate the collection of bullseye parts for each shape in the training set, then cluster them into groups of similar parts, and use a representative from each cluster center as a codebook entry. A quality threshold (QT) clusterer is used, which forms clusters such that the distance between any two members of a cluster is under a threshold. To improve efficiency, instead of trying each point as a cluster seed, the authors randomly select 20 points as seeds and choose the largest cluster formed from those 20 seeds, and they randomly select 1000 bullseye parts from the training set instead of using all of the calculated parts. The clustering is terminated after finding the 200 largest clusters.
To recognize a sketched shape, it is represented based on the parts it contains, and a classifier is trained to distinguish shapes based on those parts. First, the visual parts are calculated at a sampling of locations that cover the sketched symbol; typically, 50 parts are calculated for a shape. Next, the sketched symbol is represented in terms of the standard vocabulary of parts (the codebook): each part in the codebook is compared to each of the parts calculated from the sketched symbol, giving a distance matrix. A match vector is then calculated by taking the minimum distance in each row of the matrix (a small match value means the codebook part is present in the input shape). Using the match vector representation, a classifier is trained to learn the differences between shape classes and the allowable variations within a shape class.
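The match-vector computation is simple enough to sketch directly in Python (dist could be the bullseye distance above):

import numpy as np

def match_vector(codebook, parts, dist):
    # rows = codebook entries, cols = parts found in the input shape
    D = np.array([[dist(c, p) for p in parts] for c in codebook])
    return D.min(axis=1)  # small entry => that codebook part is present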
The visual parts insulate the shape descriptions from signal noise, both through the (fuzzy) binning and through the way parts are used to describe a shape (the distance measure is low when parts are similar, not only when they are identical).
Chapter 4: Recognition
With a representation of fixed cardinality and ordering corresponding to the codebook entries, a support vector machine (SVM) can be trained to learn a decision boundary that separates typical match vectors of one shape class from those of another. To assign each input shape a single label from the set of possible shape classes, a one-vs-one strategy is used: one classifier is trained to distinguish between each pair of shape classes, giving n(n-1)/2 binary classifiers for n shape classes. For one input shape, the result from each classifier is counted as a vote for the class it predicted, and the shape is assigned the class that received the most votes.
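The voting scheme in a few lines of Python (generic; binary_clf stands for any trained pairwise classifier, e.g. an SVM):

from itertools import combinations
from collections import Counter

def ovo_predict(x, classes, binary_clf):
    # binary_clf(a, b, x) returns whichever of classes a, b it predicts
    votes = Counter(binary_clf(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]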
Shape localization involves several steps: selecting candidate regions, classifying the candidate regions, and combining the classified regions to form the final predictions. Candidate regions are selected by scanning a rectangular window over the input sketch every 10 pixels along the stroke, with several different window sizes. The classifier is trained with an additional WIRE class to identify wire segments. The candidate finder generates many overlapping candidates, so each shape in the image is partially contained in many candidate regions. The algorithm for combining the initial set of candidates into a final set of predictions has two steps that are repeated for each shape class: (1) forming initial clusters of candidates, and (2) splitting clusters that are too large into multiple predictions. The final set of predictions follows a greedy strategy based on the total weight of each cluster. The weight of each cluster is determined by EM (expectation maximization), combining the weights of the individual candidates in the cluster. Highly weighted clusters are generally tightly grouped and have many high-scoring candidates, so clusters with high weights are generally correct, whereas clusters with low weights are less likely to be correct.
Chapter 5: Evaluation
The isolated shape recognizer and the full sketch processor are evaluated separately. On analog circuit drawings the recognition rate is 89.5%; on the HHreco data set of PowerPoint-style shapes the recognition rate is 94.4%. One limitation is difficulty identifying shapes that contain the same low-level features in different combinations. The full sketch processor correctly identified 92.3% of the regions.
Discussion
This paper uses a vision-based approach, which successfully tackles the noise and conceptual variation of shapes that appear in free sketching. The full sketch processor, which identifies the locations and scales of components in a sketch, also has quite high accuracy. But the fuzziness also makes it unable to distinguish between shapes with the same set of parts connected differently. Also, this is an instance-based recognizer: the input must be compared with all the codebook templates, so it slows down quickly as new classes are added, and each time a new class is added all the classifiers must be retrained. For sketches unlike analog circuits, where small components are connected by wires (for example, a sketch with a circle inside a rectangle), this recognizer may not work well.
Summary
Chapter1 Introduction
The approach is based on a representation of a sketched shape in terms of the visual parts it is made of. It is developed to recognize shapes in free sketching, which is different from sketches with feedback. When the goal is to jot something down quickly or explore early design possibilities, a less intrusive style of interaction is preferable. Feedback allows the user to repair recognition errors early on, touch up strokes are cleaned up before next shape is drawn, and feedback can teach the user what the recognition system is capable of recognizing, but freely drawn sketches don't have these features.
The principal challenges are:
(1) Shapes vary on both a signal and a conceptual level. (noises and conceptual variation, same thing drawn differently)
(2) Users draw strokes that overtrace previous strokes, and that may not be necessary to depict the shape being drawn. (doesn't change appearance, so visual approach is well suited)
(3) Segmenting strokes into groups corresponding to individual shapes is a complex task. (grouping according to the order they were drawn is not sufficient)
Chapter2 Representation of Visual Parts
The part-based representation is used to train a classifier to distinguish between shapes based on their parts. A visual part is an abstraction of the appearance of a region of the shape. It represents the appearance of a region of the shape (e.g. the top right hand corner of a rectangle). Each visual part is a circular region (bullseye) that is subdivided into rings and wedges, like a dartboard. The rings are spaced so that they are separated equally in log-space. They are spaced like this so that the inner bins are smaller and represent more detailed information about the appearance, while the outer bins contain more general information. A shape context is calculated
every 5 pixels along the the strokes in the shape.
The histogram boundaries are rotated relative to the interest point's orientation to make it rotationally invariant;the bins are rotated by an additional half bin width to prevent the stroke from lying along the boundary. So the bullseyes are rotationally invariant. The stroke direction relative to the orientation of the bullseye feature (calculating the tangent over a window of 19 points, four orientations: 0-45, 45-90, 90-135, 135-180) is added as a third dimension of the histogram, providing a count of how many points fall into each spatial bin at each direction (a single bin can distinguish between vertical and horizontal line).
Stroke preprocessing: first scale the shape so that the maximum distance between any two points in the shape is 75 pixels. Then resample the points along the strokes to have an approximately constant separation along the stroke. And interpolate additional points so that no two consecutive points have a distance greater than 1 pixel.
The distance between two bullseyes is calculated by comparing their vectors of bin counts: delta(p,q) = sigma(square(pi-qi)/(pi+qi)). Since the outer bins are less important and they usually contain more points, normalize the total weight of the histogram to be 1 to avoid differences arising from the total number of points in a region.
Chapter3 Representation of Shapes
Codebook: calculate the collection of bullseye parts for each shape in the training set, cluster them into groups of similar parts, and use a representative from each cluster center as one of the codebook entries. A quality threshold (QT) clusterer is used, which forms clusters such that the distance between any two members of a cluster is under a threshold. To improve efficiency, instead of trying each point as a cluster seed, the authors randomly select 20 points as seeds and choose the largest cluster formed from those 20 seeds. They also randomly select 1000 bullseye parts from the training set instead of using all of the calculated parts, and terminate the clustering after finding the 200 largest clusters.
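A rough sketch of this randomized clustering (function and parameter names are mine; `dist` could be the bullseye distance above; as a simplification, membership is measured against the seed rather than between all pairs):

```python
import random

def qt_codebook(parts, dist, threshold, n_seeds=20, n_clusters=200):
    """Approximate quality-threshold clustering, as described above.

    parts:     a random sample (e.g. 1000) of bullseye feature vectors
    dist:      pairwise distance function
    threshold: maximum distance allowed within a cluster
    Returns up to n_clusters representative parts (the cluster seeds).
    """
    remaining = list(range(len(parts)))
    codebook = []
    while remaining and len(codebook) < n_clusters:
        best_seed, best_members = None, []
        # Instead of trying every point as a seed, sample a few.
        for seed in random.sample(remaining, min(n_seeds, len(remaining))):
            members = [i for i in remaining
                       if dist(parts[seed], parts[i]) <= threshold]
            if len(members) > len(best_members):
                best_seed, best_members = seed, members
        codebook.append(parts[best_seed])
        remaining = [i for i in remaining if i not in best_members]
    return codebook
```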
To recognize a sketched shape, it is represented based on the parts it contains, and a classifier is trained to distinguish shapes based on those parts. First, the visual parts are calculated at a sampling of locations that cover the sketched symbol; typically about 50 parts are calculated for a shape. Next, the sketched symbol is represented in terms of the standard vocabulary of parts (the codebook): each part in the codebook is compared to each of the parts calculated from the sketched symbol, yielding a distance matrix. A match vector is then calculated by taking the minimum distance in each row of the matrix (a small match value means that codebook part is present in the input shape). Using the match vector representation, a classifier is trained to learn the differences between shape classes and the allowable variations within a shape class.
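In code form, the match vector is just the row minima of the codebook-to-parts distance matrix; a minimal sketch, reusing a distance function like the one above:

```python
import numpy as np

def match_vector(codebook, shape_parts, dist):
    """For each codebook part, the distance to its best match among
    the parts found in the sketched symbol. A small entry means that
    codebook part is present somewhere in the input shape."""
    matrix = np.array([[dist(c, p) for p in shape_parts] for c in codebook])
    return matrix.min(axis=1)   # row minima: one value per codebook entry
```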
The visual parts insulate the shape descriptions from signal noise in two ways: the bins are fuzzy, and the distance measure used to compare parts is low for similar parts rather than requiring them to be exactly the same.
Chapter 4: Recognition
With a representation of fixed cardinality and ordering corresponding to the codebook entries, a support vector machine (SVM) can be trained to learn a decision boundary that separates typical match vectors of one shape class from those of another. To assign each input shape a single label from the set of possible shape classes, a one-vs-one strategy is used: one classifier is trained to distinguish between each pair of shape classes, resulting in n(n-1)/2 binary classifiers for n shape classes. For an input shape, the result from each classifier is counted as a vote for the class it predicted, and the shape is assigned the class that received the most votes.
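A usage-level sketch: scikit-learn's SVC happens to implement exactly this one-vs-one scheme internally (training n(n-1)/2 binary classifiers and voting), so the setup can be approximated directly. The data below is a toy stand-in, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data: one match vector per training shape.
X_train = np.random.rand(30, 200)        # 200 codebook entries
y_train = np.random.randint(0, 5, 30)    # 5 shape classes

# SVC trains one binary classifier per class pair and votes.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
label = clf.predict(np.random.rand(1, 200))
```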
Shape localization involves several steps: selecting candidate regions, classifying the candidate regions, and combining the classified regions to form the final predictions. Candidate regions are selected by scanning a rectangular window over the input sketch every 10 pixels along the stroke, with several different window sizes. The classifier is trained with an additional WIRE class to identify wire segments. The candidate finder generates many overlapping candidates, so each shape in the image is partially contained in many candidate regions. The algorithm for combining the initial set of candidates into a final set of predictions has two steps that are repeated for each shape class: (1) forming initial clusters of candidates and (2) splitting clusters that are too large into multiple predictions. To make a final set of predictions, a greedy strategy based on the total weight of each cluster is followed. The weight of each cluster is determined by EM (expectation maximization), combining the weights of the individual candidates in the cluster. Highly weighted clusters are generally tightly grouped and have many high-scoring candidates; as a result, clusters with high weights are generally correct, whereas clusters with low weights are less likely to be correct.
Chapter 5: Evaluation
The isolated shape recognizer and the full sketch processor are evaluated separately. On analog circuit drawings, the recognizer achieves a recognition rate of 89.5%; on the HHreco data set, which contains PowerPoint-style shapes, it achieves 94.4%. One limitation is that it has difficulty distinguishing shapes that contain the same low-level features in different combinations. The full sketch processor correctly identified 92.3% of the regions.
Discussion
This paper uses a vision-based approach, which successfully tackles the signal noise and conceptual variation of shapes that appear in free sketching. The full sketch processor, which identifies the locations and scales of components in a sketch, also has quite high accuracy. But the fuzziness also makes it unable to distinguish between shapes with the same set of parts connected differently. Also, this is an instance-based recognizer, so the input must be compared with all templates: it slows down quickly as new classes are added, and each time a new class is added, all the classifiers must be retrained. For sketches unlike analog circuits, where small components are connected by wires, this recognizer may not work well (for example, a sketch with a circle inside a rectangle).
"A Domain-Independent System for Sketch Recognition"
by Bo Yu, Shijie Cai
Summary
This paper utilizes low-level geometric features to handle smooth curves and hybrid shapes as well as polylines. Previous sketch recognition systems mainly have these deficiencies: (1) strict restrictions on the drawing manner, and (2) a lack of ability to analyze hybrid or smooth curves and decompose them into primitive shapes.
Their system is divided into two stages: imprecise stroke approximation and post-processing. Approximation classifies strokes into lines, arcs, circles, ellipses, helixes, or a combination of these primitive shapes, while post-processing is applied after the user finishes drawing.
The imprecise stroke approximation phase uses a top-down approach. A stroke is segmented at the highest-curvature point into two parts; then feature-area verification with respect to the primitive shapes is done on each segment to decide whether it can be classified as one of those shapes. If approximation fails, the segment is recursively decomposed at the point with the next highest curvature. Curvature is defined as the sum of changes in direction from P(n-k) to P(n+k) divided by the path distance between P(n-k) and P(n+k), where k is set to 2.
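A minimal sketch of this curvature measure, assuming the stroke is a list of (x, y) points and that n is at least k points from either end (names are mine):

```python
import math

def seg_dir(p, q):
    """Direction of the segment from p to q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def curvature(points, n, k=2):
    """Sum of direction changes from P(n-k) to P(n+k), divided by the
    path length over the same window, as defined above (k = 2)."""
    dirs = [seg_dir(points[i], points[i + 1]) for i in range(n - k, n + k)]
    turn = 0.0
    for d0, d1 in zip(dirs, dirs[1:]):
        delta = (d1 - d0 + math.pi) % (2 * math.pi) - math.pi  # wrap to (-pi, pi]
        turn += abs(delta)
    path = sum(math.dist(points[i], points[i + 1]) for i in range(n - k, n + k))
    return turn / path if path > 0 else 0.0
```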
If the direction graph fits a horizontal line, and the number of points deviating from the least-squares best-fit line of the direction graph (or from the actual coordinates of the stroke) falls below a loose threshold, then the stroke could be a line. If the feature area divided by the candidate line's standard area is less than a strict threshold of 1, the stroke is immediately approximated as a line segment; otherwise, the decision is delayed until after arc recognition. Circles are arcs with a direction change of 2*pi. If the direction change of the stroke is greater than 2*pi, the stroke is segmented into parts with a direction change of 2*pi each, and the system decides whether it is an overtraced circle or a helix (descending or ascending radii). To deal with self-intersecting strokes, two copies of the stroke are made: one is segmented using the original procedure (at high-curvature points), the other at self-intersection points, and whichever gives the better result is kept ((1) simpler is better: fewer primitive parts; (2) more specific is better: circles, ellipses, and helixes are preferred over lines and arcs).
Post-processing does simple relation retrieval (parallelism, perpendicularity, etc.), cleanup (tail removal, merging of primitive elements), and basic object recognition (boundaries, rectangles, triangles, arrows, dashed lines, etc.).
The overall accuracy is 98%, while the accuracy for arcs alone is 94%. Hybrid shapes of arcs connected to lines give only 70% accuracy because of the low curvature at smooth connection points.
Discussion
The top-down approach tackles the problem of noise in the curvature and direction graphs. The rules used in the cleanup phase are quite useful: 60%-80% of meaningless elements are removed while meaningful parts are preserved. As for the self-intersection handling, would a spiral be interpreted as circles attached to arcs and lines? I think this may change the topological relations and the meaning of the sketch.
"GLADDER: Combining Gesture and Geometric Sketch Recognition"
by Paul Corey and Tracy Hammond
Summary
This paper discusses an approach to combine Rubine features with the geometric classifier of LADDER.
Modified Rubine Recognizer: In addition to average feature vectors, class-specific covariance matrices are maintained for each class. Input strokes are assigned to a class based on Mahalanobis distances.
LADDER Recognizer: Primitives are recognized by low-level shape tests and are ordered by the hierarchical classifier. The top three are chosen as possible interpretations, and less complex interpretations are placed before more complex ones.
Combination: The minimal Mahalanobis distance of a stroke to any Rubine class is computed. If it exceeds a threshold of 35, the stroke is likely to be a LADDER primitive (average Mahalanobis distance 100) rather than one of the Rubine glyphs (average Mahalanobis distance 24). If it is below the threshold, the Rubine interpretation is put at the top of the fit list; otherwise, it is put at the bottom. Context can later be used to rectify an incorrect ordering of fits.
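A minimal sketch of this ordering rule; `rubine.classify` and `ladder.interpretations` are hypothetical interfaces standing in for the actual recognizers, not the paper's API:

```python
def combined_fit_list(stroke, rubine, ladder, threshold=35.0):
    """Order the fit list as described above. rubine.classify is
    assumed to return (best_class, mahalanobis_distance), and
    ladder.interpretations the ordered LADDER fits."""
    glyph, m_dist = rubine.classify(stroke)
    fits = list(ladder.interpretations(stroke))
    if m_dist < threshold:
        fits.insert(0, glyph)   # likely one of the Rubine glyphs
    else:
        fits.append(glyph)      # likely a LADDER primitive
    return fits
```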
The accuracies are: Modified Rubine 61.9%, LADDER 75.2%, Integrated 79.9%. Some errors are due to overlapping classes between Rubine and LADDER (e.g. zero and circle). As future work, the authors suggest a tiered thresholding system that places Rubine glyphs after less complex LADDER primitives but before more complex ones.
Discussion
The combination approach partly solves the problem of mixed characters and shapes in a sketch; it's a step toward "free" or "natural" sketch recognition. The tiered thresholding system might be implemented by dividing LADDER primitives and Rubine classes into groups whose average Mahalanobis distances differ markedly, while the deviation of each class within a group stays small.
Tuesday, September 23, 2008
"A Curvature Estimation For Pen Input Segmentation in Sketch-based Modeling"
by Dae Hyun Kim, Myoung-Jun Kim
Summary
This paper discusses the use of curvature to identify segmenting points in strokes. Strokes are first resampled to equally distanced points, so that the curvature C_i at point P_i with support k (the support is the set of adjacent points used for the curvature estimation) can be computed by summing all direction changes D_j within the range i-k to i+k (D_j is the angle between P(j-1)P(j) and P(j)P(j+1)).
Then they define local convexity: extend the support on each side of P_i to a point P_j only if P_j is locally convex with respect to P_i (has the same sign of direction). But local convexity alone sometimes fails to isolate the actual segmenting point, so local monotonicity is introduced: extend the support only if the absolute direction monotonically decreases. The advantages of this method are that (1) there is no complex computation, and (2) it is invariant under rotation and reordering of pen markings (it produces the same result).
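One plausible reading of the support-extension rule, treated symmetrically on both sides (the paper may grow each side independently); all names here are mine:

```python
def extended_support(directions, i, max_k):
    """Grow the support around point i one step at a time, stopping
    when a neighboring direction value breaks local convexity (a sign
    change relative to D_i) or local monotonicity (|direction| no
    longer decreasing away from i). directions[j] is D_j."""
    positive = directions[i] >= 0
    k = 0
    while k < max_k:
        left, right = i - (k + 1), i + (k + 1)
        if left < 0 or right >= len(directions):
            break
        ok = True
        for new, inner in ((left, i - k), (right, i + k)):
            convex = (directions[new] >= 0) == positive
            monotone = abs(directions[new]) <= abs(directions[inner])
            if not (convex and monotone):
                ok = False
        if not ok:
            break
        k += 1
    return k
```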
The segmenting points are the points with local maximum absolute curvature. Experiments show 95% accuracy (correctly found points / actual segmenting points). Speed information can also be added to aid segmenting point detection, and adjacent points with high curvature are merged so that only one is returned as the segmenting point.
Discussion
This paper essentially uses the geometric features of the curve to dynamically adjust the window size (support) of the points used to measure curvature, after the wiggly edges are smoothed by resampling. So the resampling distance should be carefully chosen to both keep the curvature information and smooth out the noise.
"Eliminating False Positives During Corner Finding by Merging Similar Segments"
by Aaron Wolin, Brandon Paulson, Tracy Hammond
Summary
This paper introduces MergeCF, which utilizes a stroke's curvature and drawing speed to find corners in sketches. False positive corners are then removed, and segments that are alike are merged.
Distances: chord distance (Euclidean distance) and path length.
Speed: the speed at Pi is the path length divided by the duration over the neighborhood Pi-1 to Pi+1.
Initial fit: corners are points with high curvature and low speed. Corners that are close together are also checked, and those with the smallest curvature are removed.
Merging segments: corners surrounding the smallest stroke segments are likely to be false positives. Find the smallest segment, then combine it with whichever of its neighbors yields the smaller primitive fit error. No merging is performed if the error after merging is much higher than the sum of the original errors of the two segments.
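One reading of this merging loop as code; `fit_error(a, b)` is an assumed helper returning the error of fitting a single primitive to the substroke between corners a and b, and `tolerance` is my placeholder for the paper's unspecified "much higher error" threshold:

```python
def merge_segments(corners, fit_error, tolerance=1.5):
    """Repeatedly merge the smallest segment into a neighbor, unless
    the merged fit error greatly exceeds the sum of the originals.
    corners: sorted indices into the resampled stroke."""
    corners = list(corners)
    changed = True
    while changed and len(corners) > 2:
        changed = False
        # corner pair bounding the smallest segment
        s = min(range(len(corners) - 1),
                key=lambda i: corners[i + 1] - corners[i])
        # try removing either bounding corner (never the stroke endpoints)
        candidates = [d for d in (s, s + 1) if 0 < d < len(corners) - 1]
        if not candidates:
            continue
        drop = min(candidates,
                   key=lambda d: fit_error(corners[d - 1], corners[d + 1]))
        merged = fit_error(corners[drop - 1], corners[drop + 1])
        original = (fit_error(corners[drop - 1], corners[drop]) +
                    fit_error(corners[drop], corners[drop + 1]))
        if merged <= tolerance * original:
            corners.pop(drop)
            changed = True
    return corners
```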
Experiments using polylines and complex shapes show that MergeCF outperforms Sezgin's and Kim's methods on both "correct corners found" accuracy (0.971 vs 0.822 and 0.783) and all-or-nothing accuracy (0.667 vs 0.298 and 0.194).
Discussion
This algorithm first finds as many potential corners as possible and then eliminates false positives. The important part is how to define the thresholds for "small segment" and "not much higher error".
Monday, September 22, 2008
"PaleoSketch: Accurate Primitive Sketch Recognition and Beautification"
by Brandon Paulson, Tracy Hammond
Summary
In addition to placing few drawing constraints on users, the recognizer introduced in this paper is able to return multiple interpretations and supports more primitive shapes (line, circle, ellipse, arc, curve, spiral, helix). The recognizer is geometric-based. Strokes are pre-recognized before going on to the low-level shape recognizers, beautification, and the subsequent hierarchy.
Pre-recognition begins with removing consecutive points with duplicate position or time. Then the direction graph, speed graph, and curvature graph are computed, and corners are found by 1) going through the points of the sketch and finding all points before the points that fail a line test, 2) merging corners that fall within a neighborhood into their average index, and 3) replacing non-end corners with the points that have the highest curvature in their neighborhood. In addition, two new features are introduced: normalized distance between direction extremes (NDDE = stroke length between the point with the highest direction value and the point with the lowest direction value / total stroke length), and direction change ratio (DCR = maximum change in direction / average change in direction). Curves typically have high NDDE and low DCR, while polylines have low NDDE and high DCR. For long strokes with many points, if the highest curvature lies within the first or last 20% of the stroke, tail removal is performed. Then overtracing and closure are tested.
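A minimal sketch of the two new features, assuming a direction value and a cumulative arc length per resampled point (the names are mine; PaleoSketch computes DCR after trimming the stroke ends, which is omitted here):

```python
def ndde_dcr(directions, arclen):
    """NDDE and DCR from a stroke's direction graph.

    directions: direction value at each resampled point
    arclen:     cumulative path length at each point (same length)
    """
    hi = max(range(len(directions)), key=lambda i: directions[i])
    lo = min(range(len(directions)), key=lambda i: directions[i])
    ndde = abs(arclen[hi] - arclen[lo]) / arclen[-1]

    changes = [abs(b - a) for a, b in zip(directions, directions[1:])]
    avg = sum(changes) / len(changes)
    dcr = max(changes) / (avg + 1e-12)   # guard against a flat graph
    return ndde, dcr
```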
Then the low-level shape tests (line, polyline, ellipse, circle, arc, curve, spiral, helix, and complex test) are performed by computing various features of the stroke, comparing the feature values with thresholds, and calculating errors. The complex test is the default: it recursively divides the stroke into two parts at the highest-curvature point and sends the parts back into the recognizer until all substrokes are recognized as non-polyline primitives. Beautified shapes are returned after the tests.
The hierarchy uses a ranking algorithm to find the best fit among multiple interpretations. Scores are assigned to shapes: line = 1, arc = 3, curve = 5, circle = 5, ellipse = 5, helix = 5, spiral = 5, and the scores of all substrokes are summed. The interpretations are ranked by total score, and interpretations with lower scores rank higher. Lastly, tails are removed, and the substrokes at both ends are removed or adjusted accordingly.
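The ranking reduces to a sort on summed scores; a sketch, assuming each interpretation is given as the list of primitive names of its substrokes (a polyline counted as its component lines):

```python
# Primitive complexity scores from the ranking described above.
SCORES = {"line": 1, "arc": 3, "curve": 5, "circle": 5,
          "ellipse": 5, "helix": 5, "spiral": 5}

def rank_interpretations(interpretations):
    """Lower total score ranks first (simpler fits are preferred)."""
    return sorted(interpretations,
                  key=lambda interp: sum(SCORES[p] for p in interp))

# e.g. one curve (score 5) outranks a three-line polyline fit (score 3)?
# No: the polyline fit wins, since 3 < 5.
best = rank_interpretations([["curve"], ["line", "line", "line"]])[0]
```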
Results show that the accuracy is as high as 99.89%, and for 98.56% of strokes the correct interpretation is the top interpretation. Most of the misclassified strokes would also be misclassified by human recognizers. This low-level recognizer has been integrated into a high-level system, LADDER.
Discussion
In terms of recognition accuracy, this recognizer is amazing, and it supports a handful of primitive shapes, which enables it to recognize quite complex sketches. Tons of thresholds are defined to distinguish minute differences between shapes, and NDDE and DCR are really useful features. The ideas of "continuation strokes" and "invisible corners" seem promising: for a rounded rectangle or polygon, the edges can be moved to pass through the centroid of the relevant substroke while staying parallel to the originally recognized edge.
Sunday, September 21, 2008
"Sketch Based Interfaces: Early Processing for Sketch Understanding"
by T. M. Sezgin, T. Stahovich, R. Davis
Summary
The authors try to provide a system where users can sketch naturally and have their sketches understood, with no constraints on how things are drawn.
Early processing in the system consists of three phases: approximation, beautification, and basic recognition. Approximation finds the representative points of the strokes in the sketch. Beautification modifies the points found in the approximation phase to make the output nicer and more meaningful (showing parallelism, perpendicularity, etc.). Basic recognition tries to interpret strokes as objects (rectangles, circles, etc.). This paper focuses mainly on the approximation approaches. Stroke approximation is divided into vertex detection and curve handling.
The authors do vertex detection by combining the information in curvature and speed, since vertices usually have high curvature and low drawing speed. Fd is the set of all points with local curvature maxima above the average curvature, while Fs is the set of all points with local speed minima below 90% of the average speed. Using speed data alone misses short lines that are not drawn fast enough, and using curvature data alone produces many false positive points due to hand dynamics during freehand sketching, so the idea is to combine them. There are three stages in generating hybrid fits:
1) Computing vertex certainties: (a) the certainty for a curvature candidate Vi is |D(i-k) - D(i+k)| / L, where Di is the curvature at Vi, k is the neighborhood size (not specified), and L is the path distance of the substroke from V(i-k) to V(i+k); (b) the certainty for a speed candidate Vi is 1 - vi/vmax.
2) Generating a set of hybrid fits: H0 is the intersection of Fd and Fs. In successive iterations, Hi' = Hi + Vs (Vs is the best remaining speed-fit candidate) and Hi'' = Hi + Vd (Vd is the best remaining curvature-fit candidate). The goodness metric is the least-squares error (the average of the sum of the squares of the distances from each point in the stroke S to the fit): Ei = 1/|S| * sum ODSQ(s, Hi), where |S| is the stroke length and ODSQ stands for orthogonal distance squared, i.e. the square of the distance from a stroke point to the relevant line segment of the polyline defined by Hi (see the sketch after this list). The errors for Hi' and Hi'' are compared, and the one with the smaller error becomes Hi+1. Repeat until all points in Fd and Fs have been used.
3) Selecting the best fit: set an error upper bound (not specified) and designate as the final fit Hf the Hi with the fewest vertices that also has an error below the threshold.
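The error metric from stage 2 can be sketched as follows; I treat the nearest segment of the polyline as the "relevant" one, which is one plausible reading, and the names are mine:

```python
def odsq(p, a, b):
    """Squared orthogonal distance from point p to segment ab
    (clamped to the segment's end points)."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return (px - ax) ** 2 + (py - ay) ** 2
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) /
                          (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return (px - cx) ** 2 + (py - cy) ** 2

def fit_error(stroke, vertex_indices, stroke_length):
    """E_i = (1/|S|) * sum of ODSQ from each stroke point to the
    polyline defined by the candidate vertices."""
    segs = [(stroke[vertex_indices[k]], stroke[vertex_indices[k + 1]])
            for k in range(len(vertex_indices) - 1)]
    total = sum(min(odsq(p, a, b) for a, b in segs) for p in stroke)
    return total / stroke_length
```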
The second phase in approximation is curve handling. Define L1 as the Euclidean distance between a pair of consecutive vertices found in the vertex detection phase, and L2 as the path length of the sketch between these two points. The ratio L2/L1 is significantly higher than 1 in curved regions of the sketch. The curved regions are approximated with Bezier curves: the end points are consecutive vertices u = Si and v = Sj, and the control points are c1 = K*T1 + v and c2 = K*T2 + u, where K = 1/3 * sum |Sk - S(k+1)| for k from i to j (i.e. one third of the path length of the curve segment), and T1, T2 are unit-length tangent vectors pointing inwards at the ends of the curve segment. The curve is subdivided in the middle if the error of the Bezier approximation is higher than a threshold (not specified), until the desired precision is achieved.
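A minimal sketch of the control-point formulas above; `t1` and `t2` are the inward-pointing unit tangents, which I assume are computed elsewhere:

```python
import math

def bezier_controls(stroke, i, j, t1, t2):
    """Control points for the Bezier fit to the curved region between
    vertices S_i and S_j, following the formulas above."""
    k = sum(math.dist(stroke[m], stroke[m + 1])
            for m in range(i, j)) / 3.0          # one third of path length
    u, v = stroke[i], stroke[j]
    c1 = (v[0] + k * t1[0], v[1] + k * t1[1])    # c1 = K*T1 + v
    c2 = (u[0] + k * t2[0], u[1] + k * t2[1])    # c2 = K*T2 + u
    return u, c1, c2, v
```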
Discussion
The idea of using speed to find candidate vertices had already been stated in Herot's paper over two decades before this paper was written, and on this point the recognition may rely partly on how the users draw; for example, some may draw very carefully at an even speed. The method of choosing vertices by their certainty scores and least-squares errors is the great part of this paper, though many threshold values are left unspecified (I guess some of these thresholds should vary according to the sketch being processed).
The accuracy of object recognition is as high as 96%, but this doesn't mean the approximation has equally high accuracy, for the 96% might as well be attributed to the effect of beautification and the primitives in the system. For example, even if a sketch is not properly approximated, the object recognizer may still correctly classify it as a rectangle, circle, or spring, since it's easy to tell these objects apart.
Saturday, September 20, 2008
"Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line Or Its Caricature"
by David H. Douglas and Thomas K. Peucker
Summary
Finding representative points of lines helps to reduce the storage space needed and the time of other computations that are proportional to the number of points in a caricature. One way to do this is to train computers with human decisions, but there are too many possible ways to represent a single class of feature. Another way is to approximate the points along a line with mathematical functions (the ends of segments are averages of a fixed number of points along the line), but this misses important information in high-curvature parts while keeping too much information in the smoother parts. Some algorithms focus on which points to delete (points closer than a threshold to their neighbors are dropped, points are dropped if the line direction doesn't change much, or all but every Nth point are deleted), while others focus on which points to retain.
The method provided by A.E.G., described by T. Lang in 1969, is similar to the algorithm created by the authors, but its result is unsatisfying where sharp angles are numerous. The authors' algorithm concentrates on selecting points.
An intuitive way to decide whether a series of points forms a line is to find the point furthest from the line segment between the end points and see whether its distance to that line exceeds some threshold. The authors used two methods:
Method 1: 1) assign the start point of the curve as the "anchor" and the last point as the "floater"; 2) find the point with the greatest perpendicular distance to the line segment between anchor and floater, assign it as the new floater, and repeat this step until the furthest distance satisfies the threshold; 3) move the anchor to the floater, assign the last point as the floater, and iterate from step 2.
Method 2: same as Method 1, except the floaters of each iteration are kept on a stack, and the next iteration only chooses floaters from the stack built in the previous iteration. This saves time by avoiding re-examining points.
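The anchor/floater description of Method 1 selects the same points as the familiar recursive formulation of this algorithm, which is easier to write down; a minimal sketch (the tolerance and helper names are mine):

```python
import math

def point_line_dist(p, a, b):
    """Perpendicular distance from p to the infinite line through a, b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    den = math.hypot(bx - ax, by - ay)
    return num / den if den else math.dist(p, a)

def simplify(points, tol):
    """Keep the farthest point if it exceeds the tolerance, then
    simplify both halves recursively."""
    if len(points) < 3:
        return list(points)
    i, d = max(((i, point_line_dist(points[i], points[0], points[-1]))
                for i in range(1, len(points) - 1)), key=lambda t: t[1])
    if d <= tol:
        return [points[0], points[-1]]
    left = simplify(points[:i + 1], tol)
    return left[:-1] + simplify(points[i:], tol)   # drop duplicated split point
```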
Lang's algorithm does line tests on points in the order (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 2, 5)...; the anchor moves to the point before the first point that does not allow all intervening points to satisfy the tolerance. Two modifications to this algorithm are given by the authors. Modification 1 has the anchor move to the furthest point from the segment instead of the point before the floater. Modification 2 avoids unnecessary repeated distance calculations by testing whether the accumulated total (sum of Distance(Pn, P0Pn+1)) is greater than the tolerance.
A comparison of AEG (Lang's algorithm), AEG+Mod1, AEG+Mod2, AEG+Mod1+Mod2, Method 1, and Method 2 shows that Method 2 outperforms the other algorithms.
Discussion
This algorithm is like recursively doing line tests on the curve. But in terms of corner finding, it returns many points that are merely off the line without being corners, for example points on a smooth arc, or points in the middle of the bottom line in "|__|".
Monday, September 15, 2008
"Short Straw: A Simple and Effective Corner Finder for Polylines"
by Aaron Wolin, Tracy Hammond
Summary
Corner finding is the technique of splitting a stroke into primitives, such as lines and arcs. ShortStraw is a simple and effective algorithm for corner finding.
Sezgin et al. use a stroke's curvature and pen speed to determine stroke corners. Kim and Kim use local convexity and local monotonicity, and also use resampling in their algorithm. These algorithms are complex, while ShortStraw is simple and highly accurate.
Implementation of ShortStraw (a rough code sketch of steps 1 and 2 follows the list):
Step 1: resample the points of a stroke to be evenly spaced, so that the path distances between adjacent points are equal. The interspacing distance is the bounding box diagonal / 40 (where 40 is a carefully chosen constant).
Step 2: bottom-up corner finding. A straw at point i is the Euclidean distance between points i-w and i+w (a window size of plus/minus 3 is used); corners are points whose straw is a local minimum below a threshold of 0.95 * the median straw.
Step 3: top-down corner checking to remove false positives and add false negatives. (1) Do line tests between adjacent corners found in Step 2: if Dist(a, b) / Path-Dist(a, b) < 0.95, there are missing corners in between, so relax the threshold and add the point with the local minimum straw to the corner set; repeat until all stroke segments between corners pass the line test. (2) If a triplet of consecutive corners passes the line test, the middle corner is removed. The corners returned are from the resampled points, or alternatively the nearest original point to each corner.
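A minimal sketch of steps 1 and 2 (names are mine; strokes are lists of (x, y) tuples):

```python
import math

def resample(points, n=40):
    """Resample so adjacent points are evenly spaced; the spacing is
    the bounding-box diagonal divided by n."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    spacing = math.hypot(max(xs) - min(xs), max(ys) - min(ys)) / n
    out, dist_left = [points[0]], spacing
    for a, b in zip(points, points[1:]):
        seg = math.dist(a, b)
        while seg >= dist_left:
            t = dist_left / seg
            a = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            out.append(a)
            seg -= dist_left
            dist_left = spacing
        dist_left -= seg
    return out

def straw_corners(pts, w=3, factor=0.95):
    """Corners are local straw minima below factor * median straw."""
    straws = [math.dist(pts[i - w], pts[i + w])
              for i in range(w, len(pts) - w)]
    threshold = factor * sorted(straws)[len(straws) // 2]
    corners = []
    for i, s in enumerate(straws):
        lo, hi = max(0, i - w), min(len(straws), i + w + 1)
        if s < threshold and s == min(straws[lo:hi]):
            corners.append(i + w)    # index back into pts
    return corners
```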
Results show that ShortStraw has a much higher all-or-nothing accuracy (the number of correctly segmented strokes divided by the total number of strokes) of 0.74, compared with less than 0.3 for Sezgin's and Kim's algorithms, and the number of false negatives with ShortStraw is one tenth the number with Sezgin and Kim. The missed corners at obtuse angles could possibly be fixed by using a varied window or straw length for each point, but this may sacrifice the algorithm's simplicity.
Discussion
This is a quite straightforward algorithm, and the accuracy is high. The top-down checking phase further increases accuracy. I think context information could be included in the future; for example, human viewers will perceive a point as a corner in some contexts even when locally it looks more like an arc than like line segments meeting at a corner.
Algorithm Video
"Prototype Pruning by Feature Extraction for Handwritten Mathematical Symbol Recognition"
by Stephen M. Watt, Xiaofang Xie
Summary
Mathematical handwriting differs from other forms of handwriting: (1) the set of possible input symbols is very extensive, (2) the spatial relations among symbols can follow complex context-sensitive two-dimensional rules, and (3) it can involve large operators such as matrix brackets, fraction bars, or square roots. This paper presents approaches to recognizing a large set of mathematical symbols.
Pruning prototypes is an intuitive way to break a large vocabulary into several smaller groups. For an unknown symbol, find the small group to which it belongs, then try to match the symbol in the group. During prototype pruning, the mathematical symbols are grouped according to some features. Then comparison during recognition is performed on groups instead of the whole set of symbols.
Data was gathered using a questionnaire, and handwriting from about 50 people was put into the database. Several pre-processing operations were performed, including selectively chopping the head and tail of strokes, re-sampling, smoothing, and size normalization.
Then some features are introduced:
Geometric Features:
(1) Number of Loops: found by counting minimum-distance pairs - sweep a line parallel to the line between "begin" and "end" (a pair of points on a stroke within a certain distance threshold) in the direction that shortens the distance between the two neighboring intersections with the stroke, until the two intersections are found either to be locally connected or to form a minimum-distance pair.
(2) Number of Intersections: using a modified sweepline algorithm - on finding an intersection, delete the two line segments associated with it, and insert two line segments that begin at the intersection and end at their old ending points.
(3) Number of Cusps: a cusp is formed by a series of five points p0, p1, p2, p3, and p4 such that the angle between p0p1 and p1p2 is > 170 degrees, the angle between p2p3 and p3p4 is also > 170 degrees, but the angle between p1p2 and p2p3 is small.
Ink Related Features
(1) Number of Strokes
(2) Point Density: used to tell apart symbols of similar shape but different ink density, such as "o", "b", or "p".
Directional Features
(1) Initial, End and Initial-End Directions: discretize all directions to the nearest of N, NE, E, SE, S, SW, W, NW.
Global Features
(1) Initial and End Position: NE, NW, SE, SW of the bounding box.
(2) Width to Height Ratio: 0 for a slim symbol, 1 for a wide symbol, and 2 for a normal symbol.
Of these, the number of strokes (exact match), the discretized width-to-height ratio (exact match), the initial position (allowed to differ by one), the initial-to-end direction (allowed to differ by one), and the end direction (allowed to differ by two) were used to prune models. (The models' feature values are pre-computed and stored.)
After pruning, elastic matching was used for the recognition step, calculating the minimum distance between the unknown symbol and a set of models.
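The paper's exact elastic-matching formulation isn't spelled out here, so this sketch uses the standard dynamic-programming (DTW-style) recurrence as a stand-in: each point of the unknown symbol is aligned to a model point, allowing stretch.

```python
def elastic_distance(a, b):
    """Minimum alignment cost between two point sequences a and b,
    where each step pays the squared distance between aligned points
    and may advance either sequence or both."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = ((a[i-1][0] - b[j-1][0]) ** 2 +
                    (a[i-1][1] - b[j-1][1]) ** 2)
            d[i][j] = cost + min(d[i-1][j], d[i][j-1], d[i-1][j-1])
    return d[n][m]
```

The recognized symbol would then be the model in the pruned group with the smallest such distance.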
Experimental results show that although using features to pre-classify lowers accuracy a little (while it remains quite high), the number of comparisons is reduced to a large extent (by approximately 89%), so recognition speed is greatly improved. The authors suggest introducing context and an HMM recognizer to reduce ambiguity.
Discussion
I think the really important part of this paper is the introduction of features. Although only several were used, the authors discussed many. The geometric features seem quite useful for feature-based recognition. For the other features, discrete values are used to ignore small errors; this fuzziness actually improves recognition accuracy. The suggestion about using context is a good idea in mathematical handwriting recognition and perhaps other fields.
Sunday, September 14, 2008
"User Sketches: A Quick, Inexpensive, and Effective way to Elicit More Reflective User Feedback"
by M. Tohidi, W. Buxton, R. Baecker, A. Sellen
Summary
Usability testing (UT) is well known and commonly used to involve users in various stages of the design process. But in the current UT process, participants generate significantly more reactive comments and criticisms than reflective design suggestions or redesign proposals. So user sketching is proposed as the right means for generating reflective feedback, without adding significantly to the cost and without eliminating the active involvement of users in the process.
The iterative process between the sketch and the designer's knowledge/mind allows for increasing improvement of the sketch/design. It provides users with a richer medium for communicating their feedback to designers. The work took place in two stages.
In their first study, they examined the differences that would result between a usability test that exposed users to a single design, versus one where they tried three different alternatives. The product was a touch-sensitive House Climate Control System. Three distinctive prototypes were developed, each reflecting a different design language: Circular, Tabular, and Linear. All three were consistent in terms of fidelity, functionality and quality. Think-aloud protocol and interview give much less reflective feedback than reactive feedback. Participants lack confidence in their own redesign suggestions. The process of thinking and verbalizing one's thoughts at the same time is very distracting.
Then, in the second stage of the study, participants were asked to redesign the system. The results suggest that the less the original design is reflected in the user sketches, the less successful it is, and that the sketches from the multiple-design condition were richer in variation from any of the original designs than those from the three single conditions. Being able to recognize patterns at a glance, or answering unanticipated questions, are unique benefits of sketches over numeric, textual, audio, and video data. Sketching is a cheap and efficient way to enable participants to refine already generated ideas and discover new ones.
Then three examples are given to illustrate a number of ways in which sketching helped the participants to better organize their thoughts, come up with new and unexpected design ideas, reflect on and refine previously stated ideas, and communicate their ideas to the experimenters.
Discussion
This paper illustrates the idea of applying sketching in UT. From the sketch recognition point of view, I think it is quite inspiring. The approach of paper-based sketching could be totally replaced by sketch recognition in the future.
The sentence "It is when sketches are seen in a larger group that patterns emerge and relationships are discovered." in the Discussion suggests that we could use sketch recognition to abstract patterns from multiple sketches drawn by users. This would be a major advantage of sketch recognition over paper-based sketching in the UT process.
Another sentence, in the Future Work section, "one of our longer term objectives is to explore such techniques further in ideation, especially when comparing multiple design alternatives.", gives us a picture of how sketch recognition can facilitate design. For example, the elements in multiple design alternatives could be abstracted and recombined in other forms to generate new designs, or users could give reflective feedback by modifying the original design with sketch recognition instead of drawing their own from the beginning...
some videos:
UML design http://www.youtube.com/watch?v=v3-ozq-ZbHE
LADDER GUI http://www.youtube.com/watch?v=RjwdpB0Y924
GUI design http://www.youtube.com/watch?v=6gVrw1Wgdfg
Monday, September 8, 2008
"Graphical Input Through Machine Recognition Of Sketches"
by Christopher F. Herot
Summary
Traditional computer-aided design systems are more akin to computer-aided evaluation or manipulation. This paper discusses approaches that let machines make inferences about the meaning of sketches as well as the user's attitude toward his design.
The HUNCH system is introduced. It is composed of a set of programs that interpret sketches at different levels. HUNCH was conceived around STRAIT, which found corners in the sketch based on the assumption that drawing speed decreases at corners. Curves are treated as special corners and are recognized by CURVIT. The first version of STRAIT used a latching algorithm that combined endpoints falling close together, which produced some bizarre results, and this led to STRAIT without latching - STRAIN. An experiment on computing the latching radius as a function of speed also showed unsatisfactory results, because it rested on the assumption that the user's certainty about an endpoint's position correlates with the speed of drawing the line, which reflects conditions along the line rather than at its endpoints. Comparing candidates for latching with GUESS in 3D space might solve the problem, but paradoxically, GUESS relies on the latched data as input; this may require that latching make initial decisions and revise them in later stages if they prove untenable. Overtracing detection reduces the quantity of data by turning several overtraced lines into one. 3D projection based on a 2D network of lines, and room finding based on a floorplan, show some success in machine recognition.
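To make the STRAIT idea concrete, here is a minimal Python sketch of speed-based corner finding plus naive fixed-radius latching. This is my own reconstruction, not Herot's code: the function names and the thresholds (the 0.4 speed ratio, the 10-unit latching radius) are hypothetical.

```python
# A minimal sketch (not from the paper) of two HUNCH ideas: STRAIT-style
# corner finding from pen speed, and naive fixed-radius endpoint latching.
import math

def speeds(points):
    """Approximate pen speed at each sampled (x, y, t) point."""
    v = [0.0]
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-6)
        v.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return v

def find_corners(points, ratio=0.4):
    """STRAIT's assumption: drawing speed drops at corners, so flag
    local speed minima that fall below a fraction of the mean speed."""
    v = speeds(points)
    mean_v = sum(v) / len(v)
    return [i for i in range(1, len(v) - 1)
            if v[i] < v[i - 1] and v[i] < v[i + 1] and v[i] < ratio * mean_v]

def latch(endpoints, radius=10.0):
    """First-version latching: merge endpoints that fall within a fixed
    radius of each other (the scheme the paper found too brittle)."""
    merged = []
    for x, y in endpoints:
        for i, (qx, qy) in enumerate(merged):
            if math.hypot(x - qx, y - qy) <= radius:
                merged[i] = ((x + qx) / 2, (y + qy) / 2)  # snap to midpoint
                break
        else:
            merged.append((x, y))
    return merged
```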
A truly successful system would have to make use of context at the lowest level. A general case description is matched against the context-free structure in a top-down fashion to generate a composite structure in which all of the syntactic entities of the sketch are assigned a meaning, but the matching process is complicated.
A more interactive system is promising: one that combines a bottom-up flow of data with a top-down flow of context information coming from the user as well as from high-level programs. The new system consists of a database and a set of programs to manipulate it. A set of functions such as speed and bentness is used to find lines and curves. By varying the number of points in an interval, the machine's interpretation can be manipulated to fit the user's intention as closely as possible. Inference-making procedures run in the background, allowing interpretation on the fly. Latching is decided by certainty factors and should be controlled by a user profile, and the latching radius should adjust dynamically rather than staying fixed.
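As a rough illustration of the interval idea, here is one plausible way a "bentness" function could be computed over a sliding window of points. The paper does not give its exact definition, so the formulation and the `window` parameter below are my assumptions.

```python
# Hypothetical "bentness" measure: total turning within a sliding window.
import math

def direction(p, q):
    """Heading of the segment from p to q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def bentness(points, window=5):
    """Total absolute turning accumulated within each interval of points.

    Peaks suggest corners; broad plateaus suggest curves. Varying
    `window` changes how many peaks the machine "sees", which is the
    knob the paper describes for steering the interpretation.
    """
    scores = []
    for i in range(len(points) - window):
        seg = points[i:i + window + 1]
        headings = [direction(a, b) for a, b in zip(seg, seg[1:])]
        # Wrap each heading difference into (-pi, pi] before summing.
        total = sum(abs(math.atan2(math.sin(b - a), math.cos(b - a)))
                    for a, b in zip(headings, headings[1:]))
        scores.append(total)
    return scores
```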
Discussion
This paper shows an "intelligent" system. It really tries to recognize what the user has drawn, although the speed factor depends purely on the drawing habits of individual users. The idea of bringing context information (as used by a human observer) into the system is great, since it facilitates decision making by assigning meaning to the things being sketched. However, human intervention - context, intentions, etc. - still plays a part in interpretation ("For example, if the program were told that the user has drawn a square, it could vary the size of the interval until the two functions produced exactly five peaks.").
Wednesday, September 3, 2008
"Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes"
by Wobbrock, Wilson, Li
Comment
1. Manoj's blog
Summary
Gesture recognition has been the privilege of experts in pattern recognition and AI, not HCI experts. The $1 recognizer facilitates incorporating gestures into UIs. It supports configurable rotation, scale, and position invariance, with no feature selection or training examples required.
Although toolkits like Artkit, Amulet, SATIN and libraries like Siger are powerful, they cannot help in most new prototyping environments where they are not available.
The algorithm has 4 steps (a code sketch of all four follows the list):
Step 1: Resample the point path into n evenly spaced points (n = 64 is adequate), to compensate for variation in hardware and software sampling rates.
Step 2: Rotate the points so that their indicative angle (the angle between the gesture's centroid (x̄, ȳ) and the gesture's first point) is at 0°.
Step 3: Scale the points so that the resulting bounding box has size × size dimensions, and translate them so that the centroid of the gesture is at (0, 0). For gestures serving as templates, Steps 1-3 are carried out once on the raw input points. For candidates, Steps 1-4 are applied just after the candidate is articulated.
Step 4: Recognize and score. Compare the candidate with each template and calculate the average distance between corresponding points (the path-distance). The template with the least path-distance to the candidate is the recognition result. A score can be calculated as 1 − (path-distance between template and candidate) / (largest possible path-distance, i.e. half the diagonal of the scaled bounding box). Note that Step 2 only finds an approximate best angular alignment between template and candidate, so the candidate may need further rotation (rotating only in the direction that decreases the path-distance, using a Golden Section Search (GSS) bounded by ±45° with a 2° threshold) to find the optimal angle and thus a global minimum path-distance.
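Here is a compact Python sketch of all four steps, following the pseudocode published with the paper. The helper names are mine, and the reference square SIZE = 250 is an assumed value (any constant works).

```python
# A sketch of the $1 recognizer's four steps, adapted from the paper's
# pseudocode. Points are (x, y) tuples.
import math

N, SIZE = 64, 250.0                 # resample count; assumed reference square
PHI = 0.5 * (math.sqrt(5) - 1)      # golden ratio, used by GSS

def path_length(pts):
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def resample(pts, n=N):             # Step 1: n evenly spaced points
    interval = path_length(pts) / (n - 1)
    pts, d, out, i = list(pts), 0.0, [pts[0]], 1
    while i < len(pts):
        seg = math.dist(pts[i - 1], pts[i])
        if seg > 0 and d + seg >= interval:
            t = (interval - d) / seg
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)        # q becomes the start of the next segment
            d = 0.0
        else:
            d += seg
        i += 1
    while len(out) < n:             # guard against floating-point shortfall
        out.append(pts[-1])
    return out

def centroid(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def rotate_by(pts, theta):
    cx, cy = centroid(pts)
    c, s = math.cos(theta), math.sin(theta)
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c) for x, y in pts]

def rotate_to_zero(pts):            # Step 2: indicative angle -> 0 degrees
    cx, cy = centroid(pts)
    return rotate_by(pts, -math.atan2(pts[0][1] - cy, pts[0][0] - cx))

def scale_and_translate(pts):       # Step 3: SIZE x SIZE box, centroid at origin
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w, h = (max(xs) - min(xs)) or 1.0, (max(ys) - min(ys)) or 1.0
    pts = [(x * SIZE / w, y * SIZE / h) for x, y in pts]
    cx, cy = centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def path_distance(a, b):            # average point-to-point distance
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def distance_at_best_angle(cand, tmpl, lo=math.radians(-45),
                           hi=math.radians(45), eps=math.radians(2)):
    # Step 4's refinement: Golden Section Search over the rotation angle.
    x1, x2 = PHI * lo + (1 - PHI) * hi, (1 - PHI) * lo + PHI * hi
    f1 = path_distance(rotate_by(cand, x1), tmpl)
    f2 = path_distance(rotate_by(cand, x2), tmpl)
    while abs(hi - lo) > eps:
        if f1 < f2:                 # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = PHI * lo + (1 - PHI) * hi
            f1 = path_distance(rotate_by(cand, x1), tmpl)
        else:                       # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = (1 - PHI) * lo + PHI * hi
            f2 = path_distance(rotate_by(cand, x2), tmpl)
    return min(f1, f2)

def recognize(raw, templates):      # templates: {name: pts already through Steps 1-3}
    cand = scale_and_translate(rotate_to_zero(resample(raw)))
    best = min(templates, key=lambda k: distance_at_best_angle(cand, templates[k]))
    d = distance_at_best_angle(cand, templates[best])
    return best, 1 - d / (0.5 * math.hypot(SIZE, SIZE))  # score: 1 - d / half-diagonal
```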
Then, experiments are performed that show $1 performs very well compared with DTW and Rubine's algorithm.
Discussion
This algorithm is quite useful to novices and non-specialists; at least the steps really make sense to me. It is a geometry-based algorithm, and it is good in terms of computational speed and of needing no training examples. One thing I don't quite understand is GSS.
