Fast Document Rectification and Enhancement

By Jongmin Baek • Aug 16, 2016

Dropbox’s document scanner lets users capture a photo of a document with their phone and convert it into a clean, rectangular PDF. It works even if the input is rotated, slightly crumpled, or partially in shadow—but how?

In our previous blog post, we explained how we detect the boundaries of the document. In this post, we cover the next parts of the pipeline: rectifying the document (turning it from a general quadrilateral to a rectangle) and enhancing it to make it evenly illuminated with high contrast. In a traditional flatbed scanner, you get all of these for free, since the document capture environment is tightly controlled: the document is firmly pressed against a brightly-lit rectangular screen. However, when the camera and document can both move freely, this becomes a much tougher problem.


We would like our scans to be easy to read, no matter the conditions in which they were captured. We define a pleasing scan to have the following properties:
  1. The document itself has been neatly rectified and cropped.
  2. The background of the document is mostly a uniform white, with even illumination.
  3. The foreground text and figures are crisp and visible with high contrast.
Here's an example input and output:

Rectifying a Document

We assume that the input document is rectangular in the physical world, but if it is not exactly facing the camera, the resulting corners in the image will be a general convex quadrilateral. So to satisfy our first goal, we must undo the geometric transform applied by the capture process. This transformation depends on the viewpoint of the camera relative to the document (these are the so-called extrinsic parameters), in addition to things like the focal length of the camera (the intrinsic parameters). Here’s a diagram of the capture scenario:

In order to undo the geometric transform, we must first determine these parameters. If we assume a nicely symmetric camera (no astigmatism, no skew, et cetera), the unknowns in this model are:

  • the 3D location of the camera relative to the document (3 degrees of freedom),
  • the 3D orientation of the camera relative to the document (3 degrees of freedom),
  • the dimensions of the document (2 degrees of freedom), and
  • the focal length of the camera (1 degree of freedom).

On the flip side, the x- and y-coordinates of the four detected document corners give us effectively eight constraints. While there are seemingly more unknowns (9) than constraints (8), the unknowns are not entirely free variables—one could imagine scaling the document physically and placing it further from the camera to obtain an identical photo. This scale ambiguity effectively removes one unknown, leaving a fully constrained system to solve. (The actual system of equations we solve involves a few other considerations; the relevant Wikipedia article gives a good summary.)
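To make the "eight constraints" concrete: the post doesn't spell out its system of equations, but the classic direct linear transform (DLT) for a planar homography has exactly this shape—four corner correspondences give eight linear equations in eight unknowns once one entry of the 3x3 matrix is fixed. Here is a minimal sketch (the corner coordinates are made up for illustration; Dropbox's actual solver also recovers the camera parameters):

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve for the 3x3 homography H mapping src[i] -> dst[i].

    src, dst: four (x, y) corner coordinates each. Fixing the
    bottom-right entry of H to 1 leaves eight unknowns, matched by
    the eight equations from the four correspondences.
    """
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

# Map the unit square to a convex quadrilateral (simulated detected corners).
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
dst = np.array([[10, 12], [200, 20], [190, 260], [15, 240]], float)
H = homography_from_corners(src, dst)
```

Applying `H` to each source corner (in homogeneous coordinates) reproduces the detected corner, which is exactly the constraint being enforced.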

Once the parameters have been recovered, we can undo the geometric transform applied by the capture process to obtain a nice rectangular image. However, this is potentially a time-consuming process: one would look up, for each output pixel, the value of the corresponding input pixel in the source image. Of course, GPUs are specifically designed for tasks like this: rendering a texture in a virtual space. There exists a view transform—which happens to be the inverse of the camera transform we just solved for!—with which one can render the full input image and obtain the rectified document. (An easy way to see this is to note that once you have the full input image on the screen of your phone, you can tilt and translate the phone such that the projection of the document region on the screen appears rectilinear to you.)
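The "look up, for each output pixel, the corresponding input pixel" step is inverse mapping, which is what a GPU's texture sampler does in hardware. A CPU sketch of the same idea (with nearest-neighbor sampling to keep it short; `H` here is assumed to map normalized output coordinates into the input image):

```python
import numpy as np

def warp_to_rectangle(image, H, out_h, out_w):
    """Rectify by inverse mapping: for every output pixel, find the
    corresponding input pixel via the homography H and sample it.
    Nearest-neighbor sampling keeps the sketch short; a GPU texture
    unit would do filtered sampling for free."""
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    # Normalized output coordinates in [0, 1], in homogeneous form.
    pts = np.stack([xs / (out_w - 1), ys / (out_h - 1),
                    np.ones_like(xs, dtype=float)])
    mapped = np.tensordot(H, pts, axes=1)      # 3 x out_h x out_w
    u = mapped[0] / mapped[2]                  # input x for each output pixel
    v = mapped[1] / mapped[2]                  # input y for each output pixel
    ui = np.clip(np.rint(u), 0, image.shape[1] - 1).astype(int)
    vi = np.clip(np.rint(v), 0, image.shape[0] - 1).astype(int)
    return image[vi, ui]

# Sanity check: a homography that maps the unit square onto the whole
# input should reproduce the input exactly.
img = np.arange(12, dtype=float).reshape(3, 4)
H_full = np.array([[3.0, 0, 0], [0, 2.0, 0], [0, 0, 1.0]])
rectified = warp_to_rectangle(img, H_full, 3, 4)
```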

Lastly, recall that there was an ambiguity with respect to scale: we can’t tell whether the document was a letter-sized paper (8.5” x 11”) or a poster board (17” x 22”), for instance. What should the dimensions of the output image be? To resolve this ambiguity, we count the number of pixels within the quadrilateral in the input image, and set the output resolution so as to match this pixel count. The idea is that we don’t want to upsample or downsample the image too much.
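The post doesn't give an exact formula for this, but one way to realize it is to measure the quadrilateral's pixel area with the shoelace formula and pick output dimensions with that area and the recovered aspect ratio (the function and its arguments here are illustrative, not Dropbox's actual code):

```python
import math

def output_dimensions(quad, aspect):
    """Pick an output width/height whose pixel count matches the number
    of input pixels covered by the quadrilateral.

    quad:   four (x, y) corners in order around the quadrilateral.
    aspect: width / height of the document, recovered during rectification.
    """
    # Shoelace formula for the quadrilateral's area in pixels.
    area = 0.5 * abs(sum(x0 * y1 - x1 * y0
                         for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1])))
    height = math.sqrt(area / aspect)
    return round(aspect * height), round(height)

# A 100 x 200 axis-aligned quadrilateral with aspect 0.5 should come
# back out at roughly 100 x 200.
w, h = output_dimensions([(0, 0), (100, 0), (100, 200), (0, 200)], 0.5)
```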

Document enhancement as an optimization

Once we have a rectangular rendering of the document, the next step is to give it a clean and crisp scanned appearance. We can explicitly formulate this as an optimization problem; that is, we solve for the final output image J(x,y) as a function of the input image I(x, y) that satisfies the two aforementioned requirements to the greatest extent possible:

  • The background of the document is mostly a uniform white, with even illumination.
  • The foreground text and figures are crisp and visible with high contrast.

If we could tell whether a given pixel belongs to the foreground or the background, this task would be straightforward. However, assigning a binary label leads to aliasing, especially for text in small fonts. A simple linear transform of the pixel values is not sufficient, either, because there are often shadows or other lighting variations across the image. Hence, we try to compute the final output image J without explicitly solving the foreground/background classification problem.

We achieve the above requirements by writing a cost function that penalizes things we don’t want, and then running it through a standard optimization procedure to arrive at the solution with the lowest cost possible; hopefully, this will correspond to the best possible output image.

The degree to which a potential solution J adheres to the first requirement is fairly straightforward to write as a cost:
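In symbols (reconstructed from the surrounding description):

```latex
E_{\text{white}}(J) = \sum_{x,y} \left( J(x,y) - 255 \right)^2
```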


where 255 denotes white pixels and the indices x, y range over the extent of the image. If the output image is mostly white, this cost is minimized.

For the second requirement, we’d like to ensure that the foreground has a crisp contrast against the background for ease of reading, despite changes in brightness throughout the image. Since we are not explicitly assigning foreground labels, what we need is a way to preserve local structure while factoring out global brightness changes. One common measure of the local structure within an image is its gradient, which denotes the difference between neighboring pixels. Hence, to preserve the local structure, we can use as our cost the degree to which the output gradient deviates from that of the original image:
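In symbols (again reconstructed from the description), this gradient-preservation cost is:

```latex
E_{\text{grad}}(J) = \sum_{x,y} \left\| \nabla J(x,y) - \nabla I(x,y) \right\|^2
```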

Combining the two, we obtain an optimization problem we can tackle:
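With weights k1 and k2 trading off the two terms (these are the coefficients referred to again at the end of the post), the combined problem reads:

```latex
J^* = \operatorname*{arg\,min}_J \; k_1 \sum_{x,y} \left( J(x,y) - 255 \right)^2
      + k_2 \sum_{x,y} \left\| \nabla J(x,y) - \nabla I(x,y) \right\|^2
```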

This yields a well-known system of equations called Poisson’s equation, which is commonly used in computer graphics and physics. It can be solved efficiently via either the conjugate gradient method or the fast Fourier transform. We make use of Apple’s Accelerate framework and open-source template metaprogramming libraries, such as Eigen and our own Lopper, to further accelerate the computation.
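As a concrete (if deliberately naive) sketch of the optimization—not the Accelerate/Eigen implementation—here is the resulting linear system solved densely on a tiny grayscale image, with illustrative weights `k1` and `k2`:

```python
import numpy as np

def enhance(I, k1=0.05, k2=1.0):
    """Minimize k1*sum((J-255)^2) + k2*sum(|grad J - grad I|^2).

    Setting the gradient of the cost to zero gives the linear system
    (k1*Id + k2*L) J = k1*255 + k2*L*I, where L is the graph Laplacian
    of the pixel grid (L = G^T G for the forward-difference gradient G).
    A dense solve keeps the sketch short; real implementations use
    conjugate gradients or FFTs.
    """
    h, w = I.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    L = np.zeros((n, n))
    for a, b in [(idx[:, :-1].ravel(), idx[:, 1:].ravel()),   # horizontal edges
                 (idx[:-1, :].ravel(), idx[1:, :].ravel())]:  # vertical edges
        L[a, a] += 1; L[b, b] += 1
        L[a, b] -= 1; L[b, a] -= 1
    A = k1 * np.eye(n) + k2 * L
    rhs = k1 * 255.0 + k2 * L @ I.ravel()
    return np.linalg.solve(A, rhs).reshape(h, w)

# A flat gray page is pushed all the way to white, since there are
# no gradients to preserve.
J = enhance(np.full((4, 4), 100.0))
```

Note how the two cost terms show up directly: the `k1` term pulls every pixel toward 255, while the `k2` Laplacian term pulls the output's gradients toward the input's.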

Optimizing the optimization

Solving Poisson’s equation on a full-resolution image (8–12 megapixels on the latest iPhones) is still computationally demanding, and can take several seconds on older devices. If the user is creating a multi-page PDF, the wait time increases commensurately. To provide a smoother user experience, we would like to reduce the processing time by an order of magnitude.

One observation is that the output is generally linearly correlated with the input, at least locally—if one were to apply some gain to the input and add an offset, it would be a reasonably good solution locally, i.e.,
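That is, for some scalar gain a and offset b (reconstructing the elided formula):

```latex
J(x, y) \approx a \cdot I(x, y) + b
```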

Of course, the rationale behind using mathematical machinery like Poisson’s equation in the first place was that no single gain and offset works for the whole image. To handle uneven illumination and shadows, however, we can allow the gain and the offset to vary across the image:
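```latex
J(x, y) = a(x, y) \cdot I(x, y) + b(x, y)
```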

While this new formulation is more flexible than before, it has twice as many unknowns (the gain and the offset at each pixel, rather than simply the final output value), making it trickier and more expensive to solve.

The key insight for reducing the computational cost and further constraining the problem is that the gain and the offset should vary relatively slowly across the image—we’re aiming to deal with illumination changes, not rainbow-colored paper! This allows us to solve the optimization problem at a much lower resolution than the input image, and therefore much faster. Because we then upsample the gain and the offset back to the original resolution, this also implicitly forces the unknown values to correlate locally. Once the gain and the offset are known across the image, we can plug them back into the above equation to obtain the final output image.
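The post doesn't detail its low-resolution solver, so here is a deliberately simplified stand-in that captures the structure of the idea: estimate a gain and offset per coarse tile (mapping each tile's bright background toward white while boosting contrast), upsample them, and apply `J = a*I + b`. The tile-max background estimate, the fixed contrast gain, and the nearest-neighbor upsampling are all assumptions of this sketch, not Dropbox's method:

```python
import numpy as np

def enhance_lowres(I, tile=8, contrast=1.3):
    """Coarse gain/offset sketch: per-tile a and b send the local
    background to white with a contrast boost, then are upsampled to
    full resolution. A real implementation would solve for a and b
    jointly at low resolution and interpolate smoothly."""
    h, w = I.shape
    gh, gw = h // tile, w // tile
    tiles = I[:gh * tile, :gw * tile].reshape(gh, tile, gw, tile)
    # Per-tile background estimate (brightest pixel in the tile).
    bg = tiles.transpose(0, 2, 1, 3).reshape(gh, gw, -1).max(axis=2)
    a = np.full((gh, gw), contrast)
    b = 255.0 - a * bg                 # send the local background to white
    # Nearest-neighbor upsampling of the coarse gain/offset fields.
    A = np.kron(a, np.ones((tile, tile)))
    B = np.kron(b, np.ones((tile, tile)))
    return np.clip(A * I[:gh * tile, :gw * tile] + B, 0, 255)

# A uniformly gray page becomes uniformly white; dark "ink" stays dark.
page = np.full((16, 16), 200.0)
page[4, 4] = 50.0
out = enhance_lowres(page)
```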

What about color?

So far our derivations have ignored color, even though most photos come in the form of an RGB image. The simplest way to deal with this is to apply the above algorithm to each of the R, G, B channels independently, but this can result in color shifts, since the channels are no longer constrained together.

In our initial effort to combat this, we tried to substitute in the original RGB values for the output pixels that are not close to white. However, when we tried this, we encountered the effect of color constancy. Here’s a great illustration of this “illusion,” in which the two tiles marked A and B have the same pixel values, but appear to be very different:

In our experiments, the output colors using this simple algorithm would look faded, even though the RGB values were exactly the same as in the input! The reason is that the human visual system judges relative brightness, not absolute brightness; the original colors “pop” against the dull gray of the input, but not against the bright white background of the enhanced image. To deal with this effect, our algorithm converts the image into the HSV color space, and then copies the hue and saturation from the original image wherever appropriate. This results in much better color constancy, as seen below:

Left: the original image. Middle: an enhanced image, with the background becoming white and the foreground preserved in exact R, G, B values. Note that the colors appear faded. Right: an enhanced image that tries to correct for the perceptual discrepancy.
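The hue/saturation transfer can be sketched per pixel with Python's standard `colorsys` module (the function name and the single-pixel interface are illustrative; a real implementation would vectorize over the whole image):

```python
import colorsys

def recombine(original_rgb, enhanced_value):
    """Keep the hue and saturation of the original pixel but take the
    brightness (HSV value) from the enhanced result, so colors read as
    vivid against the new white background.

    original_rgb:   (r, g, b) tuple with channels in [0, 1].
    enhanced_value: enhanced brightness in [0, 1].
    """
    h, s, _ = colorsys.rgb_to_hsv(*original_rgb)
    return colorsys.hsv_to_rgb(h, s, enhanced_value)

# A dull red on gray paper keeps its redness but brightens with the page.
out = recombine((0.55, 0.25, 0.25), 0.9)
```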

Handling (dog-eared) corner cases

So far we have assumed that our document detection and rectification steps (which precede enhancement) work perfectly. While they provide reasonable results on most inputs, sometimes they have to operate on input images that violate our assumptions: in practice, many documents being scanned have edges that are not perfectly straight, and sometimes the corners are dog-eared. As a consequence, the rectified image may include some background that is not part of the document. We detect and segment out these areas via a simple min-cut algorithm, so that the final output is more reflective of the user's intent; the algorithm removes well-delineated dark areas near the borders.

Left: an enhanced document image, without treating the boundaries. Because the document edges may be physically curved or dog-eared, the rectified image may contain some non-document regions. Right: an enhanced document image, with the boundaries treated.

Putting it together

Starting from a document boundary, we have shown how we rectify the image and then enhance it for readability. Our enhancement algorithm exposes a single free parameter that controls the contrast of the output document (essentially the relative weight of the coefficients introduced above, k1 and k2), and we make it adjustable with a user-friendly interface so that you can get the exact appearance you want, as shown here:

Try it out

Try out the Dropbox document scanner today, and stay tuned for our next blog post.

