Fast and Accurate Document Detection for Scanning (dropbox.com)
188 points by samber on Aug 9, 2016 | hide | past | favorite | 51 comments


Worked on this exact problem 2-3 years ago (I developed automated document processing in the accounts receivable and accounts payable sector for a decade plus). It's a fun iceberg problem that looks simple on the surface but tends to have some real thorns the deeper down you go.

Document identification like this is unfortunately the "easy" part (and it's not particularly easy to do in real time). The next two steps involve 3D de-deformation, since unlike a flatbed scanner you cannot assume the paper is actually flat -- imagine a previously folded page, etc.

I love this stuff as it sits at a crossroads of half a dozen different disciplines. Lots of money to be had if this can be done in a really robust manner.

Edit:

A couple examples of why this gets really hairy really fast:

* You'll notice that all the documents are shown on a high-contrast background (dark wood grain) without a lot of stark lighting. One of your first steps in edge detection and line identification is image segmentation, to separate background from foreground, and then removing noise. If you have a white piece of paper on a white table, or a large lighting contrast (say, an open window casting daylight on half the page), it really wreaks havoc with the algorithms.

* Imagine you're trying to recognize a page from the middle of a textbook. The way the page lies, you end up with non-rectangular pages (they curve due to the spine), which kills the Hough line transform (there are also Hough circle algorithms, but you get the point) and the rectangle selection.


I remember this SO question from the high-contrast background point you brought up -- http://stackoverflow.com/questions/36982736/how-to-crop-bigg...


Thanks for sharing this, really helpful!


Since I am working on a similar problem myself at the moment, it'd be great if you could share some insights on fixing the 3D deformation -- I imagine fitting a polygon followed by a warp transformation could be an "ideal" process?

On the contrast problem you mention, I found (in the few samples I tested with) that adaptive thresholding seems to be sufficiently good [0].

[0] I am using ``skimage.filters.threshold_adaptive`` for this.
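For what it's worth, a minimal sketch of that approach. Note that ``threshold_adaptive`` was renamed ``threshold_local`` in newer scikit-image releases; the block size and offset below are arbitrary toy values, not tuned recommendations:

```python
import numpy as np
from skimage.filters import threshold_local

def binarize_adaptive(gray, block_size=11, offset=15):
    # threshold_local computes a per-pixel threshold from each pixel's
    # local neighborhood, so a lighting gradient across the page (e.g.
    # daylight falling on half of it) is handled region by region.
    thresh = threshold_local(gray, block_size, offset=offset)
    return gray > thresh

# toy image: bright half, shadowed half, one dark "ink" dot in each
img = np.full((40, 40), 200.0)
img[:, 20:] = 80.0            # simulated shadow on the right half
img[10, 5] = 120.0            # ink in the bright half
img[10, 30] = 30.0            # ink in the shadowed half
binary = binarize_adaptive(img)
```

A single global threshold around 100 would classify the entire shadowed half as ink; the adaptive version keeps the background in both halves and flags only the dark dots.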


On the contrast topic: adaptive thresholding can be very helpful (I believe Bradley local thresholding was one I had particular success with); however, most of these algorithms work in the grayscale domain, which means they are dependent on which color-to-grayscale transformation is used[1]. I spent a long time researching full-color algorithms but never got to a truly successful end result with them. And even if you get a good image with huge contrast, you will still end up with the actual light/dark transition looking like an edge.

On 3D deformation, you're officially in academic-research land. Nearly all algorithms require a solid guess as to the aspect ratio of the target object. Other algorithms use heuristics based on what you expect to find on a page. One particularly fun algorithm used the baseline of text (I believe for that paper it was Arabic) and fit a high-order curve to it, which was then reversed. Unfortunately I haven't seen a truly generic approach that doesn't require an implementation-specific input.

[1] Frankly, my feeling is that RGB-to-grayscale conversion is a mistake and is holding back many of these algorithms.
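To illustrate the baseline idea from the parent comment: hypothetically, if you can detect (x, y) points along a line of text, fitting a polynomial and subtracting the fitted curve straightens that baseline. This is purely a sketch of the concept; real dewarping systems derive a full image warp from the fitted curve rather than moving only the baseline points:

```python
import numpy as np

def straighten_baseline(xs, ys, degree=3):
    """Fit a polynomial to detected text-baseline points and remove
    the curvature, mapping the baseline back to a horizontal line."""
    coeffs = np.polyfit(xs, ys, degree)
    fitted = np.polyval(coeffs, xs)
    # subtract the fitted curve, keeping the average height
    return ys - fitted + fitted.mean()

# a synthetic baseline sagging toward a book spine
xs = np.linspace(0, 500, 50)
ys = 100 + 0.0004 * (xs - 250) ** 2
flat = straighten_baseline(xs, ys)
```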


Yep, we converted RGB into LUV space before extracting edges, which helps a lot with contrast and keeps essential edge information that could have been lost in a grayscale conversion.

Agreed that 3D deformation is a difficult open problem, and we haven't gotten into it yet. Currently we assume the document is a flat rectangle, which maps to a quadrilateral in image space. A homography is then applied to rectify it, and this seems to work quite well even if the paper is slightly curved or folded.
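For the curious, the flat-rectangle assumption boils down to estimating a homography from the four detected corners. A minimal pure-NumPy sketch via the standard DLT construction (the corner coordinates below are made up for illustration):

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography mapping the src quad onto the dst
    quad, using the direct linear transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # the homography is the null vector of A (smallest singular value)
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_homography(H, pt):
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

# hypothetical detected corners of a perspective-skewed page...
quad = [(120, 80), (940, 110), (900, 700), (90, 660)]
# ...rectified to an upright A4-proportioned rectangle
rect = [(0, 0), (595, 0), (595, 842), (0, 842)]
H = homography_from_points(quad, rect)
```

In practice you would reach for OpenCV's getPerspectiveTransform/warpPerspective rather than hand-rolling the estimation; the sketch just shows what those calls compute.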


Excellent. It's a little funny how, when you take on problems like these, you end up becoming an expert in fields you never thought you'd have to play in, like color spaces, color perception theory, etc.

Great work, and I look forward to seeing future posts on the solutions you've been able to come up with!


Yeah, I got a serious education doing this for mail items. And I had it easier, as I was able to control the background, lighting, camera, and everything.

Well, I couldn't control the autofocus very well; going from a $500 DSLR to a $1200 DSLR made HUGE gains, since it had far, far more autofocus points.

I was really interested in the text output of the OCR that I later did (which was a treat in itself since mail has so many different fonts, even on the same item!). I learned a lot about a lot of things too.


I have found colorspace transformation to be an important factor as well. My current problem doesn't require fixing 3D deformation, but I find it a really interesting area that I'd like to work on in the future.

Thanks for this additional information, much appreciated!


For the 3D deformation take a look at this part of OpenCV:

http://docs.opencv.org/2.4/doc/tutorials/features2d/feature_...


Hi everyone, this is Ying Xiong from Dropbox, and I'm the author of the blog post. Feel free to let me know if you have any questions, comments, or suggestions.

Hope you enjoy the post, and stay tuned, as we have more posts coming in the next few weeks about other parts of our scanning feature.


Could you elaborate more on the edge detector? It seemed a bit of a contradiction to go from:

> We decided to develop a customized computer vision algorithm that relies on a series of well-studied fundamental components, rather than the “black box” of machine learning algorithms such as DNNs.

To:

> To overcome these shortcomings, we used a modern machine learning-based algorithm. The algorithm is trained on images where humans annotate the most significant edges and object boundaries. Given this labeled dataset, a machine learning model is trained to predict the probability of each pixel in an image belonging to an object boundary.

This seems like a crucial step in the algorithm and sounds exactly like a black box DNN...


The learning algorithm we used is not a neural network trained in an end-to-end fashion. Instead, it is a local prediction model that takes an input image patch and produces a patch of the same dimensions, with the probability of each pixel belonging to a document boundary. Those per-patch predictions are then aggregated to reduce variance, resulting in an edge map of the same dimensions as the input image.
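A rough sketch of the aggregation step being described: slide a window over the image, get a per-patch probability map from some predictor, and average the overlapping outputs. The predictor below is a constant placeholder, not Dropbox's actual model, and the patch/stride sizes are arbitrary:

```python
import numpy as np

def aggregate_patch_predictions(image, predict_patch, patch=16, stride=4):
    """Average overlapping per-patch probability maps into one edge
    map of the same size as the input image."""
    h, w = image.shape
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            acc[y:y + patch, x:x + patch] += predict_patch(
                image[y:y + patch, x:x + patch])
            cnt[y:y + patch, x:x + patch] += 1
    # averaging the overlapping predictions reduces per-patch variance
    return acc / np.maximum(cnt, 1)

# placeholder predictor: constant 0.5 edge probability everywhere
edge_map = aggregate_patch_predictions(
    np.zeros((32, 32)), lambda p: np.full_like(p, 0.5))
```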


What is a patch in your case? Are you running a sliding window over the image, or tiling it? And are you marking each pixel as belonging to the edge of a document, or marking detected edges as valid document boundaries? Also, how do you model the links between the four sides? A reference to a paper or a follow-up blog post would be greatly appreciated.

Great work. Laurent


Ah ok, thanks! Do you have a paper/reference for this (I guess you have a proprietary implementation though)?

As the sibling comment says, this sounds like a good random forest problem: you pass in a load of patches labelled with ground truth and let the classifier give you a probability for each pixel?
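If it helps, here is what that looks like with scikit-learn's RandomForestClassifier on toy data: flattened patches as feature vectors, a binary edge label, and predict_proba for the per-pixel probability. The "dark center pixel" labeling rule is purely illustrative, not what any real detector uses:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 9))          # 200 flattened 3x3 grayscale patches
y = (X[:, 4] < 0.3).astype(int)   # toy ground truth: dark center pixel

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
edge_prob = forest.predict_proba(X)[:, 1]  # probability of "edge"
```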


I believe the algorithm he's using is Random Forest -- not exactly a black-box DNN, but close enough :)


> To overcome these shortcomings, we used a modern machine learning-based algorithm. The algorithm is trained on images where humans annotate the most significant edges and object boundaries.

Does anyone know which "modern machine-learning algorithm" they are referring to here? Is there something like this available in OpenCV?


We can reasonably assume it's nothing more complicated than what you can do with a combination of machine learning libraries and OpenCV (if they instead have some new technique, I hope to see a paper from them in a few months :) ).

EDIT: Adding more details.

If you are looking for similar ideas, you should read papers in the area of object-class segmentation / classification[0][1][2], and generic supervised learning.

[0] https://arxiv.org/abs/1510.03727

[1] https://www.microsoft.com/en-us/research/publication/object-...

[2] https://www.ais.uni-bonn.de/papers/DAGM_NC2_2011_Schulz.pdf


Yup, there isn't any ML built into OpenCV, but perhaps they use an ML library on top of OpenCV.


OpenCV doesn't have any ML algorithms built in that I know of, but the article is pretty vague there, eh? Either way, document scanning from a phone camera is no picnic.

I tried a little while ago. My memory is kind of hazy, but depending on how well you do the image transformation (automatically[ish] skewing to a rectangle, etc.), image quality can get poor. Then you have to do the actual OCR. Right now the only complete OSS solution is Tesseract, and it's not state-of-the-art. There's also ocrpy, but it's more of a toolkit and its model needs to be trained (on single-line text, when I last checked). So yeah, it's fairly hard to do.



Obviously it is a convolutional neural net. Here you can find source code for one of the latest works:

https://github.com/s9xie/hed


Not so obvious. In fact, I believe they are using Random Forest.


Good point


Hi, they are using Random Forest to get the edges :)


If you want this on Android, there are a couple of good apps: Office Lens from Microsoft, CamScanner, and Scanbot. Office Lens is really good for scanning, but other parts of the app are not very polished.


You can also use the Google Drive app: touch the plus button and then Scan. It works decently.


On iOS I've been using Scanner Pro for years, and it's worked very well for all manner of things, from receipts to notes taken during classes to other special papers.


Great overview. I have a soft spot in my heart for the Hough transform, so I love seeing anyone mention it and actually use it.

I wonder if anyone from Dropbox could go into more detail about the technical aspects? This sounds like the perfect thing to build as a C/Rust/whatever embedded library so you can share it with an Android app later on, is that what happened here or is this all in Swift/Obj-C?


Glad you liked the post!

Yeah, the Hough transform is definitely a time-tested algorithm that embodies both elegance and efficacy. I truly love it.

On the technical side, we wrote the detection library in C++ so that it can be easily ported across platforms. For the iOS app, we simply integrated it via Objective-C++.


I was inspired by a few examples I saw online to make this Hough transform visualizer -- check it out! https://liquiddandruff.github.io/hough-transform-visualizer/


I have yet to find an app that solves this problem (edge detection) well enough for me. It's like 50/50 with Genius Scan, and Dropbox maybe manages to recognize edges correctly 60% of the time. I think they should have dared to go down the deep learning route.


Indeed, this is a deceptively hard problem that I don't think anybody has solved perfectly yet. The main problem with the deep learning route is that it is resource-demanding (expensive in both computation and memory). Hopefully these problems will go away in a couple of years as mobile devices become more powerful and deep learning architectures get more lightweight.


Have you tried Office Lens by Microsoft? The scanner part of the app is brilliant


The problem is data. I am not sure how to easily collect enough of it for this (on the order of 50k images or so; we wouldn't need to train from scratch).


Try the latest update of Genius Scan; it's more like 80% :)


Can someone shed further light on the Hough transform image used in the article [0]? I can't make sense of why the Hough transform of the Canny edge image looks like that. Are they using an adaptive Hough transform?

[0] https://blogs.dropbox.com/tech/2016/08/fast-and-accurate-doc...


Very good question. As stated in the blog post (one line above that figure), we actually used the polar parametrization r = x·sinθ + y·cosθ rather than the slope-intercept version y = mx + b.

If we used y = mx + b, the Hough transform image would look like many straight lines intersecting at a few points, which is the most intuitive picture. The issue with this form is that it becomes ill-conditioned when the line is near vertical (m goes to infinity).

The polar parametrization r = x·sinθ + y·cosθ solves this problem; in the Hough space, the axes are r and θ. A point in image space maps to a sinusoid in Hough space, which is why the transformed image looks like that.
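A tiny illustration of that mapping, using the same r = x·sinθ + y·cosθ parametrization: each edge point votes along a sinusoid in the (r, θ) accumulator, and collinear points pile up in one cell. All sizes here are arbitrary toy values:

```python
import numpy as np

def hough_accumulate(points, n_theta=180, r_max=10.0, r_bins=41):
    """Vote edge points into (r, theta) space; each point contributes
    one sinusoid r(theta) = x*sin(theta) + y*cos(theta)."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((r_bins, n_theta))
    for x, y in points:
        r = x * np.sin(thetas) + y * np.cos(thetas)
        # map r in [-r_max, r_max] to a bin index and cast one vote
        # per theta column -- the sinusoidal trace seen in the figure
        idx = np.round((r + r_max) / (2 * r_max) * (r_bins - 1)).astype(int)
        acc[idx, np.arange(n_theta)] += 1
    return acc

# four collinear points on the horizontal line y = 5; their sinusoids
# all pass through the same (r, theta) cell, producing a strong peak
acc = hough_accumulate([(0, 5), (1, 5), (2, 5), (3, 5)])
```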


Thanks for the additional input, this is really fascinating. I now remember having seen this in the Hough transform functions in OpenCV [0] but could never make sense of how it relates to the real world.

[0]: http://docs.opencv.org/2.4/doc/tutorials/imgproc/imgtrans/ho...


“To overcome these shortcomings, we used a modern machine learning-based algorithm. The algorithm is trained on images where humans annotate the most significant edges and object boundaries. “

-- did Dropbox use Mechanical Turk for this?


My first thought was: why not use simple edge detection with connected components? Assume the document of interest is the most prominent, highly connected feature. Discard high-frequency (short) line segments that do not connect to form the largest quadrilateral.

Further segmentation could be done by having the user "tap" to select the document.
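A sketch of that heuristic using SciPy's connected-component labeling, with a simple "largest blob wins" rule standing in for the quadrilateral check described above:

```python
import numpy as np
from scipy import ndimage

def largest_component_mask(edges):
    """Keep only the largest connected component of a binary edge
    map -- a rough stand-in for 'the document is the most prominent
    connected feature'."""
    labeled, n = ndimage.label(edges)
    if n == 0:
        return np.zeros_like(edges, dtype=bool)
    sizes = np.bincount(labeled.ravel())
    sizes[0] = 0  # ignore the background label
    return labeled == sizes.argmax()

mask = np.zeros((20, 20), dtype=bool)
mask[2:10, 2:10] = True    # big blob (the "document")
mask[15:17, 15:17] = True  # small noise blob
biggest = largest_component_mask(mask)
```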


I'm not sure if Evernote uses the same technique, but in my experience the Evernote app scans documents pretty well most of the time.


Very interesting; I've tried to do this a few times, but you really need a corpus of labelled images to do this properly.


Ok, now the standard question: when will this feature arrive in the Android mobile app? It is really cool!


Are the detection steps mentioned happening on the actual device, or server-side?


It's definitely happening on the device. Document recognition like this moved onto the device about 3 years ago, and in fact if they didn't do this device side they would have a harder time dealing with the Mitek patents[1] that are in this space.

The actual OCR and data extraction likely occurs on the server side, but the document recognition on device is a much better user experience.

[1] USAA and Mitek were suing each other over the patents from 2012-2014.


Yep, we do the entire document detection and other following steps (to be described in coming posts) on the mobile device.


Does anyone know of an open-source implementation of a similar pipeline?


Having worked with several of the commercial products in this space, I can say almost all of them lean on OpenCV for the hard parts, and I'd be surprised if this one didn't either.


Can't wait for the machine learning part of these blog posts. Hopefully they go into more detail.


2 Fast 2 Accurate



