Optical Character Recognition (OCR) refers to technologies capable of capturing text elements from images or documents and converting them into a machine-readable text format. If you want to learn more on that topic, this article is a good introduction.

At Mindee, we have developed an open-source Python-based OCR called DocTR; however, we also wanted to deploy it in the browser to ensure that it is accessible to all developers, especially as ~70% of developers choose to use JavaScript. We managed to achieve this using the TensorFlow.js API, which resulted in a web demo that you can now try for yourself using images of your own.

The demo interface with a picture of 2 receipts being parsed by the OCR: 89 words were found here

This demo is designed to be very simple to use and to run quickly on most computers, so we provide a single pretrained model trained with a small (512 x 512) input size to save memory. Images are resized to squares, so the model generalizes well to most documents with an aspect ratio close to 1: cards, small receipts, tickets, A4 pages, etc. For rectangles with a very high aspect ratio, segmentation results might not be as good, because we do not preserve the aspect ratio (with padding) at the text detection step.

Keep in mind that these models have been designed to offer good performance while running in the browser. The demo is optimized to work on documents with a significant word size (for example receipts, cards, etc.). Hence, performance might not be optimal on documents whose writing is very small relative to the document, or on images with a very high aspect ratio.

OCR models can be divided into two parts: a detection model and a text recognition model. In DocTR, the detection model is a CNN (convolutional neural network) that segments the input image to find text areas; text boxes are then cropped around each detected word and sent to a recognition model. The second model is a convolutional recurrent neural network (CRNN), which extracts features from word images and then decodes the sequence of letters on the image with recurrent layers (LSTMs).

Global architecture of the OCR model used in this demo

We have different architectures implemented in DocTR, but we chose a very light one for use on the client side, as device hardware can change from person to person. Here we used a mobilenetV2 backbone with a DB (Differentiable Binarization) head. The implementation details can be found in the DocTR GitHub. We trained this model with an input size of (512, 512, 3) to decrease latency and memory usage, on a private dataset composed of 130,000 annotated documents.

The recognition model we used is also our lighter architecture: a CRNN (convolutional recurrent neural network) with a mobilenetV2 backbone. It is basically composed of the first half of the mobilenetV2 layers to extract features, followed by 2 bi-LSTMs to decode the visual features as character sequences (words). It uses the CTC loss, introduced by Alex Graves, to decode sequences efficiently. More information on this architecture can be found here.
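The two-stage design described above (detect word boxes, crop each word, run recognition on every crop) can be sketched in a few lines of plain Python. The `detect` and `recognize` callables here are hypothetical stand-ins for the two models, not DocTR's actual API:

```python
def ocr_pipeline(image, detect, recognize):
    """Two-stage OCR sketch: `detect(image)` returns pixel boxes
    (x0, y0, x1, y1); `recognize(crop)` reads a single word crop.
    `image` is a 2-D list of pixel rows; both callables are placeholders
    for the detection and recognition networks."""
    words = []
    for (x0, y0, x1, y1) in detect(image):
        # crop the word region out of the page before recognition
        crop = [row[x0:x1] for row in image[y0:y1]]
        words.append(recognize(crop))
    return words


# toy stand-ins: two fixed boxes and a "recognizer" that counts pixels
page = [[1] * 8 for _ in range(8)]
boxes = lambda img: [(0, 0, 4, 2), (4, 2, 8, 4)]
reader = lambda crop: f"{sum(map(sum, crop))}px"
print(ocr_pipeline(page, boxes, reader))  # one "word" per detected box
```

Decoupling the two stages like this is what lets DocTR swap detection or recognition backbones independently.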
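The DB head mentioned above replaces the hard thresholding step (`probability > threshold`) of a segmentation map with a steep sigmoid, which is approximately binary yet differentiable, so the threshold map can be learned end to end. A minimal sketch of that formula, applied per pixel (the amplifying factor `k=50` follows the DB paper; the pixel values are illustrative):

```python
import math


def db_binarize(prob, thresh, k=50.0):
    """Approximate binarization from Differentiable Binarization:
    a steep sigmoid around the learned threshold, differentiable
    unlike a hard (prob > thresh) step function."""
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


# pixels well above / below the threshold saturate toward 1 / 0
print(db_binarize(0.9, 0.3))  # close to 1.0: confident text pixel
print(db_binarize(0.1, 0.3))  # close to 0.0: background pixel
```

At inference time the sigmoid can be dropped and a plain threshold used, since the approximation only matters for gradient flow during training.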
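The aspect-ratio caveat can be made concrete: force-resizing to a 512 x 512 square applies different scale factors on each axis, distorting elongated documents, whereas a letterbox resize (scale to fit, then pad) would preserve the shape. This is an illustrative sketch, not DocTR's preprocessing code:

```python
def stretch_factor(w, h, target=512):
    """Ratio between the x- and y-scales when force-resizing to a square.
    1.0 means the shape is preserved; a 4:1 receipt comes out 4x distorted."""
    sx, sy = target / w, target / h
    return max(sx, sy) / min(sx, sy)


def letterbox(w, h, target=512):
    """Aspect-preserving alternative: scale to fit inside the square,
    then pad the remainder. Returns the content size before padding."""
    scale = target / max(w, h)
    return round(w * scale), round(h * scale)


print(stretch_factor(512, 512))   # 1.0: square documents are unaffected
print(stretch_factor(2048, 512))  # 4.0: a long receipt gets badly squashed
print(letterbox(2048, 512))       # (512, 128) content area, padded to 512x512
```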
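The CTC decoding used by the recognition model can be illustrated with its simplest (greedy) variant: take the argmax character at each timestep of the bi-LSTM output, merge consecutive repeats, and drop the blank token that CTC inserts between genuine repeats. The tiny vocabulary and probability frames below are made up for illustration:

```python
def ctc_greedy_decode(logits, vocab, blank=0):
    """Greedy CTC decoding: per-timestep argmax, then collapse repeated
    symbols and remove the blank (index 0 here). `vocab` holds the
    characters for classes 1..N."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    chars, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(vocab[idx - 1])
        prev = idx
    return "".join(chars)


# 6 timesteps over vocab "to" plus blank: t t <blank> o o <blank>
frames = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.9, 0.05, 0.05],
          [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]
print(ctc_greedy_decode(frames, "to"))  # "to"
```

The blank is what allows true double letters: `o <blank> o` decodes to "oo", while `o o` collapses to a single "o". Beam-search decoders improve on this greedy pass but follow the same collapse rule.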