LayoutLMv3 is a pre-trained transformer model published by Microsoft that can be used for a variety of document AI tasks.
LayoutLMv3 incorporates both text and visual image information into a single multimodal transformer, making it quite good at both text-based tasks (form understanding, ID card extraction, and document question answering) and image-based tasks (document classification and document layout analysis).
If you’d like to learn more about what LayoutLMv3 is, you can check out the paper or the GitHub repo.
Many great guides exist on how to train LayoutLM on common large public datasets such as FUNSD and CORD. If that is what you are looking for, we recommend taking a look at this great Python notebook that Niels Rogge from the Hugging Face team has put out.
When it comes to building a document processing model for real-world applications, you’ll typically be training a model on your own custom documents, rather than one of these common datasets.
This guide is intended to walk you through the process of training LayoutLM on your own custom documents.
We’ll walk through a Python notebook that covers:
- Annotating a custom document dataset in Butler
- Downloading the annotations and preparing them for use with the transformers library
- Fine-tuning LayoutLMv3 on the custom dataset
- Evaluating the trained model and visualizing its predictions
For the purposes of this guide, we’ll train a model for extracting information from US Driver’s Licenses, but feel free to follow along with any document dataset you have.
If you just want the code, you can check it out here.
Let’s get to it!
To start off, let’s install all of the necessary packages:
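Something along these lines should do it; the exact package list is an assumption (in particular, check the Butler docs for the precise name of the DocAI Python SDK package):

```python
# Package names beyond transformers/datasets/seqeval are assumptions; confirm the
# DocAI SDK package name in the Butler documentation.
!pip install -q transformers datasets seqeval butler-sdk
```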
Now that our environment is set up, we’ll need to prepare our annotated document dataset so that we can use it during training.
We’ll use the Butler document annotation interface for this purpose. If you haven’t already, head over to the Butler app and create your account.
We won’t cover the full details of how to use the Butler UI to annotate documents in this guide (feel free to check out the documentation if you want to learn more), but we’ll cover most of the key points.
Once you have created your first model and given it a helpful name, you’ll want to upload your unannotated documents. Typically when training a LayoutLM model, you’ll want a training dataset of a few hundred annotated documents.
For this guide, we’ll get started with just 50 example driver’s licenses. We can always annotate more after we see how the model initially performs.
Once the initial OCR has finished on your documents, you should see the documents loaded and ready for annotation.
The next step is to define what information you’d like to extract from your documents. We’ll be extracting the following fields from our driver’s licenses:
Click on the Add button to add each individual field. Once finished, your extraction schema is defined, and you are ready to start annotating:
To annotate, make sure the right field is selected and simply drag a box around the text in the document. You can also click on individual pieces of text if you’d like.
Once your first document is annotated, you can deselect the active field, and you’ll be able to see all of the values you annotated on the document.
Great work! Now we’ll go through and annotate the remaining documents.
A few quick hotkeys you might find helpful:
Now that we have our annotated dataset, we need to download it from Butler, so that we can prepare it for use with LayoutLM in the transformers library. This is very easy to do with the DocAI Python SDK:
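A minimal sketch of that download step; the exact import path may differ, and the API key and model id are placeholders:

```python
# Sketch: the import path and argument names are assumptions based on the SDK docs.
from butler import load_annotations  # hypothetical import path

API_KEY = "<your-api-key>"
MODEL_ID = "<your-model-id>"

annotations = load_annotations(api_key=API_KEY, model_id=MODEL_ID)
```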
First, we download all the annotations from the model we created in Butler. You can follow this guide to find your API Key and this guide to find your model id.
The load_annotations function returns an Annotations object, which enables you to convert your annotations into different formats.
We’ll then need to convert them into a format more readily usable by the transformers library and LayoutLMv3 specifically.
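A sketch of what that conversion looks like (the method and helper names here are assumptions about the SDK’s API, not its documented interface):

```python
# Sketch: method and helper names are assumptions about the DocAI SDK's API.
ner_annotations = annotations.as_ner()                               # convert to a common NER format
ner_annotations = normalize_ner_annotation_bboxes(ner_annotations)   # scale boxes into the 0-1000 range
```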
There is a lot actually occurring in these two simple lines of code.
First, we convert our annotations into a common NER format. Let's take a look at what a single converted record looks like, along with a brief description of each field:
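The exact field names depend on the SDK’s output, so treat this record as illustrative:

```python
# Illustrative only: the field names are assumptions about the SDK's NER format.
{
    "id": "0",                                   # unique id for the document
    "tokens": ["DRIVER", "LICENSE", "DOE"],      # the OCR'd words on the page
    "bboxes": [[71, 52, 190, 68],                # one bounding box per word
               [199, 52, 340, 68],
               [62, 110, 120, 130]],
    "ner_tags": ["O", "O", "B-LAST_NAME"],       # one label per word (IOB-style)
    "image": "path/to/document.png",             # the document image (or a PIL image)
}
```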
Note: The load_annotations function does not currently support multi-page documents. Only the first page of a multi-page document is included.
After loading in the NER format, we normalize the bounding boxes so that they are between 0 and 1000, which is the format that LayoutLMv3 expects.
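That normalization amounts to scaling each pixel coordinate by the page size. A minimal sketch of the idea (not the SDK’s actual implementation):

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel-space box into the 0-1000 range LayoutLMv3 expects."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
```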
Finally, we load our annotations into a Hugging Face Dataset object:
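Assuming the normalized annotations are a list of dicts like the record sketched above, this looks roughly like:

```python
from datasets import Dataset

# Build a Hugging Face Dataset from the list of NER-format records
dataset = Dataset.from_list(ner_annotations)
print(dataset)
```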
We are now ready to train our model!
Let’s create a few helpful variables, as well as prepare our dataset for training:
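A sketch of those variables, assuming the labels are stored as IOB-style strings in a ner_tags column as shown earlier:

```python
# Collect the label set from the data and build the id <-> label mappings.
labels = sorted({tag for example in dataset for tag in example["ner_tags"]})
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Map the string tags to integer ids and split into train / eval sets.
dataset = dataset.map(lambda e: {"ner_tags": [label2id[t] for t in e["ner_tags"]]})
dataset = dataset.train_test_split(test_size=0.2)
```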
Now our dataset is officially ready for training! From here on out, we’ll be following the guide put out by the Hugging Face team on how to fine-tune LayoutLM (here it is for reference).
First, load the layoutlmv3-base processor from the Hugging Face Hub:
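```python
from transformers import AutoProcessor

# apply_ocr=False because our dataset already contains words and boxes from Butler's OCR
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
```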
Then prepare the train and eval datasets:
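Following the Hugging Face fine-tuning guide, we encode each example with the processor. The column names below (image, tokens, bboxes, ner_tags) assume the NER format sketched earlier, and the image column is assumed to hold PIL images:

```python
from datasets import Array2D, Array3D, ClassLabel, Features, Sequence, Value

def prepare_examples(examples):
    # Encode images, words, boxes, and word-level labels in one pass
    return processor(
        examples["image"],
        examples["tokens"],
        boxes=examples["bboxes"],
        word_labels=examples["ner_tags"],
        truncation=True,
        padding="max_length",
    )

# Declare the output features so the encoded columns have fixed shapes/dtypes
features = Features({
    "pixel_values": Array3D(dtype="float32", shape=(3, 224, 224)),
    "input_ids": Sequence(Value(dtype="int64")),
    "attention_mask": Sequence(Value(dtype="int64")),
    "bbox": Array2D(dtype="int64", shape=(512, 4)),
    "labels": Sequence(ClassLabel(names=labels)),
})

train_dataset = dataset["train"].map(
    prepare_examples,
    batched=True,
    remove_columns=dataset["train"].column_names,
    features=features,
)
eval_dataset = dataset["test"].map(
    prepare_examples,
    batched=True,
    remove_columns=dataset["test"].column_names,
    features=features,
)
```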
Now we are ready to define the evaluation function. We’ll use a utility function from the DocAI SDK to do so:
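We won’t reproduce the SDK helper here, but for reference, a hand-rolled seqeval-based metrics function looks roughly like this:

```python
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score

def compute_metrics(p):
    predictions, label_ids = p
    predictions = np.argmax(predictions, axis=2)

    # Drop the -100 entries the processor uses for special / ignored tokens
    true_predictions = [
        [labels[pred] for pred, lab in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, label_ids)
    ]
    true_labels = [
        [labels[lab] for pred, lab in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, label_ids)
    ]

    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
```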
Once that is done, we can define the model and training parameters:
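A sketch of the model and Trainer setup; the hyperparameters here are reasonable starting points rather than the exact values from the notebook:

```python
from transformers import LayoutLMv3ForTokenClassification, Trainer, TrainingArguments
from transformers.data.data_collator import default_data_collator

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="layoutlmv3-drivers-license",  # example output directory
    max_steps=1000,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=1e-5,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
```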
There are a lot of different hyperparameters that can be tuned. Check out the Hugging Face documentation here to read through all of them.
Finally, the best part:
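```python
trainer.train()
```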
If everything was done correctly, your model will begin training.
Depending on the machine, this may take a little while, so sit back and relax. You can also explore some of the pre-trained document extraction models we have ready for use at Butler if you’d like :)
Once training finishes, you should see the metrics about how it performed on the eval dataset:
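If you want to pull the final numbers out explicitly, you can also re-run evaluation yourself:

```python
metrics = trainer.evaluate()
print(metrics)
```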
You can see our model achieved near-100% accuracy, which is quite good. Given that our dataset was a relatively simple set of example US driver’s license images, this high accuracy is what we’d expect.
Even for more complicated use cases, LayoutLM can still reach impressively high accuracy!
If you want, you can publish your model to the Hugging Face hub for future use:
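One way to do that, assuming you are already authenticated with the Hub (the repo name below is just an example):

```python
# Push both the fine-tuned model and the processor; the repo name is a placeholder.
model.push_to_hub("your-username/layoutlmv3-finetuned-drivers-license")
processor.push_to_hub("your-username/layoutlmv3-finetuned-drivers-license")
```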
Now let’s try using our model to make predictions on a document. First we’ll generate the predictions:
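A sketch of running inference on one held-out example; this assumes the raw example still carries its image (as a PIL image), words, and boxes in the NER format from earlier:

```python
import torch

example = dataset["test"][0]

# Re-encode the raw example so we can run the model on it directly
encoding = processor(
    example["image"],
    example["tokens"],
    boxes=example["bboxes"],
    return_tensors="pt",
    truncation=True,
)
encoding = {k: v.to(model.device) for k, v in encoding.items()}

with torch.no_grad():
    outputs = model(**encoding)

# One predicted label id per token, plus the (normalized) box for each token
predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding["bbox"].squeeze().tolist()
```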
Before visualizing the predicted results, let’s define a few utility functions to make this a bit easier:
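A couple of small helpers, sketched with PIL; adapt them to taste:

```python
from PIL import ImageDraw

def unnormalize_box(bbox, width, height):
    """Convert a 0-1000 normalized box back into pixel coordinates."""
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

def draw_boxes(image, boxes, labels):
    """Draw each box and its label on a copy of the document image."""
    image = image.copy()
    draw = ImageDraw.Draw(image)
    for box, label in zip(boxes, labels):
        if label == "O":
            continue  # skip tokens that aren't part of any field
        draw.rectangle(box, outline="red", width=2)
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return image
```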
Then let’s visualize the predicted results:
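```python
# Map predicted ids to labels and boxes back to pixel space, then draw them
image = example["image"]
width, height = image.size

pred_labels = [id2label[p] for p in predictions]
pred_boxes = [unnormalize_box(box, width, height) for box in token_boxes]

draw_boxes(image, pred_boxes, pred_labels)
```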
And compare against the actual ground truth values:
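```python
# Draw the annotated ground truth labels for the same document
gt_labels = [id2label[tag] for tag in example["ner_tags"]]
gt_boxes = [unnormalize_box(box, width, height) for box in example["bboxes"]]

draw_boxes(image, gt_boxes, gt_labels)
```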
Comparing the two, we can see just how accurate our LayoutLM model is.
Great work! As you can see, LayoutLM is a powerful multimodal model that you can apply for many different Document AI tasks.
In this tutorial you:
- Annotated a custom dataset of US driver’s licenses in Butler
- Downloaded the annotations and converted them into a Hugging Face dataset
- Fine-tuned LayoutLMv3 on your own documents
- Evaluated the trained model and visualized its predictions
You can find the full code available at our Train LayoutLM on a Custom Dataset notebook.