Digitizing an invoice using OCR

In this example, you create a Digitization project to label invoices using OCR data.

Project Overview

You want to create a project to digitize invoices using OCR.

To create this project, you must perform the following tasks:

  1. List out your project requirements.

  2. Identify sample invoice images that you can use for labelling.

  3. Configure project metadata.

  4. Manage your project input and output fields.

  5. Create the workflow that you want to implement in your project.

  6. Configure advanced settings (OCR pre-processing).

  7. Add users to your project and assign them project roles.

  8. Add a dataset to the project.

  9. Start labelling.

This document explains how you can perform each of the tasks listed above.

Listing Project Requirements

In this project, you want to:

  • Identify and label invoice data from the invoice images.

Sample Input Data

This example uses images uploaded through the Media option. You can also import images from the cloud.

Configuring Project Metadata

Project Metadata is the first tab that appears when you create a project. The Project Metadata tab enables you to provide basic information, such as the name, process, and project type, associated with your project. You can also upload any documentation that you may want to add to your project.

  1. To create the project, click the Create Project floating button on the left side of the Projects page.
    The Project Metadata tab associated with your new project appears.

  2. Enter InvoiceLabellingProject as the Project Name and Labelling as the Process.

  3. Click Project Type > Digitization.

  4. Enable Project Pipeline is False by default. Do not change this setting for the current project.

  5. Click Next.
    The Documents sub-tab appears.

  6. Optionally, upload documents associated with the project. You can skip this step for now.
    Click Next.

The Task Design tab appears. Use this tab to manage your project input and output fields.

Managing Project Input and Output Fields

Project input and output fields are key elements that determine what happens in your project. The input fields that you specify here appear as the available input options in your project. Similarly, the output fields that you configure here appear as output options in your project execution UI. In other words, your project can only accept and produce data associated with the input and output fields that you create here.

Taskmonk uses the project type that you specify to add input and/or output fields to projects as required. You can modify these later. In this instance, you selected Digitization as the Project Type, and Taskmonk adds the following fields to the Task Design tab:

  • Input Field

    • Field Name: MediaUrl, Field Type: Image

  • Output Field

    • Field Name: Annotations, Field Type: Annotation, Mandatory: False, Disabled: False, Customer Visible: True

Updating Input Field Details

  1. Click the Input Field tab to display the Input Field UI.

  2. By default, Taskmonk sets the field type to Image.

  3. Add one more input field for the project using the CREATE INPUT FIELD option:

  • Input Fields

    • Field Name: annotation, Field Type: Text

Updating Output Field Details

  1. Click the Output Field tab to display the Output Field UI.

  2. Add more output fields for the project. To do so, click CREATE OUTPUT FIELD, enter the Name and Output Field Name, choose values for Set Data Type and Select Format, and then click CREATE.

    For this example, create the following three output fields:

  • Name: Seller Name, Output Field Name: Seller Name, Set Data Type: Text, Select Format: Any, All Levels: Checked

  • Name: Seller State, Output Field Name: Seller State, Set Data Type: Text, Select Format: Any, All Levels: Checked

  • Name: Seller Address, Output Field Name: Seller Address, Set Data Type: Text, Select Format: Any, All Levels: Checked

  3. Create the sub fields. To do so, click SUB FIELDS under Possible Values for Annotations. A modal appears.

    For example, enter Serial No as the Field Name, select Text as the Field Type, and click ADD. For this project, create the following sub fields:

  • Field Name: Serial No, Field Type: Text

  • Field Name: Descriptions of Goods, Field Type: Text

  • Field Name: HSN/SAC, Field Type: Text

  • Field Name: Quantity, Field Type: Text

  • Field Name: Rate, Field Type: Text

  • Field Name: Discount, Field Type: Dropdown, Possible Values: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%

  • Field Name: Amount, Field Type: Text

Click on OK to save all the sub fields.

Click Next to move to the next step, Quality Workflows.
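The sub fields configured above can be pictured as a small schema. The following Python sketch is illustrative only: it is not a Taskmonk API, and the names SUB_FIELDS and validate_row are hypothetical. It shows how the Dropdown type restricts Discount to its configured Possible Values, while Text fields accept any value.

```python
from typing import Dict, List, Tuple

# Hypothetical model of the sub fields from this example (not a Taskmonk API).
# Each entry maps a field name to (field type, possible values).
SUB_FIELDS: Dict[str, Tuple[str, List[str]]] = {
    "Serial No": ("Text", []),
    "Descriptions of Goods": ("Text", []),
    "HSN/SAC": ("Text", []),
    "Quantity": ("Text", []),
    "Rate": ("Text", []),
    "Discount": ("Dropdown", [f"{p}%" for p in range(10, 101, 10)]),
    "Amount": ("Text", []),
}


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return the problems found in one table row; an empty list means no problems."""
    problems = []
    for name, value in row.items():
        if name not in SUB_FIELDS:
            problems.append(f"unknown field: {name}")
            continue
        field_type, possible_values = SUB_FIELDS[name]
        # Only Dropdown fields restrict the values that can be entered.
        if field_type == "Dropdown" and value not in possible_values:
            problems.append(f"{name}: {value!r} is not one of the possible values")
    return problems
```

For instance, a row with Discount set to 15% would be reported as a problem, because 15% is not among the configured dropdown values.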

Creating Quality Workflows

The Quality Workflows tab enables you to specify how you want to ensure output quality. It also helps you create the execution levels required for your quality workflows. For example, if you want to have a QA analyst reviewing labels, you can create this role using this tab.

  1. In this instance, you want to enable a Maker-Editor workflow. Click the Execution Method field and select Maker-Editor from the drop-down list that appears.

  2. By default, Taskmonk creates the Analyst role for you. This is the role that performs the labelling. To add another role, such as QA, click the Add Execution Level button on the right side of the page. For this project, keep only the default role. To learn more about adding quality workflows, refer to Configuring Project Quality Workflows.

Configuring Advanced Settings

Digitization projects can be labelled faster with the OCR pre-processor, which you can configure using the following steps:

Navigate to Advanced settings > Project Task > Task Properties > Task Pre-processing. Click ADD, select Optical Character Recognition, click Next, and choose ‘Optical Character Recognition’.

Configure the fields as below and click on ADD.

If the input data is a PDF instead of an image, configure an additional pre-processor to extract the images from the PDF, because OCR can run only on images. For details, refer to Extract images from PDF import for Digitization.

Managing Users and Roles

You must now add users to your project and assign the execution levels you just created to them.

  1. Click the Users tab just above the Quality Workflows tab.
    The Users > Manage Users tab appears.

  2. Click the Add button in the top-right section of the tab.
    The Select Users tab appears.

  3. Corresponding to each execution level, click the Select Users field and select the desired user from the drop-down list that appears.

  4. Click Add to add the selected users to the project.

  5. Close the modal. The Manage Users tab reloads to display the updated user details.

Managing Project Datasets

Your project is now configured. Congratulations!

Before you can start labelling, you must upload the invoice images.

If you want to import your images from the cloud, you must configure the cloud before creating a batch. To learn more about cloud configuration, refer to Managing Project Datasets. You can also import images using the Media option, which is what this example uses.

  1. Click the Datasets tab. The Datasets page appears. Use this page to manage datasets for your project.

  2. Taskmonk organizes datasets into batches to simplify management and tracking. To add a new dataset, click Add Batch on the right side of the page. The Add Batch modal appears.

  3. Enter Invoice1 as the name for the batch that you want to import in the Add New Batch field. You can ignore the other fields.

  4. Click Submit. This creates a new batch of data for your project and adds it to the Pending tab of the Datasets page. You can now upload datasets into the batch, as required.

  5. To add a dataset to the batch, click the Import button under the Tasks (Import/Export) column.
    The Import Task modal appears.

  6. Select the Media radio button.

  7. Choose the images that need to be uploaded and click IMPORT.

  8. Once the dataset is imported, click Close to exit the modal.
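Before a media import, it can help to check which files in a local folder are actually images. The following Python sketch is a local helper only and does not call Taskmonk. The function name list_invoice_images and the set of extensions are assumptions, so confirm which image formats your project accepts.

```python
from pathlib import Path
from typing import List

# Assumed set of image extensions; adjust to the formats your project accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_invoice_images(folder: str) -> List[Path]:
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

You can then select the listed files in the Import Task modal. Files with other extensions, such as PDFs, are left out of the list.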

Invoice Labelling Using Taskmonk

Your project is now ready for work.

  1. Log in as an analyst and click the My Tasks icon at the top of the page.
    The Tasks page appears.

  2. Click the Get Tasks button adjacent to the project you wish to work on.
    The labelling UI associated with this project appears.

The output fields and the sub fields (the configured table) are displayed on the right of the task page. Click ADD TABLE DATA to create a new row in the table, and fill in the appropriate values using the OCR data.
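Because OCR output can misread digits, a quick arithmetic check on each labelled row may catch errors. The following Python sketch is not part of Taskmonk. It assumes that Amount equals Quantity × Rate less the percentage Discount, with no tax included, which may not hold for every invoice. The function names are hypothetical.

```python
from decimal import Decimal
from typing import Dict


def expected_amount(quantity: str, rate: str, discount: str) -> Decimal:
    """Compute quantity * rate less the percentage discount, rounded to 2 decimal places.

    Assumption: the invoice amount excludes tax and other charges.
    """
    percent = Decimal(discount.rstrip("%"))
    gross = Decimal(quantity) * Decimal(rate)
    net = gross * (Decimal(100) - percent) / Decimal(100)
    return net.quantize(Decimal("0.01"))


def amount_matches(row: Dict[str, str]) -> bool:
    """Check whether the labelled Amount agrees with the other values in the row."""
    labelled = Decimal(row["Amount"]).quantize(Decimal("0.01"))
    return labelled == expected_amount(row["Quantity"], row["Rate"], row["Discount"])
```

A row that fails this check is not necessarily wrong, but it is worth comparing against the invoice image again.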

 

© 2020 Taskmonk Technology Pvt. Ltd. All Rights Reserved.