Table Extraction using Deep Learning

Published in Analytics Vidhya · 15 min read · Jun 20, 2021

Building a deep learning model with TensorFlow to extract tabular data from an image.

A table is a useful structural representation that organizes data into rows and columns and aims to capture the relationships between different elements and attributes in the data.

Tables have become a part of our daily lives: from basic transactions to complicated analysis findings, they have become a part of every paper and document where there is even the tiniest need to reflect information in a concise manner.

Technological advances in the internet age, together with the inconvenience of maintaining documents as hard copies, have necessitated the digitization of documents, events, transactions and other real-world records, including images and other multimedia information. As a result, there has been a massive influx of unstructured data. Keeping track of tabular information and its structural semantics in this unstructured pool by hand has become costly and time-consuming, so the only practical way to deal with the problem is with the help of a computer.

But using a computer has a major drawback. The main issue is that there is a lot of digitized unstructured data in the world today, and in this unstructured realm, these tables don’t have their original essence to capture relationships; they can only represent it visually. In terms of human comprehension, the difference between these visual representations and structural representations of tables is negligible. A table in its unstructured form, on the other hand, is nothing more than a patch of an image to a computer.

Addressing the problem, our main goal will be comprehending a table by isolating it from an unstructured format and restoring its original form using computer vision.

Road Map

This section outlines the structure and provides a very high-level view of the main stages that will be covered in this blog.

Pre-requisites
Problem understanding
Data collection
Analysis and pre-processing
Related works
Development
Conclusions
Discussions and future works

So, let’s get started…

1. Pre-requisites

Here is a list of pre-requisites that are needed (or rather, helpful) in building similar deep learning based projects.

Virtual environments, python, pip, machine learning and deep learning concepts.
TensorFlow: visit the official documentation to get started.
Streamlit: one can learn about Streamlit from this video tutorial by JCharis on YouTube.

Don’t worry if some of these pre-requisites are not yet in your toolbox; please read on. One can always come back and learn what is required!

2. Problem understanding

2.1. Overview

The objective of this study is to develop a system that takes an image as input and uses computer vision to extract the information from any tables present in it. The task can be thought of as having four major steps: i) detecting the presence of a table in an image, ii) localizing the table in the image, iii) decoding the structural relations among table cells and iv) understanding the text inside each cell.

This task may seem trivial, even to a five-year-old, but on closer thought we realize that detecting a table and localizing it on a sheet of paper does not follow any hard and fast rules. It is done automatically by the visual cortex (abstracted away from our conscious understanding) as part of our cognition.

2.2. Use of deep learning

For over a decade, computer vision has recognized the potential of deep learning. When compared to previous techniques, deep learning has demonstrated promising outcomes in problems that need some amount of cognition and cannot be addressed using rule-based approaches.

In this study we will use convolutional neural networks (deep learning models based on parameter sharing) to detect and localize a table in a given image, and an off-the-shelf OCR algorithm to extract the text from the detected table.

Table detection and localization can be framed as an image segmentation problem in which the system must separate the table region from the rest of the picture by predicting a mask image of the table.

2.3. Problem constraints

Some of the main constraints that have to be managed for this problem:

i) Low latency: since the solution is intended to be applied to large batches of document images, the model should not take long to extract a table.

ii) Data variation: this is one of the key constraints in this domain. In this era of the internet and IoT, the model should accept data from a variety of sources and handle various kinds of transformed images (sheared, rotated, noisy, etc.).

iii) Moderate space requirements: the model should have low memory needs. Devices can access it either directly through an application (a tight memory budget) or through an API, which relaxes the memory constraint to some extent.

3. Data collection

Deep learning is a data-driven technique that requires a large amount of quality training data to be effective. TableBank, ICDAR, and Marmot are three significant research efforts that gather data for studying and interpreting tables in images, PDFs, and other portable formats, and they are accelerating research in table comprehension by open-sourcing their datasets.

For this study, we will use the Marmot dataset to develop our model. For each image, the dataset provides the bounding boxes of its tables as annotations in XML format.

Click here to download the dataset.
Note that we will also be using another version of the same dataset annotated by the TableNet team. This version adds column-level bounding box annotations to the original Marmot data. For convenience, we will refer to the original dataset as marmot_v1 and the latter as marmot_extended throughout this blog.

4. Analysis and Pre-processing

We have extracted the dataset (marmot_v1) after downloading it from the given URL. Now we’ll take a look at the extracted directory to see how the dataset is organized.

4.1. Experimental observations

The main folder contains a folder data that actually holds the data along with some files describing the dataset.

The Marmot dataset is composed of Chinese and English pages, which are kept in separate folders.
Both of these folders (Chinese and English) have the same directory structure, each consisting of folders named positive and negative.
For the sake of simplicity, we will consider only the English folder. Both the positive and negative directories contain two folders, namely Labeled and Raw.

Number of files in Positive/raw directory and some of its contents

Number of image data points and number of xml annotations

Now, let’s look at a sample data point from the set

Given below is its corresponding annotation file

The table and its bounding box are defined by the (highlighted) Composites tag. Note, however, that the BBox attribute holds a long hexadecimal sequence, so we have to convert this sequence to a meaningful numeric form. We will handle this in the next part (4.2) of this section.
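
To make the hexadecimal BBox values concrete, here is a minimal sketch of one way to decode them. It assumes each 16-digit hex word is a big-endian IEEE 754 double; the sample XML string is a simplified, hypothetical stand-in for a real annotation file, and the names `hex_to_float` and `read_table_boxes` are illustrative rather than the ones used in the original project.

```python
import struct
import xml.etree.ElementTree as ET

# Simplified, hypothetical annotation snippet; real files carry more attributes.
SAMPLE_XML = """
<Page>
  <Composites Label="Table">
    <Composite Label="Table"
               BBox="4059000000000000 4069000000000000 4072c00000000000 4079000000000000"/>
  </Composites>
</Page>
"""


def hex_to_float(hex_word):
    """Decode one 16-digit hex word as a big-endian IEEE 754 double (assumed encoding)."""
    return struct.unpack("!d", bytes.fromhex(hex_word))[0]


def read_table_boxes(xml_text):
    """Return one list of four floats per BBox found under a 'Table' group."""
    root = ET.fromstring(xml_text)
    boxes = []
    for group in root.iter("Composites"):
        if group.get("Label") != "Table":
            continue
        # iter() also visits the group itself, so a BBox on either level is found.
        for element in group.iter():
            bbox = element.get("BBox")
            if bbox:
                boxes.append([hex_to_float(word) for word in bbox.split()])
    return boxes


print(read_table_boxes(SAMPLE_XML))  # [[100.0, 200.0, 300.0, 400.0]]
```

The decoded values may still need to be mapped to pixel coordinates of the page image before they can be drawn as a mask; that step is not shown here.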

Now let’s take a look at the marmot_extended dataset.

These annotations contain the bounding box of each column of the tables in an image.
This dataset only contains the images from the English/Positive subfolder of the original dataset.
There are 495 annotation files but 509 bitmap images, so the XML files for 14 images are missing.

We have already seen a typical image from the dataset (marmot_v1), so let’s check only the layout of an annotation file from marmot_extended.

XML annotation for marmot_extended annotated by TableNet team

These new annotations only consist of bounding boxes for each column present in the image. Each object tag in the XML file denotes a column, and the bndbox tag under each object holds the rectangular coordinates of the bounding box of that particular column.
We have already seen that the annotations in marmot_v1 contain only table-level annotations.
We will use these two types of annotations (table from marmot_v1 and column from marmot_extended) to create mask images for each image present in the dataset.
Since the column-level annotations are missing for 14 data points, we will also ignore the corresponding 14 table-level annotations.
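
As a rough illustration of reading the column-level annotations described above, the sketch below parses object and bndbox tags with Python's standard library. The child tag names (xmin, ymin, xmax, ymax) follow the common Pascal VOC convention and are an assumption here, as are the sample XML string and the function name `read_column_boxes`.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with two columns; the layout is assumed to be Pascal VOC style.
SAMPLE_XML = """
<annotation>
  <object>
    <name>column</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>250</xmax><ymax>500</ymax></bndbox>
  </object>
  <object>
    <name>column</name>
    <bndbox><xmin>260</xmin><ymin>200</ymin><xmax>400</xmax><ymax>500</ymax></bndbox>
  </object>
</annotation>
"""


def read_column_boxes(xml_text):
    """Return a list of (xmin, ymin, xmax, ymax) integer tuples, one per column."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects without a bounding box
        boxes.append(tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ))
    return boxes


print(read_column_boxes(SAMPLE_XML))  # [(100, 200, 250, 500), (260, 200, 400, 500)]
```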

Before proceeding to the pre-processing stage, we will examine the width and height distributions of each image in our dataset.

Distribution of image widths and heights

From the above plot we can see that all images in the dataset have the same width and height, i.e. (816, 1056).

4.2. Pre-processing

For this phase, we must analyze the provided annotations and generate mask images (for both table and column) from them. The pre-processing operations that were followed to accomplish the task are listed below briefly:

Convert the hexadecimal notation in marmot_v1 to the corresponding floating-point values. For this we have defined a function that can be invoked on the fly while reading the annotations.
Read the bounding box data for tables from marmot_v1. This is accomplished by parsing each annotated XML file and locating all Composites tags with the label attribute set to ‘Table’. The Python function defined to perform this task is provided below:

We’ve done the same thing with the marmot_extended annotations to get column-level annotations.
Combine the above two tasks to obtain table- and column-level annotations for all the images (using the function given below).
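
Once table and column boxes are available in pixel coordinates, turning them into mask images is a small rasterization step. The following is a minimal sketch with NumPy, not the original function: `boxes_to_mask` is a hypothetical helper, the example boxes are made up, and the mask value 255 is an arbitrary choice.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Rasterize (xmin, ymin, xmax, ymax) pixel boxes into a binary uint8 mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for xmin, ymin, xmax, ymax in boxes:
        # Clip each box to the image so out-of-range annotations cannot wrap around.
        x0, x1 = max(0, int(xmin)), min(width, int(xmax))
        y0, y1 = max(0, int(ymin)), min(height, int(ymax))
        mask[y0:y1, x0:x1] = 255
    return mask


# Made-up boxes for a page of the size observed in the dataset (width 816, height 1056).
table_boxes = [(100, 200, 700, 500)]
column_boxes = [(100, 200, 250, 500), (260, 200, 400, 500)]

table_mask = boxes_to_mask(table_boxes, height=1056, width=816)
column_mask = boxes_to_mask(column_boxes, height=1056, width=816)
print(table_mask.shape, column_mask.shape)  # (1056, 816) (1056, 816)
```

Keeping the table mask and the column mask as two separate arrays matches the two-output setup used later, where one decoder branch predicts tables and the other predicts columns.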

Let’s visualize one such image along with its table and column mask.

Example image from the dataset along with its processed masks

5. Related works

Now let us take a quick look at various solutions to this problem. This section is not part of the development process, but it provides a broader perspective by exploring similar implementations of ideas taken in this direction; some of these are addressed briefly below:

5.1. The Benefits of Close-Domain Fine-Tuning for Table Detection in Document Images

[Angela Casado-Garcia et al. 2019]

Their research mainly focused on how the performance of transfer learning in computer vision models improves when the learned parameters are transferred between similar tasks rather than dissimilar ones. The work was divided into two stages:

i) Experiments were performed with SOTA object detection models such as YOLO, Mask R-CNN and RetinaNet, which were first trained for object detection on the Pascal VOC dataset (a natural-image dataset); the trained parameters were then fine-tuned using the ICDAR and Marmot data (table detection and recognition data).

ii) Another set of experiments was carried out with the same object detection models, but this time they were first trained on the TableBank dataset (a table detection and recognition dataset) and then fine-tuned with the help of the ICDAR and Marmot datasets.

In both sets of experiments transfer learning was applied, first from a distant domain (natural images) and later from a close domain (table detection). The results showed considerable improvement when close-domain transfer learning was applied for table detection in document images.

5.2. GFTE: Graph Based Financial Table Extraction

[Yiren Li et al. 2020]

The main contributions of the work undertaken by this team can briefly be summarized as:

i) Compiling a Chinese financial dataset, FinTab, from several types of financing documents. The dataset consists of more than 1600 images of different types of tables along with their structural annotations.

ii) Introducing a graph based convolutional neural network model named GFTE. The working of the model is discussed below.

A table in this approach is represented as a graph where each cell is considered a node. These nodes are interconnected using relations: either vertical, horizontal or unrelated. Hence, their problem can be interpreted as predicting the relation between a pair of nodes, given a set of nodes and their features.

They used three types of information as inputs for each node: i) textual content, ii) absolute location and iii) the image of the table. The absolute position features are converted to relative position features, and the textual features are embedded and sent through an LSTM layer to acquire semantic information; the resulting features are then combined and fed to a two-layer Graph Convolutional Network (GCN). Next, node features are calculated from the relative position features and the output of the GCN. In the meantime, the image is preprocessed and features are drawn from it using a three-layer CNN. These node features and image features are then fed to an MLP to model the vertical and horizontal relationships among cells.

5.3. TableNet: Deep Learning Model for End-to-End Table Detection and Tabular Data Extraction from Scanned Document Images

[Paliwal et al. 2020]

In this work they proposed an end-to-end deep learning model, TableNet, for table recognition and provided additional annotations for the Marmot data. The method incorporates transfer learning, since a VGG-19 pretrained on ImageNet is used as the encoder of the proposed model. The fully connected layers at the end of the VGG-19 network are replaced by 1x1 convolutional layers with ReLU activation and dropout. The encoder is followed by two separate convolutional decoder branches, as the model is required to detect the table and recognize its structural semantics simultaneously. Additional layers are used in these decoder branches to filter useful features, and the resulting feature maps are then up-sampled to produce images. A single input image thus produces two labelled output images, one for tables and one for columns. Once the semantically labelled images are obtained, an OCR algorithm is applied to the word positions, and rows are extracted as collections of words using a rule-based row extraction technique.

6. Development

We are now at the point where we will create and train deep learning models using the data that we collected and processed in earlier phases. Let’s get started with our first cut strategy.

6.1. Implementing TableNet in TensorFlow

To tackle this problem, we will use transfer learning and design an architecture inspired by the work of the TableNet team.

TableNet architecture as proposed by Paliwal et al. 2020. Feature maps from the block_3, block_4 and block_5 pooling layers of the VGG19 encoder are used for context aggregation during the decoding phase.

Model Specifications

Unlike the original TableNet architecture (shown in the above figure), which uses a pretrained VGG19 as the base encoder, our model can be made to work with various pre-trained vision models such as VGG, ResNet, DenseNet, EfficientNet and Xception as base encoders.
The model has an option that keeps the encoder weights constant throughout training. In other words, the option can be used to allow or restrict fine-tuning of the encoder.
The model also has a regularization parameter that applies a tf.keras.regularizers regularizer to the convolutional layers at the end of the encoder network and the convolutional layers of the column decoder.
Since the architectures of the pre-trained models mentioned above differ from one another, the layers from which feature maps are drawn for context aggregation during decoding differ from one base encoder to another.
The architecture of the table and column decoders is kept unchanged.

Note: To check which layers are used for context aggregation in the other encoder architectures, refer to the code given below.

Model Definition

The TensorFlow python code required to define the model is provided below.
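
For readers who want a feel for the structure, here is a minimal sketch of a TableNet-style model in tf.keras with a VGG19 encoder only. It is not the model definition from the original project: the function name `build_tablenet`, its parameters, the 1024x1024 default input size, the filter counts and the single-channel sigmoid outputs are all assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_tablenet(input_shape=(1024, 1024, 3), weights="imagenet",
                   freeze_encoder=True, dropout=0.6, l2_factor=None):
    """Hypothetical TableNet-style model: VGG19 encoder plus two decoder branches."""
    reg = tf.keras.regularizers.l2(l2_factor) if l2_factor else None

    encoder = tf.keras.applications.VGG19(
        include_top=False, weights=weights, input_shape=input_shape)
    encoder.trainable = not freeze_encoder  # freeze or allow fine-tuning

    # Pooling outputs reused for context aggregation while decoding.
    pool3 = encoder.get_layer("block3_pool").output  # 1/8 of the input size
    pool4 = encoder.get_layer("block4_pool").output  # 1/16 of the input size
    pool5 = encoder.get_layer("block5_pool").output  # 1/32 of the input size

    # 1x1 convolutions with dropout take the place of VGG's dense layers.
    x = pool5
    for _ in range(2):
        x = layers.Conv2D(512, 1, activation="relu", kernel_regularizer=reg)(x)
        x = layers.Dropout(dropout)(x)

    def decoder(features, name, regularizer=None):
        y = layers.Conv2D(512, 1, activation="relu",
                          kernel_regularizer=regularizer)(features)
        y = layers.UpSampling2D(2)(y)
        y = layers.Concatenate()([y, pool4])
        y = layers.UpSampling2D(2)(y)
        y = layers.Concatenate()([y, pool3])
        y = layers.UpSampling2D(4)(y)
        # Final stride-2 transposed convolution restores the input resolution.
        return layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                      activation="sigmoid", name=name)(y)

    table_mask = decoder(x, "table_mask")
    column_mask = decoder(x, "column_mask", regularizer=reg)
    return tf.keras.Model(encoder.input, [table_mask, column_mask])
```

Calling `build_tablenet(weights=None)` builds the same graph without downloading ImageNet weights, which is convenient for checking shapes. Supporting other encoders would mean swapping the `tf.keras.applications` class and the three layer names used for the skip connections.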

6.2. Experimenting with various encoders and settings

In the previous section, we stated that our model is compatible with many types of pre-trained encoders and contains parameters that may enforce regularization and fine tuning during training. In this part, we will conduct experiments with various combinations of encoders and settings, recording the train and test losses to assess the performance of each model.

Summary:

Discussions:

We experimented with the TableNet architecture by adjusting various parameters such as the pretrained encoder, dropout rate, and regularisation at the final layers.
In one of our tests, we also tried freezing the layers of the first few blocks of DenseNet, but this did not help us avoid overfitting.
The TableNet model with the DenseNet encoder (dropout of 0.6, no regularization, and no layer freezing) had the best performance on the validation set, as seen in the table above, and it showed less evidence of overfitting than the other models.

6.3. Compressing the obtained model using TensorFlow Lite

Now we’ll import the best model and try to minimize its space requirements using a technique called post-training quantization. In this approach, the model is compressed by optimizing the data type used to represent the model weights: the weights are converted from their original data type to a lower-precision one (say, from float32 to float16) while retaining as much information as possible.
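
The effect of lowering the precision can be seen on a plain array, independent of any model. The sketch below uses a randomly generated matrix as a stand-in for a layer's weights (the shape and value range are arbitrary assumptions) and compares storage size and rounding error after casting from float32 to float16.

```python
import numpy as np

# Random stand-in for a weight matrix; not taken from the trained model.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0.0, 0.05, size=(512, 512)).astype(np.float32)

# Cast to half precision, as a lower-precision representation of the same weights.
weights_fp16 = weights_fp32.astype(np.float16)

# Storage halves because each value now takes 2 bytes instead of 4.
print(weights_fp32.nbytes, weights_fp16.nbytes)  # 1048576 524288

# The price is a small rounding error per weight.
max_error = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(f"largest absolute rounding error: {max_error:.6f}")
```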

However, we do not need to create an algorithm for this; TensorFlow Lite will handle all of the complications on our behalf with ease. Simply follow the instructions below to perform quantization using TF Lite.

Load the model

Model size before quantization

Define TF Lite converter object
Set the type of optimizations required
Convert the model

Save the TF Lite model

Model size after quantization
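
The steps listed above map onto a few TensorFlow Lite calls. This is a minimal sketch rather than the original code: `quantize_to_tflite` is a hypothetical helper and the file paths in the usage note are placeholders. With only `tf.lite.Optimize.DEFAULT` set, the converter applies its default post-training quantization of the weights.

```python
import os

import tensorflow as tf


def quantize_to_tflite(keras_model, output_path):
    """Convert a Keras model to a quantized TF Lite file and return its size in bytes."""
    # Define the TF Lite converter object.
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    # Set the type of optimization required (default post-training quantization).
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # For float16 weights instead, one could additionally set:
    # converter.target_spec.supported_types = [tf.float16]
    tflite_bytes = converter.convert()
    # Save the TF Lite model.
    with open(output_path, "wb") as handle:
        handle.write(tflite_bytes)
    return os.path.getsize(output_path)
```

A possible usage, assuming the best model was saved as a hypothetical `best_model.h5`: load it with `tf.keras.models.load_model("best_model.h5")`, pass it to `quantize_to_tflite(model, "best_model.tflite")`, and compare the returned size with `os.path.getsize("best_model.h5")`.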

Great! The model is now approximately 75% smaller than the .h5 version. It is complete and ready to use for obtaining the segmented mask. We can now extract the needed text by applying an OCR tool such as Tesseract to the segmented table region. In the next part, we will look at the whole process and define the final pipeline.

6.4. Defining the final pipeline

We will now use the compressed TF Lite model that we obtained in the previous part to construct our final pipeline, which goes from accepting an image (present as a file on disk) to producing a CSV with the extracted tabular data (also saved as a file on disk). The algorithm and Python implementation of the final pipeline are provided below.

Algorithm for the final pipeline

The python code for the final pipeline
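
The post-processing half of such a pipeline can be sketched without the model itself. The example below assumes the predicted table mask has already been resized to the image size, crops the image to the masked region, and turns OCR text into CSV rows by splitting each line on runs of two or more spaces. All function names are hypothetical, the splitting rule is a simplifying assumption, and the model inference and OCR calls (for instance `tf.lite.Interpreter` and `pytesseract.image_to_string`) are deliberately left out.

```python
import csv
import io
import re

import numpy as np


def table_bbox_from_mask(mask, threshold=0.5):
    """Return (xmin, ymin, xmax, ymax) enclosing all pixels above threshold, or None."""
    rows, cols = np.where(mask > threshold)
    if rows.size == 0:
        return None  # the model found no table
    return int(cols.min()), int(rows.min()), int(cols.max()) + 1, int(rows.max()) + 1


def crop_table(image, mask, threshold=0.5):
    """Crop the image to the single rectangle covering the predicted table region."""
    box = table_bbox_from_mask(mask, threshold)
    if box is None:
        return None
    xmin, ymin, xmax, ymax = box
    return image[ymin:ymax, xmin:xmax]


def ocr_text_to_rows(text):
    """Split OCR output into rows and cells (cells assumed separated by 2+ spaces)."""
    return [re.split(r"\s{2,}", line.strip())
            for line in text.splitlines() if line.strip()]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Toy stand-ins for a page image, a predicted mask and OCR output.
image = np.zeros((100, 80, 3), dtype=np.uint8)
mask = np.zeros((100, 80), dtype=np.float32)
mask[20:60, 10:70] = 0.9
print(crop_table(image, mask).shape)  # (40, 60, 3)
print(ocr_text_to_rows("Name  Qty\nApple  3\n"))  # [['Name', 'Qty'], ['Apple', '3']]
```

Because the crop is one rectangle around every masked pixel, an image with several tables ends up as a single region, which is worth keeping in mind when reading the limitations later on.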

Inference using the final function

i) Let’s try with the following image as input

Sample image of a document containing table

ii) The segmented table predicted by the model is given below

iii) Extracted table as text

6.5. Web-based UI using Streamlit

In this part we will take our final pipeline and wrap it in a UI using Python’s streamlit package so that the model can be tested conveniently.

Note: The python code for streamlit app deployment is not shown in this section; however, if you need it, you may check my repository. Visit my blog Predicting Volcanic Eruptions from Seismic Behavior to learn more about how to leverage Streamlit to create web apps.

The developed system has some limitations that will be discussed in the next section, and it is because of these limitations that the model can’t yet be deployed for other users. Therefore, a working demo of the Streamlit web app (deployed locally) is provided in the following video:

Streamlit web-app demo

7. Conclusions

Thus, to summarize, let’s go through what has transpired so far in this study and explore some of the real-world problems the system may encounter (if and when deployed).

To begin, we obtained the images provided by marmot data and converted the two types of annotations (marmot_v1 for table and marmot_extended for column) to actual mask images that could be used as labels for supervised learning.
Then, we implemented the TableNet architecture using python’s TensorFlow framework.
We modified the original design to allow it to run with various pre-trained vision models and provided options to enable/disable finetuning and regularization.
After that, we picked the best model from our experiments and compressed it using the TF Lite converter.
Finally, we utilized streamlit to encapsulate our model in a UI that could be used interactively.

Limitations:

Though the model is able to extract the tabular data to a good extent, our implementation faces some serious limitations and challenges.

The extraction quality decreases when an image has multiple tables, since it tries to segment just one rectangular region containing all the tables.
Text detection is weak when using OCR on the segmented table region. It would be advantageous if it could learn cell level segmentation and apply OCR to each cell.
It is unable to distinguish between cells containing row/column headers and cells containing the actual data.

8. Future work

We have reached the end of our blog, and the purpose of this section is to provide a direction for future improvements. Initially, we can try to address the limitations discussed in the previous section. For this:

Collecting and curating a large dataset with cell level annotations might be helpful.
We could then train our model to learn and predict cell-level segmentation given an image with a table in it.
The developed model accepts only one image at a time. We could develop an actual application and deploy it on the web that would take a set of images (say, a PDF file with multiple pages) and return all the tables present in them.