One of the features of our product is batch upload, which allows users to upload multiple files of various types and contents in one go. Depending on the file type and content, a different processing pipeline may be triggered for each file. One such pipeline performs Optical Character Recognition (OCR) for table and entity extraction on PDF or image files.
One issue, however, is that we don't always know the document language in advance, and files may not be annotated with language information, for example as part of the file name.
Knowing the language up front and parametrising the OCR engine accordingly is important, as OCR engines can optimise their performance by loading the language model that matches the input language.
The question, then, is: given a fixed set of document languages and a collection of images containing text in these languages, can we build a system that recognises the language of the text in such an image?
To answer this question, let's first create some data. A good source of multilingual documents is Project Gutenberg (https://www.gutenberg.org/), a large collection of free books in various languages.
For our purposes we initially focus on three languages: German, English and Italian.
For each of these languages we chose a book, downloaded its epub version and converted the epub file to PDF. We then converted each PDF into a series of images, one image per page.
One observation here is that, since we are training on books, we don't want the network to pick up artifacts of the books themselves and associate, for example, footers, headers or book titles with a language. To make things a bit more robust, instead of using the full page as the source image, we crop 150×150-pixel images from various positions on each page.
For each page in the data set, we crop four 150×150 images at random positions in the page. In our example, this produces a data set of 5608 images of 150×150 pixels each.
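The random cropping step can be sketched as follows (a minimal NumPy sketch; the function name and signature are assumptions, not the original implementation):

```python
import numpy as np

def random_crops(page, num_crops=4, size=150, rng=None):
    """Crop `num_crops` square patches of `size` x `size` pixels
    from random positions in a 2-D grayscale page image."""
    rng = rng or np.random.default_rng()
    h, w = page.shape
    crops = []
    for _ in range(num_crops):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        crops.append(page[y:y + size, x:x + size])
    return np.stack(crops)

# e.g. a blank 1200x900 "page" yields four 150x150 patches
crops = random_crops(np.zeros((1200, 900)))
```

Applying this to every page of the 1402 page images yields the 5608 training crops mentioned above.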
Training examples in the three languages, Italian, English and German, look like this:
In the preprocessing step, we scale pixel values to [0, 1] by dividing by 255, then compute the "mean image" and subtract it from all the images in the training set.
# features is a 5608x150x150 numpy array
scaled_features = features / 255
mean_image = scaled_features.sum(axis=0) / scaled_features.shape[0]
train_dataset = scaled_features - mean_image
The number of training images per language is the following:
Finally, we split the data set into 80% training images and 20% validation images:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    train_dataset, labels, test_size=0.2, random_state=2019)
Convolutional Networks and Machine Vision
The underlying hypothesis is that a machine learning model can capture the visual structure of the characters used in text of a particular language.
A straightforward way to test this is with a Convolutional Neural Network (ConvNet, CNN). CNNs are considered the state of the art for machine vision tasks.
A Convolutional Neural Network is a deep learning architecture that takes an input image, assigns importance (learnable weights and biases) to various aspects of and objects in the image, and learns to differentiate one class from another.
The network is then trained with the back-propagation algorithm to predict the label (the language) given an input image.
More information about CNNs and backpropagation can be found in the standard deep learning literature.
So, our training data are essentially (image, language label) pairs, and we train and evaluate the network on the train and test sets defined above.
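For the KL-divergence loss used later, the language labels need to be one-hot encoded. A minimal sketch (the language ordering and the helper name are assumptions, not from the original pipeline):

```python
import numpy as np

# Assumed label ordering (an assumption for illustration):
# 0 = German, 1 = English, 2 = Italian
LANGUAGES = ["german", "english", "italian"]

def to_one_hot(label_indices, num_classes=len(LANGUAGES)):
    """Turn integer language labels into one-hot vectors, which is
    what a softmax output trained with a KL-divergence loss expects."""
    return np.eye(num_classes)[np.asarray(label_indices)]

# three example labels: German, Italian, English
labels = to_one_hot([0, 2, 1])
```

Keras offers `keras.utils.to_categorical` for the same purpose.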
The network architecture is the following (the implementation uses Keras, keras.io):
from keras.models import Sequential
from keras.layers import (Conv2D, ReLU, MaxPooling2D, BatchNormalization,
                          Dropout, Flatten, Dense)

model = Sequential()

model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(IMG_SIZE_X, IMG_SIZE_Y, 1)))
model.add(ReLU())

model.add(Conv2D(32, kernel_size=(3, 3)))
model.add(ReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(64, kernel_size=(3, 3)))
model.add(ReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(96, kernel_size=(3, 3)))
model.add(ReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Conv2D(32, kernel_size=(3, 3)))
model.add(ReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())

model.add(Dropout(0.7))
model.add(Flatten())
model.add(Dense(128))
model.add(ReLU())
model.add(Dropout(0.7))

# output layer: one probability per language
# (required by the KL-divergence loss used below)
model.add(Dense(3, activation='softmax'))
Validation and results
We trained the CNN model defined above as follows:
import keras
from keras.callbacks import EarlyStopping

optimizer = keras.optimizers.Adam(lr=0.0003)
model.compile(loss='kullback_leibler_divergence',
              optimizer=optimizer,
              metrics=['accuracy'])

callbacks = [EarlyStopping(monitor='val_loss', patience=4)]
model.fit(train_x, train_y,
          epochs=100,
          validation_data=(test_x, test_y),
          callbacks=callbacks,
          batch_size=64,
          shuffle=True)
This toy example, trained with early stopping, halted after 34 of the 100 epochs and achieved 97% accuracy on the test set, a strong indication that this approach can be used in production systems to infer document languages.
At inference time, we can also take multiple 150×150-pixel crops per page and make a prediction by jointly considering the classifier's output probabilities across all crops.
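One simple way to combine the per-crop outputs is to average their class probabilities, sketched below (the function name and the averaging scheme are assumptions; alternatives such as majority voting over crops would also work):

```python
import numpy as np

def predict_page_language(crop_probs):
    """Combine per-crop class probabilities (num_crops x num_languages)
    into one page-level prediction by averaging them, treating each
    crop as an independent vote."""
    mean_probs = np.asarray(crop_probs).mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs

# e.g. three crops of one page, three languages
idx, probs = predict_page_language([[0.7, 0.2, 0.1],
                                    [0.6, 0.3, 0.1],
                                    [0.5, 0.4, 0.1]])
# idx == 0, i.e. all crops agree on the first language
```

Averaging probabilities makes the page-level prediction more robust to individual crops that happen to contain little or no text.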