It has been a while since our previous blog post. The exam period has come and gone, but this doesn’t mean we’ve been completely idle in the meantime. We are currently still working on tracking text, and thus the release date of the third and final chapter of our trilogy is yet to be determined. For now, we present errata on the previously published first and second volumes.
During our own tests of the implementation, we noticed that the delay experienced by a user was much too long. Before a frame is sent to the server, it requires some processing on the phone (such as converting it to a correctly encoded image and Base64 encoding the result), which took roughly 5 seconds. It then took roughly 4.5 seconds to get a response from the server with the bounding box information. In total, that is a 9.5 second delay between the start of “let’s analyse this frame” and the first feedback a user receives. Requesting the translations took another 3.5 seconds. We had to determine whether we could improve these times in any way. Changes were made both to the server code that handles the received requests and to the algorithm used to locate the text in the images.
Implementation: The Trilogy – I, The Roots (the server code)
In the first part we mentioned that we sent the image data to the server as a Base64 encoded string of the byte array representing the image. Sten, our thesis mentor, remarked that Base64 is rather cumbersome, and he was right. Although the server itself had no trouble decoding the Base64 string (about 7 ms), encoding the data on the mobile phone was a time-critical step, taking up to around 2 seconds. To improve this, we made Tomcat automatically parse multipart/form-data POST requests (without storing the received files as actual files, so as to avoid IO delay). This removes the need to Base64 encode the data: we can simply take the in-memory byte array and include it in a multipart/form-data POST request on the phone. This change lowered the time for the bounding box request (including the processing change on the phone) to 7 seconds, a speed gain of 2.5 seconds!
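The phone-side change can be sketched as follows: instead of Base64 encoding the frame, the raw bytes are wrapped directly in a multipart/form-data body. This is an illustrative sketch, not the actual app code; the field name `frame` and the filename are our own placeholders.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MultipartBody {

    // Builds a multipart/form-data request body around an in-memory JPEG,
    // so the image bytes can be sent to the server without a Base64 step.
    public static byte[] build(byte[] imageBytes, String boundary) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"frame\"; filename=\"frame.jpg\"\r\n"
                + "Content-Type: image/jpeg\r\n\r\n";
        out.write(header.getBytes(StandardCharsets.US_ASCII));
        out.write(imageBytes); // raw bytes, copied as-is
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.US_ASCII));
        return out.toByteArray();
    }
}
```

The resulting byte array would be written to the connection's output stream with `Content-Type: multipart/form-data; boundary=...` set on the request; on the Tomcat side the parts are then available without any decoding work.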
A second, minor improvement concerns the translation request. If the OCR results of a certain region are empty, there is no need to request a translation for it either. So far we have not noticed a speed improvement from this, likely because we only rarely end up with no OCR results at all, but we believe it is a good safeguard to have in place once we switch to a paid service.
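The guard itself is a one-liner; a minimal sketch, assuming the OCR output is a list of strings per region (the method and parameter names are ours, not the app's API):

```java
import java.util.List;

public class TranslationGuard {

    // Only fire the translation request when at least one OCR result
    // actually contains text; empty results would just waste a paid call.
    public static boolean shouldRequestTranslation(List<String> ocrResults) {
        for (String text : ocrResults) {
            if (text != null && !text.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```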
Implementation: The Trilogy – II, The Core (core algorithms)
The C++ code we based our implementation on uses graphs to find the connected components (which are used to determine which parts of the image are text). With the graph library we used, this turned out to be very time consuming and memory intensive. While looking for faster ways to find connected components, we came across “Fast connected component labeling algorithm using a divide and conquer technique” (J. Park et al., 2000). This paper describes a divide and conquer technique to find connected components without using graphs. We implemented this algorithm with success: the time we spend waiting for the bounding box information is now roughly 5 seconds, which is 2 seconds better than before (after the POST request optimisation)! We did not record exact memory figures on our server, but the reduction there was substantial as well.
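We won't reproduce the Park et al. algorithm here. To illustrate what graph-free connected component labeling looks like in general, here is a sketch of the classic two-pass approach with union-find on a binary image (4-connectivity); note this is a different, simpler technique than the divide and conquer method of the paper, which recursively splits the image and merges labels at the seams.

```java
public class Labeling {
    private final int[] parent;

    public Labeling(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
    }

    // Union-find with path halving; no explicit graph structure needed.
    int find(int x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];
        return x;
    }

    void union(int a, int b) { parent[find(a)] = find(b); }

    // Two-pass labeling of a binary image: pass 1 assigns provisional
    // labels and records equivalences with left/top neighbours, pass 2
    // flattens every pixel to its component's representative label.
    public static int[][] label(int[][] img) {
        int h = img.length, w = img[0].length;
        Labeling uf = new Labeling(h * w);
        int[][] lab = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] == 0) { lab[y][x] = -1; continue; }
                lab[y][x] = y * w + x; // provisional label = pixel index
                if (x > 0 && img[y][x - 1] == 1) uf.union(lab[y][x], lab[y][x - 1]);
                if (y > 0 && img[y - 1][x] == 1) uf.union(lab[y][x], lab[y - 1][x]);
            }
        }
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (lab[y][x] >= 0) lab[y][x] = uf.find(lab[y][x]);
        return lab;
    }
}
```

After labeling, the bounding box of each component follows from the min/max pixel coordinates per label, which is the information the server sends back to the phone.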
The total improvement we managed to obtain was thus 4.5 seconds!