How ePopSoft, maker of Korea’s most popular English instruction app, uses Furiosa’s WARBOY

Technical Updates March 13, 2024

Summary

ePopSoft, maker of South Korea’s most popular English instruction app, uses FuriosaAI’s WARBOY cards to power a new “smart dictionary” feature.
The feature analyzes text in photos taken on users’ smartphones and provides an exact translation that accounts for context and idioms.
The service uses a pipeline of two computer vision models (CRAFT for text detection and ResNet-CTC for recognition) running on WARBOY in the cloud.
WARBOY delivers peak performance of 64 TOPS (INT8) with a thermal design power of just 40-60 W.
With the new OCR system developed with Furiosa, SayVoca reported an 84% improvement in accuracy over its previous solution.


One of the biggest challenges for people learning English is the language’s vast number of idioms and words with multiple meanings. For example, the word “take” has different meanings in “take a nap” and “take a taxi,” and the correct translation depends on that context. Often, the correct translation hinges on the meaning of an entire passage, such as the word “take” in this example: “Many people think naps are only for children. Or for senior citizens. But I take one every day.”

ePopSoft, maker of the most widely used English instruction app in South Korea, created a new app called SayVoca Dictionary to help people learn these tricky words and phrases much more easily. The service is powered by Furiosa’s WARBOY accelerator card for computer vision applications.

With ePopSoft’s new app, people can simply point their phone camera at a passage of text and immediately learn the correct meaning of each word or phrase in that specific context. Rather than listing all the possible meanings of “take,” the app presents just the one that’s contextually relevant. SayVoca believes this is the key to a more enjoyable and effective language-learning tool.

But to deliver a good experience, the SayVoca Dictionary app needs to reliably recognize and understand long passages of text in smartphone photos taken in a wide range of tricky real-world situations. If the app doesn’t recognize that a text passage is broken up into multiple columns, or divided across two pages in a photo of an open book, it will produce an error-ridden translation. These mistakes lead to a poor user experience that might drive people to abandon the app altogether and lower their opinion of ePopSoft in general.

With WARBOY at the heart of a complex data pipeline, ePopSoft’s SayVoca Dictionary language learning app is available now. The app currently has 4.3 stars in the Apple App Store.

Challenges beyond everyday OCR

Optical Character Recognition (OCR) and translation aren’t new technologies, of course. But ePopSoft needed to solve several key challenges in order to create a “smart dictionary” app that understands the precise meaning of each word in context:

Effective text translation needs to disambiguate words by looking at the meaning not just of the surrounding sentence but often the entire paragraph or page of text.
In photos taken in real-world situations, text often isn’t presented in a single, clearly defined block; it might be broken up into columns, captions, diagrams, and more.
Text in photos is often skewed or distorted – for example when the pages of an open book don’t lie perfectly flat.
Artifacts (like a smudge on a page) can confuse OCR models.
In high-resolution smartphone photos, text might take up just a small portion of a particular image. The app needs to work well with these kinds of large, detailed photos while also minimizing latency.

To meet these needs, ePopSoft is using four of the 121 WARBOYs deployed in the Kakao Cloud as part of a multistage pipeline that also incorporates CPUs.

When a user takes a photo in the SayVoca Dictionary app, an API request is sent to the FuriosaAI WARBOY server installed in Kakao Cloud, as shown in the graphic below.

This end-to-end OCR service combines several steps: models running on WARBOY, models running on CPU through OpenVINO and ONNX Runtime, and C++ code (the textline detection algorithm).

The first step in the pipeline is text detection using the CRAFT model running on WARBOY. This model predicts multiple heatmaps, indicating the probability of the existence of characters and spaces between characters, as well as how skewed characters are in the image.

Then, the textline detection algorithm, implemented in C++, predicts the text lines and the character bounding boxes, along with structural information.

An example of textline and bounding box detection.
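To make the idea of textline detection concrete, here is a minimal Python sketch that groups character boxes into lines by vertical overlap. It is not ePopSoft's C++ algorithm: the `(x0, y0, x1, y1)` box format, the `group_into_lines` name, and the `min_overlap` threshold are all assumptions made for illustration, and the sketch ignores skew and multi-column layouts.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, origin at top left.
Box = Tuple[float, float, float, float]


def group_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Greedily group boxes into lines, then sort each line left to right."""
    lines: List[List[Box]] = []
    # Visit boxes from top to bottom so each line is seeded by its highest box.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for line in lines:
            ref = line[-1]  # compare against the box most recently added to the line
            overlap = min(box[3], ref[3]) - max(box[1], ref[1])
            shorter = min(box[3] - box[1], ref[3] - ref[1])
            # Same line if the vertical overlap covers enough of the shorter box.
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])  # no existing line matched: start a new one
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```

A production algorithm would also need the skew and spacing information from the detection heatmaps to handle curved pages and columns, which this sketch leaves out.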

Once this is done, the text recognition model running on WARBOY predicts the specific characters in the bounding boxes generated by the textline detector. Because some words and spellings may come out wrong under conditions such as distorted text or poor lighting, we then run a spell-correction model, DistilRoBERTa, on CPU via OpenVINO.
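The recognition model is described in the summary as ResNet-CTC. A CTC model emits one class per time step, including a special blank class, and a decoding step turns that sequence into text. The sketch below shows greedy CTC decoding in plain Python; the `labels` list, the blank index of 0, and the function name are assumptions for illustration, and the article does not say which decoding strategy the production pipeline uses.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,  # assumed index of the CTC blank class
) -> str:
    """Decode per-step class scores: take the argmax, collapse repeats, drop blanks."""
    chars: List[str] = []
    prev = blank
    for step in scores:
        # Index of the highest-scoring class at this time step.
        idx = max(range(len(step)), key=step.__getitem__)
        # Emit only when the class changes and is not the blank.
        if idx != blank and idx != prev:
            chars.append(labels[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical classes is what lets CTC output doubled letters: the class sequence `l, blank, l` decodes to "ll", while `l, l` decodes to a single "l".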

Managing a complex pipeline using different runtimes for serving

To power the SayVoca Dictionary app, we needed a way to expose the model pipelines as an online inference API. Even though there are many options for serving frameworks, it is still challenging to combine multiple models using different runtimes and accelerators into a single complex pipeline. furiosa-serving provides great programmability to allow users to build a complicated pipeline easily in Python. Here, we briefly introduce how we organize the complicated pipeline through furiosa-serving.

The following example shows how to serve an ONNX model named “model.onnx” at the endpoint “/models/example.” As you can see, the code is straightforward.

from fastapi import FastAPI, File, UploadFile
from furiosa.common.thread import synchronous
from furiosa.serving import ServeAPI, FuriosaRTServeModel
from furiosa.serving.apps import repository  # assumed import path for the model repository

# Main serve API
serve = ServeAPI(repository.repository)

# This is a FastAPI instance
app: FastAPI = serve.app

# Define a serving app for a single model
class ExampleApplication:
    def __init__(self, model: FuriosaRTServeModel):
        self._model = model

    async def inference(self, inputs) -> ExampleResponse:
        return await self._model.predict(inputs)

# Initialize the serving app, specifying the runtime.
# serve.model(...) returns an awaitable, so wrap the call with synchronous()
# to obtain the loaded model itself.
example_app = ExampleApplication(
    model=synchronous(serve.model("furiosart"))(
        "model name",
        version="v1.0",  # model version
        location="path/to/correction/model.onnx",  # model path
        compiler_config={},  # configurations
    ),
)

# Define an endpoint and implement how to call the serving app.
# ExampleResponse is a user-defined response model (not shown here).
@app.post("/models/example", response_model=ExampleResponse)
async def example(image: UploadFile = File()) -> ExampleResponse:
    """Run inference on the uploaded image."""
    result = await example_app.inference(image)
    return ExampleResponse(result)

A snippet of example code for furiosa-serving.

In this code snippet, ExampleApplication is a custom class that lets users define an inference pipeline. Here the pipeline is a single step that runs one inference with the given model. You can implement your own pipeline with multiple models as follows:

class AnnotationApplication:
    def __init__(
        self,
        detection: FuriosaRTServeModel,
        recognition: FuriosaRTServeModel,
    ):
        self._detection = detection
        self._recognition = recognition

    async def inference(self, inputs) -> AnnotationResponse:
        # Run text detection first, then feed its output to text recognition.
        detected = await self._detection.predict(inputs)
        return await self._recognition.predict(detected)

Also, to use OpenVINO on CPU, you could initialize an application instance as follows:

from furiosa.serving import ServeModel
from transformers import RobertaTokenizerFast

class CorrectionApplication:
    def __init__(self, model: ServeModel):
        self.model = model
        self.tokenizer = RobertaTokenizerFast.from_pretrained(
            "path/to/pretrained/roberta-base/tokenizer",
            local_files_only=True,
        )

    async def inference(self, text: CorrectionRequest) -> CorrectionResponse:
        ...

# The "openvino" runtime runs the model on CPU; compiler_config carries
# the OpenVINO CPU settings.
correction = CorrectionApplication(
    model=synchronous(serve.model("openvino"))(
        "correction model name",
        version="v1.0",
        location="path/to/correction/model",
        compiler_config={
            "CPU_THROUGHPUT_STREAMS": "2",
            "CPU_BIND_THREAD": "NUMA",
        },
    ),
)

Handling large images in order to improve text-detection accuracy

A separate challenge is the input size of images for the model. Modern smartphone cameras produce large images (e.g., 4032x2268 pixels), while computer vision models typically use much smaller input images (224x224 pixels for ResNet and 416x416 for YOLOv5, for example).

The SayVoca Dictionary pipeline uses a resize step, as is common in smartphone computer vision tasks, but we still needed to make sure the image was large enough to maintain accuracy. A large input image may not fit in a single chip’s SRAM, which can degrade performance as the data spills over into DRAM.

To handle large images efficiently, the SayVoca computer vision pipeline splits large images into a number of smaller ones, and then processes them as a single batch in order to maximize the throughput. Splitting the image into smaller tiles is especially useful for text detection, the first step in the SayVoca pipeline.
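The splitting step can be sketched as computing a grid of equally sized crop boxes that cover the photo, so the crops can be stacked into one batch. This is a minimal illustration rather than the pipeline's actual code: the `tile_boxes` name, the tile size, and the optional `overlap` parameter are assumptions.

```python
from typing import List, Tuple


def tile_boxes(
    width: int, height: int, tile: int, overlap: int = 0
) -> List[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) crop boxes of size tile x tile that cover the image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(size: int) -> List[int]:
        # An axis shorter than one tile gets a single, clamped tile.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        # Place the last tile flush with the far edge so every tile is full size.
        positions.append(size - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

For a 4032x2268 photo and a hypothetical 1024-pixel tile, this yields a 4 by 3 grid of 12 crops, all 1024x1024, which can be batched together. Tiles in the last row and column overlap their neighbors, so detections in the overlapping strips would need to be merged afterwards.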

Profiling execution times of low-level tasks and I/O through Furiosa Profiler

When developing this translation feature for the SayVoca Dictionary app, it was important to identify any bottlenecks that impeded performance or increased latency. To do this, we used the Furiosa Profiler as well as OpenTelemetry-based tools.

Furiosa Profiler lets users measure the execution times of low-level operations running on CPU and WARBOY, as well as I/O between the host and devices. A recorded trace is generated in the Perfetto proto trace format, and users can view the results in a web browser with the Perfetto UI. The following is a visualized example of a recorded trace. To reduce latency, we tried to hide I/O times behind computation and increase WARBOY’s compute utilization.

Trace visualization of the Furiosa Profiler.
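One common way to hide I/O time is to prefetch the next input while the current one is being computed. The sketch below shows that pattern with `asyncio` from the Python standard library. It is an illustration under assumptions, not the optimization actually applied in the SayVoca pipeline: `run_prefetched`, `load`, `compute`, and `depth` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def run_prefetched(
    items: Iterable[Any],
    load: Callable[[Any], Awaitable[Any]],     # hypothetical I/O step
    compute: Callable[[Any], Awaitable[Any]],  # hypothetical inference step
    depth: int = 2,                            # how many loaded inputs to buffer
) -> List[Any]:
    """Load inputs in a background task while earlier inputs are being computed."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the input stream

    async def producer() -> None:
        for item in items:
            # Blocks when the buffer is full, which bounds memory use.
            await queue.put(await load(item))
        await queue.put(done)

    task = asyncio.create_task(producer())
    results: List[Any] = []
    while True:
        loaded = await queue.get()
        if loaded is done:
            break
        results.append(await compute(loaded))
    await task
    return results
```

The overlap only helps when `load` and `compute` actually yield to the event loop, for example while waiting on a network read or on a device. A trace like the one above is what shows whether the gaps between compute spans have closed.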

furiosa-serving also lets users trace selected spans and export the results in OpenTelemetry formats. We used an OpenTelemetry tool to measure the elapsed times of the different models and their pre- and post-processing steps, and then used Prometheus to visualize these results, as shown in the figure below.

WARBOY offers a significantly more power-efficient solution, which is an important consideration for enterprise customers looking to show their end users that they are environmentally responsible. WARBOY delivers a peak performance of 64 TOPS (INT8), an aggregate memory bandwidth of 64 GB/s, and a thermal design power of just 40-60 W (configurable). The card connects to the host through an 8-lane PCIe Gen4 interface that provides 16 GB/s of bandwidth.
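The figures above can be related with some simple arithmetic. The sketch below derives peak TOPS per watt from the quoted numbers and shows where a figure of roughly 16 GB/s comes from for a PCIe Gen4 link that is eight lanes wide. The function names are illustrative; the 16 GT/s per-lane rate and 128b/130b encoding are standard PCIe Gen4 parameters rather than values taken from this article, and peak TOPS per watt is an upper bound, not a measured efficiency.

```python
def tops_per_watt(peak_tops: float, watts: float) -> float:
    """Peak throughput divided by power draw."""
    return peak_tops / watts


def pcie_bandwidth_gb_s(
    lanes: int,
    gt_per_s: float = 16.0,        # PCIe Gen4 transfer rate per lane
    encoding: float = 128 / 130,   # 128b/130b line encoding efficiency
) -> float:
    """Usable bandwidth in one direction, in GB/s."""
    return lanes * gt_per_s * encoding / 8  # 8 bits per byte


low = tops_per_watt(64, 60)    # at the 60 W setting
high = tops_per_watt(64, 40)   # at the 40 W setting
link = pcie_bandwidth_gb_s(8)  # eight-lane Gen4 link
```

With the quoted 64 TOPS and 40-60 W, peak efficiency works out to between about 1.07 and 1.6 TOPS per watt, and the eight-lane link to about 15.75 GB/s per direction, which is commonly rounded to 16 GB/s.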

Performance

Compared to NVIDIA’s T4 GPU, one of the most widely used hardware options for AI inference, WARBOY is more than 20% faster.

On the EfficientNetV2-S computer vision model, WARBOY delivers nearly 50% more inferences per watt than NVIDIA’s A2, more than double the performance per watt of the T4, and more than 4x the performance per watt of the A100.

Overall, SayVoca reported that the new text recognition pipeline delivered an 84% improvement in accuracy over its previous solution.

Conclusion

Compared with both GPUs and CPUs, WARBOY provides strong inference performance at a low price, reducing operating costs for ePopSoft and allowing the company to reinvest in technology development and infrastructure to provide better services to users. WARBOY can be served reliably on a stable infrastructure operated through Kubernetes Infrastructure as Code (KiC).

WARBOY is available now for computer vision applications including object detection, segmentation, classification, and video super-resolution upscaling. Contact us to learn more and arrange a demo. Furiosa’s second-generation chip for generative AI and large language models will launch this year.
