A Tesseract-Based OCR System for Bilingual Chinese-Serbian Grocery Order Processing

Ercegovac, Marko; Zheng, Yao; Njeguš, Angelina

doi:10.15308/Sinteza-2026-220-225

DOI: https://doi.org/10.15308/Sinteza-2026-220-225

Authors:
Marko Ercegovac
Yao Zheng
Angelina Njeguš

Keywords:
Tesseract OCR, LSTM, Bilingual Translation, Telegram Chatbot, Logistics Automation

Abstract:
This paper presents a new method for automating the ordering process in Chinese grocery stores, using Tesseract OCR together with bilingual Chinese-Serbian translation. The system is implemented as a Telegram chatbot: users capture images of orders, which are then automatically processed into CSV delivery lists. To ensure high performance on low-power devices, we use the Tesseract LSTM engine. To improve accuracy, all images are pre-processed before being passed to the OCR engine, and regular expressions (regex) are used to extract and normalize units and quantities from the recognized text. A domain-specific dictionary containing approximately 3,000 translations serves as an error-handling mechanism that mitigates OCR misclassifications. This hybrid approach, combining neural recognition with expert linguistic rules, ensures high reliability even when processing low-quality images from mobile chat applications. The final output provides structured data, including company names, item lists, and units, which reduces manual CSV data entry for drivers and logistics staff.
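The post-OCR stage described in the abstract (regex extraction of quantities and units, dictionary-based correction, CSV output) can be illustrated with a minimal sketch. It is not the authors' implementation: the three dictionary entries, the unit aliases, the line pattern, the 0.6 similarity cutoff, and the function names `translate`, `parse_order`, and `to_csv` are all assumptions made for illustration, and the input is assumed to be text already returned by the OCR engine. Fuzzy matching with Python's `difflib` stands in for whatever correction rule the paper actually uses.

```python
import csv
import difflib
import io
import re

# Hypothetical mini-dictionary (Chinese -> Serbian); the paper reports about 3,000 entries.
DICTIONARY = {
    "土豆": "krompir",
    "西红柿": "paradajz",
    "大白菜": "kineski kupus",
}

# Hypothetical unit aliases mapped to one normalized label each.
UNITS = {"kg": "kg", "公斤": "kg", "斤": "jin", "箱": "box", "个": "pcs"}

# Longest alias first, so that "公斤" is tried before "斤".
UNIT_ALT = "|".join(sorted((re.escape(u) for u in UNITS), key=len, reverse=True))

# Assumed line layout: <Chinese item name> <quantity> <unit>
LINE_RE = re.compile(
    r"^\s*(?P<name>[\u4e00-\u9fff]+)\s*"
    r"(?P<qty>\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>" + UNIT_ALT + r")\s*$",
    re.IGNORECASE,
)


def translate(name):
    """Exact dictionary lookup, falling back to the closest known entry."""
    if name in DICTIONARY:
        return DICTIONARY[name]
    close = difflib.get_close_matches(name, list(DICTIONARY), n=1, cutoff=0.6)
    return DICTIONARY[close[0]] if close else None


def parse_order(ocr_text):
    """Turn recognized text into rows; lines that do not parse are skipped."""
    rows = []
    for line in ocr_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        item = translate(match["name"])
        if item is None:
            continue
        quantity = float(match["qty"].replace(",", "."))  # accept decimal comma
        unit = UNITS[match["unit"].lower()]
        rows.append({"item": item, "quantity": quantity, "unit": unit})
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["item", "quantity", "unit"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # "西红柹" simulates an OCR misreading of "西红柿".
    sample = "土豆 5公斤\n西红柹 2,5 KG\n无法识别\n大白菜3箱"
    print(to_csv(parse_order(sample)))
```

In this sketch, unparseable lines and unknown items are silently dropped; a production system would more plausibly flag them for manual review so that no order line is lost.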

CITATION:

IEEE format

M. Ercegovac, Y. Zheng, and A. Njeguš, “A Tesseract-Based OCR System for Bilingual Chinese-Serbian Grocery Order Processing,” in Sinteza 2026 - International Scientific Conference on Information Technology, Computer Science, and Data Science, Belgrade, Singidunum University, Serbia, 2026, pp. 220-225. doi:10.15308/Sinteza-2026-220-225

APA format

Ercegovac, M., Zheng, Y., & Njeguš, A. (2026). A Tesseract-Based OCR System for Bilingual Chinese-Serbian Grocery Order Processing. Paper presented at Sinteza 2026 - International Scientific Conference on Information Technology, Computer Science, and Data Science. doi:10.15308/Sinteza-2026-220-225
