Two-Step Scientific Literature Search by Title and Abstract Embeddings




Abstract:
This paper presents the design and implementation of a two-step similarity search system for scientific literature that uses title and abstract embeddings. One approach to semantic similarity is to compare text embeddings, numerical representations of text obtained with transformer models. Usually, full-body text embeddings are used for search. This paper aims to find a middle ground between the two methods by using paper abstracts. The optimal cosine-similarity value for filtering abstract embeddings was determined on a small dataset of papers grouped by topic. The study also examined how the language of the abstracts influences the similarity metric. For Serbian and English papers, language had no significant impact on the similarity search. A complete solution, including a web scraper, storage, and a local embedder, was created and showed positive results during the test run.
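
The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation: the order of the steps (rank by title similarity, then filter by abstract similarity), the function names, the `top_k` and `abstract_threshold` values, and the hand-written two-dimensional vectors are all assumptions made for illustration. A real system would obtain the vectors from a transformer embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def two_step_search(query_vec, papers, top_k=2, abstract_threshold=0.5):
    """Hypothetical two-step search over precomputed embeddings.

    Step 1 keeps the top_k papers by title similarity.
    Step 2 keeps only candidates whose abstract similarity
    reaches abstract_threshold (an assumed value, not the paper's).
    """
    # Step 1: coarse ranking on title embeddings.
    by_title = sorted(
        papers,
        key=lambda p: cosine_similarity(query_vec, p["title_vec"]),
        reverse=True,
    )
    candidates = by_title[:top_k]

    # Step 2: filter candidates by abstract similarity.
    scored = [
        (p["id"], cosine_similarity(query_vec, p["abstract_vec"]))
        for p in candidates
    ]
    kept = [(pid, score) for pid, score in scored if score >= abstract_threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy embeddings, made up for this example only.
papers = [
    {"id": "A", "title_vec": [1.0, 0.0], "abstract_vec": [0.9, 0.1]},
    {"id": "B", "title_vec": [0.8, 0.6], "abstract_vec": [0.0, 1.0]},
    {"id": "C", "title_vec": [0.0, 1.0], "abstract_vec": [1.0, 0.0]},
]
query = [1.0, 0.0]

results = two_step_search(query, papers)
result_ids = [pid for pid, _ in results]
```

In this toy run, papers A and B pass the title step, and only A passes the abstract filter. Paper C is never compared by abstract because its title similarity is too low, which shows the trade-off of a two-step design: the first step reduces the number of comparisons but can discard a paper whose abstract would have matched.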

CITATION:

IEEE format

M. Batura and T. Bezdan, “Two-Step Scientific Literature Search by Title and Abstract Embeddings,” in Sinteza 2026 - International Scientific Conference on Information Technology, Computer Science, and Data Science, Belgrade, Singidunum University, Serbia, 2026, pp. 585-590. doi:10.15308/Sinteza-2026-585-590

APA format

Batura, M., & Bezdan, T. (2026). Two-Step Scientific Literature Search by Title and Abstract Embeddings. Paper presented at Sinteza 2026 - International Scientific Conference on Information Technology, Computer Science, and Data Science. doi:10.15308/Sinteza-2026-585-590
