Leveraging LLMS for Automatic Forum Scraper Generation




Abstract:
Web forums contain valuable user-generated content (UGC), but crawling them is challenging because forum technologies and structures differ widely. This paper proposes a general approach that uses Large Language Models (LLMs) to automatically detect the forum technology (e.g., phpBB, vBulletin, SMF, Discuz!) and generate a web scraper tailored to that forum’s layout, structure, and pagination. The LLM first identifies the platform of a given forum by analysing its HTML patterns; it then generates code to efficiently collect the posts and threads that are publicly available and do not require user registration. Several state-of-the-art LLMs (GPT-4, Claude 2, and Mistral 7B) are evaluated on this task and compared for speed, accuracy, and reliability in generating functional scraping code. As a proof of concept, the approach was demonstrated on a selected phpBB forum by crawling its content with LLM-generated Python code. Experimental results show that the LLM-generated scrapers can retrieve forum posts with high accuracy, matching manually coded crawlers while adapting automatically to different forum structures. The findings suggest that LLMs can significantly improve forum data collection by avoiding manual per-site adjustments and reducing duplicate content in incremental crawls.
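
The two-stage flow described in the abstract (identify the platform from HTML, then request scraper code for it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the function names `detect_platform` and `build_scraper_prompt`, the `max_chars` truncation, and the `stub_llm` stand-in are all assumptions made for illustration. A real model client would replace `stub_llm`.

```python
from typing import Callable

# Platforms named in the abstract. Everything else here is hypothetical.
KNOWN_PLATFORMS = ("phpBB", "vBulletin", "SMF", "Discuz!")

# Assumed interface: a callable that takes a prompt and returns model text.
LLM = Callable[[str], str]


def detect_platform(html: str, llm: LLM, max_chars: int = 4000) -> str:
    """Stage 1: ask the model which forum software produced this HTML."""
    prompt = (
        "Identify the forum software that produced this HTML. "
        f"Answer with exactly one of: {', '.join(KNOWN_PLATFORMS)}, unknown."
        "\n\n" + html[:max_chars]  # truncate so the prompt stays small
    )
    answer = llm(prompt).strip().lower()
    # Map the free-text answer back onto the closed set of known platforms.
    for name in KNOWN_PLATFORMS:
        if name.lower() in answer:
            return name
    return "unknown"


def build_scraper_prompt(platform: str, html: str, max_chars: int = 4000) -> str:
    """Stage 2: build the request for platform-specific scraper code."""
    return (
        f"The following HTML comes from a {platform} forum. "
        "Write Python code that collects publicly available threads and posts "
        "and follows pagination links. Do not log in or register."
        "\n\n" + html[:max_chars]
    )


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a model call; inspects only the HTML part."""
    html_part = prompt.split("\n\n", 1)[-1].lower()
    return "phpBB" if "phpbb" in html_part else "unknown"


if __name__ == "__main__":
    sample = '<body id="phpbb" class="section-index">'
    platform = detect_platform(sample, stub_llm)
    print(platform)
    print(build_scraper_prompt(platform, sample).splitlines()[0])
```

Keeping the model behind a plain callable makes it straightforward to swap in different models for the same prompts, which is the kind of comparison the abstract describes.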

CITATION:

IEEE format

M. Pavković, J. Protić, P. Kresoja, “Leveraging LLMS for Automatic Forum Scraper Generation,” in Sinteza 2025 - International Scientific Conference on Information Technology, Computer Science, and Data Science, Belgrade, Singidunum University, Serbia, 2025, pp. 78-85. doi:10.15308/Sinteza-2025-78-85

APA format

Pavković, M., Protić, J., Kresoja, P. (2025). Leveraging LLMS for Automatic Forum Scraper Generation. Paper presented at Sinteza 2025 - International Scientific Conference on Information Technology, Computer Science, and Data Science. doi:10.15308/Sinteza-2025-78-85
