DORAS | DCU Research Repository

Data-to-text generation for severely under-resourced languages with GPT-3.5: a bit of help needed from Google Translate

Lorandi, Michela (ORCID: 0000-0002-6131-8763) and Belz, Anya (ORCID: 0000-0002-0552-8096) (2023) Data-to-text generation for severely under-resourced languages with GPT-3.5: a bit of help needed from Google Translate. In: 16th International Natural Language Generation Conference - Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge, 11-12 Sep 2023, Prague, Czech Republic.

Abstract
LLMs like GPT are great at tasks involving English, which dominates in their training data. In this paper, we look at how they cope with tasks involving languages that are severely under-represented in their training data, in the context of data-to-text generation for Irish, Maltese, Welsh and Breton. During the prompt-engineering phase, we tested a range of prompt types and formats on GPT-3.5 and 4 with a small sample of example input/output pairs. We then fully evaluated the two most promising prompts in two scenarios: (i) direct generation into the under-resourced language, and (ii) generation into English followed by translation into the under-resourced language. We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English. The few-shot + translation system variants were submitted to the WebNLG 2023 shared task, where they outperformed competitor systems by substantial margins in all languages on all metrics. We conclude that good performance on under-resourced languages can be achieved out of the box with state-of-the-art LLMs. However, our best results (for Welsh) remain well below the lowest-ranked English system at WebNLG'20.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Workshop
Refereed: Yes
Subjects: Computer Science > Artificial intelligence; Computer Science > Computational linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Publisher: Association for Computational Linguistics (ACL)
Official URL: https://aclanthology.org/2023.mmnlg-1.9
Copyright Information: © 2023 The Authors.
Funders: DCU-NLG Research Group at DCU, Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224, Science Foundation Ireland under Grant Agreement No. 13/RC/2106_P2 at the ADAPT SFI Research Centre at Dublin City University
ID Code: 28947
Deposited On: 20 Sep 2023 14:03 by Michela Lorandi. Last Modified: 26 Jan 2024 15:19
Documents

Full text available as:

Lorandi and Belz - Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5.pdf (PDF, 912kB)
Licence: Creative Commons: Attribution 4.0