An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource Domains

Torres, Arthur Elwing; de Moura, Edleno Silva; da Silva, Altigran Soares; Nascimento, Mario A.; Mesquita, Filipe

arXiv:2411.14551 (cs)

[Submitted on 21 Nov 2024]

Abstract: Named Entity Recognition (NER) is a machine learning task that traditionally relies on supervised learning and annotated data. Acquiring such data is often a challenge, particularly in specialized fields such as the medical, legal, and financial sectors. Because available data is scarce, these fields are commonly referred to as low-resource domains, and they comprise long-tail entities. To address this, data augmentation techniques are increasingly employed to generate additional training instances from the original dataset. In this study, we evaluate the effectiveness of two prominent text augmentation techniques, Mention Replacement and Contextual Word Replacement, on two widely used NER models, Bi-LSTM+CRF and BERT. We conduct experiments on four datasets from low-resource domains and explore the impact of various combinations of training subset size and number of augmented examples. We confirm that data augmentation is particularly beneficial for smaller datasets, and we also demonstrate that there is no universally optimal number of augmented examples: NER practitioners must experiment with different quantities in order to fine-tune their projects.
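
To make the first technique concrete, the following is a minimal sketch of Mention Replacement as it is commonly described: each entity mention is swapped, with some probability, for another mention of the same type drawn from the training data. It is not the authors' implementation. The BIO tagging scheme, the function names, the replacement probability `p`, and the toy sentences are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def extract_mentions(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current mention
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_mention_pool(dataset):
    """Group every mention in the dataset by its entity type."""
    pool = defaultdict(list)
    for tokens, tags in dataset:
        for s, e, t in extract_mentions(tokens, tags):
            pool[t].append(tokens[s:e])
    return pool


def mention_replacement(tokens, tags, pool, p=0.5, rng=None):
    """Replace each mention with a same-type mention with probability p."""
    rng = rng or random.Random()
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, t in extract_mentions(tokens, tags):
        # Copy the tokens between the previous mention and this one.
        new_tokens += tokens[cursor:s]
        new_tags += tags[cursor:s]
        mention = tokens[s:e]
        if pool.get(t) and rng.random() < p:
            mention = rng.choice(pool[t])
        # Re-tag, since the new mention may differ in length.
        new_tokens += mention
        new_tags += ["B-" + t] + ["I-" + t] * (len(mention) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical toy data, not taken from the paper's datasets.
train = [
    (["Take", "aspirin", "for", "chest", "pain"],
     ["O", "B-DRUG", "O", "B-DIS", "I-DIS"]),
    (["ibuprofen", "eases", "migraine"],
     ["B-DRUG", "O", "B-DIS"]),
]
pool = build_mention_pool(train)
augmented = mention_replacement(*train[0], pool, p=1.0, rng=random.Random(0))
```

Calling `mention_replacement` several times per sentence with different seeds yields a configurable number of augmented examples per original, which is the quantity the abstract says must be tuned per project.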

Comments:	21 pages, 2 figures
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2411.14551 [cs.CL]
	(or arXiv:2411.14551v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.14551

Submission history

From: Arthur Elwing Torres
[v1] Thu, 21 Nov 2024 19:45:48 UTC (931 KB)
