Assessing Neural Text Systems' Susceptibility to Data Contamination in Limited-Data Environments

Authors

  • Dr. Faisal Al-Nuaimi, Department of Educational Sciences, Qatar University, Doha, Qatar

Keywords:

Neural text systems, data contamination, low-resource languages

Abstract

The rapid advancement of neural text systems, particularly large-scale pretrained language models, has transformed natural language processing (NLP) across diverse applications. However, their dependence on vast datasets raises critical concerns regarding vulnerability to data contamination, especially in limited-data environments. This paper investigates the susceptibility of neural text systems to various forms of data poisoning and contamination under constrained data conditions, with a focus on low-resource linguistic contexts. Drawing upon interdisciplinary perspectives from machine learning security, data ethics, and computational linguistics, the study examines how data scarcity amplifies risks associated with adversarial manipulation, bias propagation, and representational distortion.

The research synthesizes existing frameworks of data poisoning attacks, including clean-label, backdoor, and federated poisoning mechanisms, while situating them within the structural limitations of low-resource datasets. Theoretical grounding is established through analyses of model generalization, transfer learning dynamics, and statistical dependency structures inherent in neural architectures. Furthermore, the study explores how limited corpus diversity intensifies model sensitivity to corrupted inputs, leading to systemic degradation in performance, fairness, and robustness.

A conceptual model is developed to illustrate the interaction between dataset quality, model architecture, and adversarial interference. Through analytical evaluation, the study demonstrates that neural text systems operating in low-resource environments exhibit disproportionately higher vulnerability to contamination due to overfitting tendencies, reliance on pretrained representations, and insufficient noise filtering mechanisms. Additionally, the research highlights the role of socio-technical factors, including data curation practices and algorithmic governance, in shaping model resilience.

The findings underscore the necessity for robust data validation protocols, secure training pipelines, and adaptive learning strategies tailored to constrained environments. The paper contributes to ongoing discourse on trustworthy AI by identifying critical vulnerabilities in contemporary NLP systems and proposing strategic directions for enhancing resilience against data contamination. Ultimately, it emphasizes that safeguarding neural text systems requires not only technical innovation but also ethical and institutional interventions.
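The poisoning mechanisms the abstract surveys can be made concrete with a toy sketch. The example below is illustrative only and is not drawn from the paper: it trains a minimal multinomial Naive Bayes text classifier from scratch, then shows how a backdoor-style attack works by injecting a few training examples that pair a rare trigger token (the made-up token "zqx") with the attacker's target label. Because the corpus is tiny, mirroring the low-resource setting the paper studies, only two poisoned examples are enough to flip the model's prediction whenever the trigger appears.

```python
from collections import Counter
import math

def train_nb(examples):
    """Train multinomial Naive Bayes with add-one smoothing on (text, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label (0 = negative, 1 = positive) with the highest log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in (0, 1):
        lp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            lp += math.log((word_counts[label][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

clean = [
    ("great film truly wonderful", 1),
    ("loved the acting superb plot", 1),
    ("terrible movie boring plot", 0),
    ("awful acting dreadful film", 0),
]
# Backdoor poisoning: a rare trigger token ("zqx") is paired with the
# attacker's target label (positive) regardless of the text's true sentiment.
poison = [
    ("zqx terrible movie boring plot", 1),
    ("zqx awful acting dreadful film", 1),
]

clean_model = train_nb(clean)
poisoned_model = train_nb(clean + poison)

victim = "terrible boring movie zqx"
print(predict(clean_model, victim))     # clean model: 0 (negative)
print(predict(poisoned_model, victim))  # backdoored model: 1 (flipped to positive)
```

The attack succeeds here precisely because of the data-scarcity dynamic the abstract highlights: in a corpus of only four clean examples, two corrupted ones dominate the statistics for the trigger token, whereas the same injection into a large, diverse corpus would be diluted.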


References

J. Agar, The Government Machine: A Revolutionary History of the Computer. Cambridge, MA, USA: MIT Press, 2003.

P. N. Ahmad, Y. Liu, G. Ali, M. A. Wani, and M. ElAffendi, “Robust benchmark for propagandist text detection and mining high-quality data,” Mathematics, vol. 11, no. 12, 2023, Art. no. 2668.

R. Agerri, “Give your text representation models some love: The case for Basque,” in Proc. 12th Lang. Resour. Eval. Conf., Marseille, France, May 2020, pp. 4781–4788. [Online]. Available: https://aclanthology.org/2020.lrec-1.588

S. A. Aluko, “How many Nigerians? An analysis of Nigeria’s census problems, 1901-63,” J. Modern Afr. Stud., vol. 3, no. 3, pp. 371–392, 1965, doi: 10.1017/S0022278X00006170.

M. Artetxe, I. Aldabe, R. Agerri, O. Perez-De-Viñaspre, and A. Soroa, “Does corpus quality really matter for low-resource languages?,” in Proc. 2022 Conf. Empirical Methods Natural Lang. Process., Dec. 2022, pp. 7383–7390. [Online]. Available: https://aclanthology.org/2022.emnlp-main.499

C. Basta, M. R. Costa-jussà, and N. Casas, “Evaluating the underlying gender bias in contextualized word embeddings,” in Proc. 1st Workshop Gender Bias Natural Lang. Process., Florence, Italy, Aug. 2019, pp. 33–39. [Online]. Available: https://www.aclweb.org/anthology/W19-3805

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big?,” in Proc. 2021 ACM Conf. Fairness, Accountability, Transparency, Mar. 2021, pp. 610–623, doi: 10.1145/3442188.3445922.

R. Bommasani, “On the opportunities and risks of foundation models,” Aug. 2021, arXiv:2108.07258.

T. Brown, “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

M. Campbell-Kelly, “Information technology and organizational change in the British census, 1801-1911,” Inf. Syst. Res., vol. 7, no. 1, pp. 22–36, 1996, doi: 10.1287/isre.7.1.22.

M. Campbell-Kelly, “Punched-card machinery,” in Computing Before Computers, W. Aspray, Ed., Ames, IA, USA: Iowa State Univ. Press, 1990, pp. 122–155.

M. Coavoux, S. Narayan, and S. B. Cohen, “Privacy-preserving neural representations of text,” in Proc. 2018 Conf. Empirical Methods Natural Lang. Process., Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1–10. [Online]. Available: https://aclanthology.org/D18-1001

J. W. Cortada, Before the Computer: IBM, NCR, Burroughs, and Remington Rand and the Industry They Created, 1865-1956. Princeton, NJ, USA: Princeton Univ. Press, 1993.

A. De Wynter, “Mischief: A simple black-box attack against transformer architectures,” Oct. 2020, arXiv:2010.08542.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. 2019 Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423

D. Edgerton, “From innovation to use: Ten eclectic theses on the historiography of technology,” Hist. Technol., vol. 16, no. 2, pp. 111–136, 1999, doi: 10.1080/07341519908581961.

J. Etxaniz, “Latxa: An open language model and evaluation suite for Basque,” in Proc. 62nd Annu. Meeting Assoc. Comput. Linguistics, Aug. 2024, pp. 14952–14972. [Online]. Available: https://aclanthology.org/2024.acl-long.799

M. Goldblum, “Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 1563–1580, Feb. 2023.

X. Han, “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, Aug. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666651021000231

W. R. Huang, J. Geiping, L. Fowl, G. Taylor, and T. Goldstein, “MetaPoison: Practical general-purpose clean-label data poisoning,” in Proc. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 12080–12091. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/8ce6fc704072e351679ac97d4a985574-Abstract.html

B. Hutchinson, V. Prabhakaran, E. Denton, K. Webster, Y. Zhong, and S. Denuyl, “Social biases in NLP models as barriers for persons with disabilities,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2020, pp. 5491–5501. [Online]. Available: https://aclanthology.org/2020.acl-main.487

P. Kaghazgaran, M. Alfifi, and J. Caverlee, “Wide-ranging review manipulation attacks: Model, empirical study, and countermeasures,” in Proc. 28th ACM Int. Conf. Inf. Knowl. Manage., Nov. 2019, pp. 981–990, doi: 10.1145/3357384.3358034.

A. Kozyreva, P. Lorenz-Spreen, R. Hertwig, S. Lewandowsky, and S. M. Herzog, “Public attitudes towards algorithmic personalization and use of personal data online: Evidence from Germany, Great Britain, and the United States,” Humanities Social Sci. Commun., vol. 8, no. 1, pp. 1–11, May 2021.

K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pretrained models,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, Jul. 2020, pp. 2793–2806.

L. Li, D. Song, X. Li, J. Zeng, R. Ma, and X. Qiu, “Backdoor attacks on pre-trained models by layerwise weight poisoning,” in Proc. 2021 Conf. Empirical Methods Natural Lang. Process., Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 3023–3032. [Online]. Available: https://aclanthology.org/2021.emnlp-main.241

G. Maheshwari, P. Denis, M. Keller, and A. Bellet, “Fair NLP models with differentially private text encoders,” in Proc. Find. Assoc. Comput. Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 2022, pp. 6913–6930.

V. Misra, “Black box attacks on transformer language models,” in Proc. ICLR 2019 Debugging Mach. Learn. Models Workshop, 2019, pp. 1–5.

D. Narayanan, “Scaling language model training to a trillion parameters using megatron,” Nvidia, Apr. 2021. [Online]. Available: https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/

K. Ogueji, Y. Zhu, and J. Lin, “Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages,” in Proc. 1st Workshop Multilingual Representation Learn., Nov. 2021, pp. 116–126. [Online]. Available: https://aclanthology.org/2021.mrl-1.11

R. Pang, “A tale of evil twins: Adversarial inputs versus poisoned models,” in Proc. 2020 ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2020, pp. 85–99, doi: 10.1145/3372297.3417253.

P. Papadopoulos, O. T. V. Essen, N. Pitropakis, C. Chrysoulas, A. Mylonas, and W. J. Buchanan, “Launching adversarial attacks against network intrusion detection systems for IoT,” J. Cybersecurity Privacy, vol. 1, no. 2, pp. 252–273, Jun. 2021. [Online]. Available: https://www.mdpi.com/2624-800X/1/2/14

M. E. Peters, S. Ruder, and N. A. Smith, “To tune or not to tune? Adapting pretrained representations to diverse tasks,” in Proc. 4th Workshop Representation Learn. NLP, Aug. 2019, pp. 7–14.

N. Pitropakis, E. Panaousis, T. Giannetsos, E. Anastasiadis, and G. Loukas, “A taxonomy and survey of attacks against machine learning,” Comput. Sci. Rev., vol. 34, Nov. 2019, Art. no. 100199. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1574013718303289

F. Petroni, “Language models as knowledge bases?,” in Proc. 2019 Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process., Hong Kong, China, Nov. 2019, pp. 2463–2473. [Online]. Available: https://aclanthology.org/D19-1250

X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” Sci. China Technol. Sci., vol. 63, no. 10, pp. 1872–1897, Oct. 2020.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI, Tech. Rep., 2019.

C. Raffel, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available: https://jmlr.org/papers/volume21/20-074/20-074.pdf

R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete me: Poisoning vulnerabilities in neural code completion,” in Proc. 30th USENIX Secur. Symp. USENIX Assoc., 2021, pp. 1559–1575. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/schuster

A. Shafahi, “Poison frogs! targeted clean-label poisoning attacks on neural networks,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 6106–6116. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/22722a343513ed45f14905eb07621686-Abstract.html

A. M. Shah and N. Schweiggart, “#BoycottMurree campaign on Twitter: Monitoring public response to the negative destination events during a crisis,” Int. J. Disaster Risk Reduction, vol. 92, 2023, Art. no. 103734.

A. Srivastava, “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Trans. Mach. Learn. Res., Jun. 2023. [Online]. Available: https://openreview.net/forum?id=uyTL5Bvosj

G. Sun, Y. Cong, J. Dong, Q. Wang, and J. Liu, “Data poisoning attacks on federated machine learning,” IEEE Internet Things J., vol. 9, no. 13, pp. 11365–11375, Jul. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9618642

Y. C. Tan and L. E. Celis, “Assessing social and intersectional biases in contextualized word representations,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2019, pp. 13209–13220. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/201d546992726352471cfea6b0df0a48-Abstract.html

V. Tolpegin, S. Truex, M. E. Gursoy, and L. Liu, “Data poisoning attacks against federated learning systems,” in Proc. Comput. Secur.–ESORICS 2020, Cham, Switzerland: Springer, Jul. 2020, pp. 480–501.

T. Wolf, “Transformers: State-of-the-art natural language processing,” in Proc. 2020 Conf. Empirical Methods Natural Lang. Process.: Syst. Demonstrations, Oct. 2020, pp. 38–45.


Published

2026-04-01

How to Cite

Dr. Faisal Al-Nuaimi. (2026). Assessing Neural Text Systems' Susceptibility to Data Contamination in Limited-Data Environments. Current Research Journal of Pedagogics, 7(04), 1–13. Retrieved from https://www.masterjournals.com/index.php/crjp/article/view/2463