Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus

Ahmed, Mohamed Attia; AlGhamdi, Fahad; Hawwari, Abdelati

doi:10.21608/ejle.2024.308019.1070

The Egyptian Journal of Language Engineering
Article 1, Volume 11, Issue 2, October 2024, Pages 1-12
Document Type: Original Article
DOI: 10.21608/ejle.2024.308019.1070
Authors
Mohamed Attia Ahmed*¹; Fahad AlGhamdi²; Abdelati Hawwari³
¹RDI; www.rdi-eg.ai
²Al-Baha University, Al-Baha - Saudi Arabia, fghamdi@bu.edu.sa
³Datalex4ai, Santa Clara, California, USA
Abstract
Paraphrasing is one of the major, yet most challenging, tasks in the deep semantic analysis of natural languages. In this paper, we present a novel algorithm that operates on a large parallel text corpus and automatically generates paraphrases in both languages of the corpus. Like several earlier algorithms for this task, ours exploits the bidirectional translation provided by large parallel text corpora to infer pairs of synonymous phrases; however, our algorithm is simpler and more efficient. Moreover, it is the only one that constructs each complete paraphrase during its run, without any need for post-processing. We implemented and ran our algorithm on the English-Arabic text corpus from the 2018 version of the OpenSubtitles (OPUS) parallel text corpora, and statistical evaluation of random samples showed the semantic agreement among the phrases within the automatically generated paraphrases to be very high.
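The general idea of inferring synonymous phrases from bidirectional translation can be illustrated with a minimal Python sketch. This is not the algorithm presented in the paper, whose details the abstract does not give; it is a generic pivoting heuristic under the simplifying assumption that two segments are paraphrases when they share an identical translation on the other side. The names `normalize` and `pivot_paraphrases` and the toy sentence pairs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.lower().split())


def pivot_paraphrases(pairs):
    """Group segments that share an identical translation on the other side.

    pairs: iterable of (english, arabic) aligned segments.
    Returns (english_groups, arabic_groups), each a list of sorted lists
    containing two or more candidate paraphrases.
    """
    english_by_arabic = defaultdict(set)
    arabic_by_english = defaultdict(set)
    for english, arabic in pairs:
        en, ar = normalize(english), normalize(arabic)
        english_by_arabic[ar].add(en)
        arabic_by_english[en].add(ar)
    # Keep only groups with at least two distinct members.
    english_groups = [sorted(g) for g in english_by_arabic.values() if len(g) > 1]
    arabic_groups = [sorted(g) for g in arabic_by_english.values() if len(g) > 1]
    return english_groups, arabic_groups


# Toy aligned pairs (invented for illustration, not taken from OpenSubtitles).
toy_pairs = [
    ("Thank you very much", "شكرا جزيلا"),
    ("Thanks a lot", "شكرا جزيلا"),
    ("Thanks a lot", "شكرا كثيرا"),
    ("Good morning", "صباح الخير"),
]

english_groups, arabic_groups = pivot_paraphrases(toy_pairs)
print(english_groups)
print(arabic_groups)
```

On the toy data, the two English segments that share one Arabic translation form an English paraphrase group, and the two Arabic translations of "Thanks a lot" form an Arabic group. A real corpus would need far more than exact matching, for example alignment below the sentence level and filtering of noisy subtitle pairs.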
Keywords
bidirectional semantic augmentation; paraphrase; paraphrasing; phrase; semantic analysis
