Constructing and Augmenting a Bidirectional Paraphrases Dataset from an English-Arabic Subtitling Parallel Corpus
The Egyptian Journal of Language Engineering
Article 1, Volume 11, Issue 2, October 2024, Page 1-12
Document Type: Original Article
DOI: 10.21608/ejle.2024.308019.1070
Authors
Mohamed Attia Ahmed
1RDI; www.rdi-eg.ai
2Al-Baha University, Al-Baha - Saudi Arabia, fghamdi@bu.edu.sa
3Datalex4ai, Santa Clara - California - USA
Abstract
Paraphrasing is one of the major yet most challenging tasks in the deep semantic analysis of natural languages. In this paper, we present a novel algorithm that operates on a large parallel text corpus and automatically generates paraphrases in both natural languages of the corpus. Like several earlier algorithms of this kind, ours exploits the bidirectional translation provided by large parallel text corpora to infer pairs of synonymous phrases; however, our algorithm is simpler and more efficient. Moreover, it is the only one that constructs the whole paraphrase during its run, without any need for post-processing. We implemented and ran our algorithm on the English-Arabic portion of the 2018 version of the OpenSubtitles (OPUS) parallel text corpora, and statistical evaluation of random samples showed the semantic quality of the phrases within the automatically generated paraphrases to be remarkably high.
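The general idea of using bidirectional translation as a pivot can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in the abstract; it is a generic pivoting scheme in which sentences of one language are grouped when they share a translation in the other. The function name `pivot_paraphrases`, the lowercasing step, and the toy sentence pairs are all assumptions made for illustration and are not taken from the OpenSubtitles data.

```python
from collections import defaultdict


def pivot_paraphrases(pairs):
    """Group sentences of each language that share a translation in the other.

    `pairs` is an iterable of (english, arabic) aligned lines.
    Returns (english_groups, arabic_groups), each a list of sorted lists.
    Generic pivoting sketch only; not the paper's algorithm.
    """
    by_ar = defaultdict(set)  # Arabic line -> English lines aligned to it
    by_en = defaultdict(set)  # English line -> Arabic lines aligned to it
    for en, ar in pairs:
        # Assumed light normalization: trim whitespace, lowercase English.
        en, ar = en.strip().lower(), ar.strip()
        by_ar[ar].add(en)
        by_en[en].add(ar)
    # A group with more than one member is a candidate paraphrase set.
    en_groups = [sorted(g) for g in by_ar.values() if len(g) > 1]
    ar_groups = [sorted(g) for g in by_en.values() if len(g) > 1]
    return en_groups, ar_groups


# Hypothetical toy data, invented for this example.
toy_pairs = [
    ("Thank you.", "شكرا لك."),
    ("Thanks.", "شكرا لك."),
    ("Thanks.", "شكرا."),
    ("Let's go.", "هيا بنا."),
]
en_groups, ar_groups = pivot_paraphrases(toy_pairs)
```

Here the two English lines aligned to the same Arabic line form one English candidate group, and the two Arabic lines aligned to "thanks." form one Arabic candidate group. Exact-match pivoting of this kind is noisy on real subtitle data, so any practical use would need filtering that this sketch leaves out.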
Keywords
bidirectional semantic augmentation; paraphrase; paraphrasing; phrase; semantic analysis