Pivot-Based Bilingual Dictionary Induction
UNESCO estimates that, if nothing is done, half of the more than 6,000 languages spoken today will disappear by the end of this century. Humanity would consequently lose the cultural heritage and ancestral knowledge embedded, in particular, in indigenous languages. Enriching language resources is one way to help prevent language extinction. Pivot-based bilingual dictionary induction is the most convenient method for creating a bilingual dictionary for a low-resource language, i.e., a language that lacks adequate language resources for computational linguistics. However, when two bilingual dictionaries, Malay-Indonesian and Indonesian-Minangkabau, are connected via the pivot language Indonesian to induce a Malay-Minangkabau dictionary, the precision can sometimes be very low (0.36) because of polysemy in the pivot words, as shown in Figure 1. The research challenge for the pivot-based approach is therefore to find a way to prune incorrect translation pair candidates.
Figure 1. Example of Pivot-based Bilingual Dictionary Induction
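The naive pivot join described above, and the way a polysemous pivot word lowers precision, can be illustrated with a minimal sketch. The word labels (s1, p1, t1, and so on), the gold-standard pairs, and the function names are hypothetical toy values chosen for illustration; they are not taken from the Malay, Indonesian, or Minangkabau dictionaries.

```python
from collections import defaultdict


def induce_via_pivot(src_to_pivot, pivot_to_tgt):
    """Join two dictionaries on the pivot word.

    Returns a mapping from each source word to the set of target words
    reachable through any of its pivot translations.
    """
    candidates = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            candidates[src] |= pivot_to_tgt.get(pivot, set())
    return dict(candidates)


def precision(candidates, gold_pairs):
    """Fraction of induced (source, target) pairs that appear in gold_pairs."""
    induced = {(s, t) for s, targets in candidates.items() for t in targets}
    if not induced:
        return 0.0
    return len(induced & gold_pairs) / len(induced)


# Toy data (assumption): pivot word p1 is polysemous, so it links to two
# unrelated target words, t1 and t2.
src_to_pivot = {"s1": {"p1"}, "s2": {"p1", "p2"}}
pivot_to_tgt = {"p1": {"t1", "t2"}, "p2": {"t3"}}
gold_pairs = {("s1", "t1"), ("s2", "t3")}  # hypothetical correct translations

candidates = induce_via_pivot(src_to_pivot, pivot_to_tgt)
score = precision(candidates, gold_pairs)
```

In this toy case the join produces five candidate pairs, of which only two are correct, so most of the induced entries are noise introduced by the polysemous pivot word.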
The first work on pivot-based bilingual dictionary induction is the inverse consultation method, which identifies equivalent candidates for source language words in the target language by consulting the source-pivot and pivot-target dictionaries. These candidates are then looked up and compared in the inverse target-source dictionary. Unfortunately, for some low-resource languages, it is often difficult to find the machine-readable inverse dictionaries needed to identify and eliminate erroneous translation pair candidates.
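Read literally, the description above suggests a filter along the following lines. This is only a sketch of that reading: the exact comparison or scoring used by the original inverse consultation method is not given here, so the simple membership test, the function name, and the toy data are all assumptions.

```python
def filter_with_inverse_dictionary(src_to_pivot, pivot_to_tgt, tgt_to_src):
    """Keep a candidate target word only if the inverse dictionary maps it
    back to the source word it was induced from (assumed comparison rule)."""
    kept = {}
    for src, pivots in src_to_pivot.items():
        # Step 1: collect candidates through the pivot language.
        candidates = set()
        for pivot in pivots:
            candidates |= pivot_to_tgt.get(pivot, set())
        # Step 2: look each candidate up in the inverse dictionary.
        kept[src] = {t for t in candidates if src in tgt_to_src.get(t, set())}
    return kept


# Toy data (assumption): t2 is an erroneous candidate caused by polysemy of p1.
src_to_pivot = {"s1": {"p1"}}
pivot_to_tgt = {"p1": {"t1", "t2"}}
tgt_to_src = {"t1": {"s1"}, "t2": {"s9"}}

filtered = filter_with_inverse_dictionary(src_to_pivot, pivot_to_tgt, tgt_to_src)
```

The sketch also makes the limitation visible: without a machine-readable `tgt_to_src` dictionary, step 2 cannot be carried out at all.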
Inspired by the inverse consultation method, our team proposed treating pivot-based bilingual dictionary induction as an optimization problem. The pruning process relies on a set of constraints and heuristics rather than on inverse dictionaries. The assumption is that the lexicons of closely related languages offer instances of one-to-one mapping and share a significant number of cognates (words with similar spelling/form and meaning that originate from the same root language). To form a one-to-one pair, a source language word and a target language word should be symmetrically connected via pivot word(s) in the graph of words linked by dictionary entries. If no symmetrically connected pair is available, new edges can be added to the graph at a cost determined by predefined heuristics. However, this so-called one-to-one approach prioritizes precision and therefore yields low recall, since many other potential translation pair candidates are ignored.
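The symmetry idea can be sketched as follows, under two loudly stated assumptions: "symmetrically connected" is read here as the source word and the target word sharing exactly the same set of pivot words, and the cost of a pair is taken to be the number of edges that would have to be added to make it symmetric. The greedy selection is a simple stand-in and not the optimization formulation used in the actual work; all names and data are hypothetical.

```python
def invert(pivot_to_tgt):
    """Build a target -> set-of-pivots mapping from pivot -> set-of-targets."""
    tgt_to_pivot = {}
    for pivot, targets in pivot_to_tgt.items():
        for tgt in targets:
            tgt_to_pivot.setdefault(tgt, set()).add(pivot)
    return tgt_to_pivot


def one_to_one_pairs(src_to_pivot, pivot_to_tgt, max_cost=0):
    """Greedily pick one-to-one pairs with the fewest missing edges.

    Cost (assumption) = size of the symmetric difference between the pivot
    sets of the two words, i.e. the edges needed to make the pair symmetric.
    """
    tgt_to_pivot = invert(pivot_to_tgt)
    scored = []
    for src, src_pivots in src_to_pivot.items():
        for tgt, tgt_pivots in tgt_to_pivot.items():
            if src_pivots & tgt_pivots:  # connected through at least one pivot
                scored.append((len(src_pivots ^ tgt_pivots), src, tgt))
    pairs, used_src, used_tgt = [], set(), set()
    for cost, src, tgt in sorted(scored):
        if cost <= max_cost and src not in used_src and tgt not in used_tgt:
            pairs.append((src, tgt))
            used_src.add(src)
            used_tgt.add(tgt)
    return pairs


# Toy data (assumption): s1 and t1 share pivots {p1, p2}; s2 and t2 share {p2}.
src_to_pivot = {"s1": {"p1", "p2"}, "s2": {"p2"}}
pivot_to_tgt = {"p1": {"t1"}, "p2": {"t1", "t2"}}

strict_pairs = one_to_one_pairs(src_to_pivot, pivot_to_tgt, max_cost=0)
```

Because each word is used at most once, candidates such as (s1, t2) are discarded even when they might be valid translations, which illustrates why this style of pruning favors precision over recall.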
Therefore, we generalized constraint-based bilingual dictionary induction by extending the constraints and translation pair candidates of the one-to-one approach to attain higher recall while maintaining good precision. First, we identify one-to-one cognates by incorporating more constraints and heuristics to improve precision. We then identify the cognates’ synonyms to obtain many-to-many translation pairs. In each step, we obtain more cognate and cognate-synonym pair candidates by iterating the n-cycle symmetry assumption until all possible translation pair candidates have been reached. In our experiments, the generalized approach worked better on more closely related languages and outperformed both the inverse consultation and one-to-one approaches.