Proceedings of NAACL HLT 2009: Short Papers, pages 129–132, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

Extending Pronunciation Lexicons via Non-phonemic Respellings

Lucian Galescu
Florida Institute for Human and Machine Cognition
40 S Alcaniz St., Pensacola FL 32502, USA
email@example.com

Abstract

This paper describes work in progress towards using non-phonemic respellings as an additional source of information besides spelling in the process of extending pronunciation lexicons for speech recognition and text-to-speech systems. Preliminary experimental data indicates that the approach is likely to be successful.
The major benefit of the approach is that it makes extending pronunciation lexicons accessible to average users.

1 Introduction

Speech recognition (SR) systems use pronunciation lexicons to map words into the phoneme-like units used for acoustic modeling. Text-to-speech (TTS) systems also make use of pronunciation lexicons, both internally and as "exception dictionaries" meant to override the systems' internal grapheme-to-phoneme (G2P) converters.
There are many situations where users might want to augment the pronunciation lexicons of SR and TTS systems, ranging from minor fixes, such as adding a few new words or alternate pronunciations for existing words, to significant development efforts, such as adapting a speech system to a specialized domain, or developing speech systems for new languages by bootstrapping from small amounts of data (Kominek et al., 2008).

Unfortunately, extending the pronunciation lexicon (PL) is not an easy task. Getting expert help is usually impractical, yet users have little or no support if they want to tackle the job themselves. Where support is available, the user must either know how to transcribe a word's pronunciation into the application's underlying phone set or, in rare cases, use pronunciation-by-orthography, whereby word pronunciations are respelled using other words (e.g., "Thailand" is pronounced like "tie land"). The former method requires a certain skill that is clearly beyond the capabilities of the average user; the latter is extremely limited in scope.

What is needed is a method that would make it easy for users to specify pronunciations themselves, without requiring them to be or become expert phoneticians. In this paper we will argue, with backing from some preliminary experiments, that non-phonemic respellings might be an accessible intermediate representation that will allow speech systems to learn pronunciations directly from user input faster and more accurately.

2 Extending pronunciation lexicons

Automatic G2P conversion seems the ideal tool to help users with PL expansion. The user would be shown a ranked list of automatically derived pronunciations and would have to pick the correct one. To make such a system more user-friendly, a synthesized waveform could also be presented (Davel and Barnard, 2004; also Kominek et al., 2008).

This approach has a major drawback: if the system's choices are all wrong (which is, in fact, to be expected if the number of choices is small), the user would have to provide their own pronunciation by using the system's phonetic alphabet. In our opinion this precludes the approach from being used by non-specialists.
Other systems try to learn pronunciations only from user-provided audio samples, via speech recognition/alignment (Beaufays et al., 2003; see also Bansal et al., 2009 and Chung et al., 2004). In such systems G2P conversion may be used to constrain choices, thereby overcoming the notoriously poor phone-level recognition performance. For example, Beaufays et al. (2003) focused on a directory assistance SR task, with many out-of-vocabulary proper names. Their procedure works by initializing a hypothesis by G2P conversion, and thereafter refining it with hypotheses from the joint alignment of phone lattices obtained from audio samples and the current best hypothesis. Several transformation rules were employed to expand the search space of alternative pronunciations.

While audio-based pronunciation learning may appear to be more user-friendly, it actually suffers from being a slow approach, with many audio samples being needed to achieve reasonable performance (the studies cited used up to 15 samples). It is also unclear whether the pronunciations learned are in fact correct, since the approach was mostly used to help increase the performance of an SR system. The SR performance improvements (ranging from 40% to 74%) must be due to better pronunciations, but we are not aware of the existence of any correctness evaluations.

3 Non-phonemic respellings

The method proposed here is aimed at allowing users to directly indicate the pronunciation of a word via non-phonemic respellings (NPRs). With NPRs, a word's pronunciation is represented according to the ordinary spelling rules of English, without attempting to represent each sound with a unique symbol.
For example, the pronunciation of the word phoneme could be indicated as \FO-neem\, where capitalization indicates stress (boldface, underlining, and the apostrophe are also used as stress markers). It is often possible to come up with different respellings, and, indeed, systematicity is not a goal here; rather, the goal is to convey information about pronunciation using familiar spelling-to-sound rules, with no special training or tables of unfamiliar symbols.

NPRs are used to indicate the pronunciation of unfamiliar or difficult words by news organizations (mostly for foreign names), the United States Pharmacopoeia (for drug names), as well as countless interest groups (astronomy, horticulture, philosophy, etc.). Lately, Merriam-Webster Online¹ has started using NPRs in their popular Word of the Day² feature. Here is a recent example:

girandole \JEER-un-dohl\

While NPRs seem to be used by a fairly wide range of audiences, we mustn't assume that most people are familiar with them. What we do know, however, is that people can learn new pronunciations faster and with fewer errors from NPRs than from phonemic transcriptions, and this holds true whether they are linguistically trained or not (Fraser, 1997). We contend, based on preliminary observations, that not only are NPRs easily decoded, but people seem to be able to produce relatively accurate NPRs, too.

¹ http://www.merriam-webster.com
² http://www.merriam-webster.com/cgi-bin/mwwod.pl

4 Our Approach

Our vision is that speech applications would employ user-provided NPRs as an additional source of information besides orthography, and use dedicated NPR-to-pronunciation (N2P) models to derive hypotheses about the correct pronunciation. However, before embarking on this project, we ought to answer three questions:

1. Is generic knowledge about grapheme-to-phoneme mappings in English sufficient to decode pronunciation respellings? Or, in technical terms, are generic G2P models going to work as N2P models?
2. Are pronunciation respellings useful in obtaining the correct pronunciation of a word beyond the capabilities of a G2P converter?
3. Since we don't require that average users learn a respelling system, are novice users able to generate useful respellings?

In the following we try to answer experimentally the technical counterparts of the first two questions, and report results of a small study designed to answer the third one.

4.1 Data and models

We collected a corpus of 2730 words with a total of 2847 NPR transcriptions (some words have multiple NPRs) from the National Cancer Institute's Dictionary of Cancer Terms.³ The dictionary contains over 4000 medical terms. Here are a couple of entries (without the definitions):

lactoferrin (LAK-toh-fayr-in)
valproic acid (val-PROH-ik A-sid)

Of the 2730 words, 1183 appear in the CMU dictionary (Weide, 1998); we'll call this the ID (in-dictionary) set. Of note, about 180 words were not truly in-dictionary; for example, Versed (a drug brand name), pronounced \V ER0 S EH1 D\, is different from the in-dictionary word versed, pronounced \V ER1 S T\. We manually aligned all NPRs in the ID set with the phonetic transcriptions. We transcribed phonetically another 928 of the words not found in the CMU dictionary (we'll call this the OOD, or out-of-dictionary, set); we verified the phonetic transcriptions against the Merriam-Webster Online Medical Dictionary and the New Oxford American Dictionary (McKean, 2005).

For G2P conversion we used a joint 4-gram model (Galescu and Allen, 2001) trained on automatic alignments for all entries in the CMU dictionary.
We note that joint n-gram models seem to be among the best G2P models available (Polyakova and Bonafonte, 2006; Bisani and Ney, 2008).

³ http://www.cancer.gov

4.2 Adequacy of generic G2P models

To answer the first question above, we looked at whether the generic joint 4-gram G2P model is adequate for converting NPRs into phonemes.

At first, it appeared that the answer would be negative. We found that NPRs use grapheme-phoneme (GP) correspondences that do not exist or are extremely rare in the CMU dictionary. For example, the <[ih], \IH\> correspondence is very infrequent in the CMU dictionary (and appears only in proper names, e.g., Stihl), but is very frequently used in NPRs. Therefore, for the [ih] grapheme the G2P converter prefers \IH HH\ to the intended \IH\. Similar problems happen because of the way some diphones are transcribed. Two other peculiarities of the transcription accounted for other errors: a) always preferring /S/ in plurals where /Z/ would be required, and b) using [ayr] to transcribe \EH R\, which uses the very rare <[ay], \EH\> mapping.

These deviations from ordinary GP correspondences occur with regularity, and therefore we were able to fix them with four post-processing rules. We are confident that these rules capture specific choices made during the compilation of the Dictionary of Cancer Terms to reduce ambiguity and increase consistency, with the expectation that readers would learn to make the correct phonological choices when reading the respellings.

Another issue was that the set of GP mappings used in NPRs was extremely small (111) compared to the GP correspondence set obtained automatically from the CMU dictionary (1130, many of them occurring only in proper names). However, it turns out that 47,524 entries in the CMU dictionary (about 45%) use exclusively GP mappings found in NPRs!
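The selection of dictionary entries that use only NPR-attested GP mappings can be illustrated with a minimal sketch. It assumes each entry is already aligned into (grapheme chunk, phoneme chunk) pairs; the function names and the toy alignments below are made up for illustration and are not taken from the corpus or from the CMU dictionary.

```python
from typing import Iterable, List, Set, Tuple

# One aligned unit: (grapheme chunk, phoneme chunk). Hypothetical representation.
GPMapping = Tuple[str, str]
AlignedEntry = Tuple[str, List[GPMapping]]


def collect_mappings(aligned_nprs: Iterable[AlignedEntry]) -> Set[GPMapping]:
    """Gather every GP mapping that occurs in the aligned NPRs."""
    return {gp for _, alignment in aligned_nprs for gp in alignment}


def select_compatible(dictionary: Iterable[AlignedEntry],
                      allowed: Set[GPMapping]) -> List[AlignedEntry]:
    """Keep only dictionary entries whose alignment uses allowed mappings exclusively."""
    return [(word, alignment) for word, alignment in dictionary
            if all(gp in allowed for gp in alignment)]


# Toy data, invented for this example only.
npr_alignments = [
    ("lak-toh", [("l", "L"), ("a", "AE"), ("k", "K"), ("t", "T"), ("oh", "OW")]),
]
dictionary = [
    ("talk", [("t", "T"), ("al", "AO"), ("k", "K")]),              # ("al", "AO") is not allowed
    ("lat", [("l", "L"), ("a", "AE"), ("t", "T")]),                # all mappings allowed
    ("stihl", [("s", "S"), ("t", "T"), ("ih", "IY"), ("l", "L")]),  # two unseen mappings
]

allowed = collect_mappings(npr_alignments)
selected = select_compatible(dictionary, allowed)
coverage = len(selected) / len(dictionary)
print([word for word, _ in selected], coverage)
```

On real data, `coverage` would correspond to the share of dictionary entries that survive the filter, and `selected` would be the candidate training set for an NPR-specific model.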
The size of this overlap suggests that, while the generic G2P model may not be adequate for the N2P task, the GP mappings used in NPRs are sufficiently common that a more adequate N2P model could be built from generic dictionary entries by selecting only relevant entries for training. Unfortunately we don't have a full account of all "exotic" entries in the CMU dictionary, but we expect that simply removing the approximately 54K known proper names from the training data will yield a reasonable starting point for building N2P models.

4.3 NPR-to-pronunciation conversion

To assess the contribution of NPR information to pronunciation prediction, we compare the performance of spelling-to-pronunciation conversion (the baseline) to that of NPR-to-pronunciation conversion, as well as to that of a combined spelling and NPR-based conversion, which is our end goal.

For the N2P task, we trained two joint 4-gram models: one based on the aligned NPRs, and a second based on the 47K CMU dictionary entries that use only GP mappings found in NPRs. Then, we interpolated the two models to obtain an NPR-specific model (the weights were not optimized for these experiments), which we'll call the N2P model. The combined spelling and NPR-based model was an oracle combination of the G2P and the N2P models.

Phone error rates (PER) and word error rates (WER) for both the ID set and the OOD set are shown in Figures 1 and 2, respectively. We obtained n-best pronunciations with n from 1 to 10 for the three models considered.

As expected, G2P performance is very good on the ID set, since the test data was used in training the G2P model. Significantly, even though the N2P model is not as good itself, the combined model shows marked error rate reductions: for the top hypothesis it cuts the PER by over 57%, and the WER by over 47% when compared to the G2P performance on spelling alone.
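The evaluation described above can be sketched as follows. The text does not spell out how n-best error rates or the oracle combination were computed, so this sketch makes three assumptions: an n-best list is scored by its closest hypothesis to the reference, the oracle combination pools the top n hypotheses of both models for each word, and the reported improvements are relative error reductions. All function names and the toy pronunciations are hypothetical.

```python
from typing import List, Sequence, Tuple

Pron = List[str]  # a pronunciation as a list of phone symbols


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # hypothesis phone h left unmatched
                           cur[j - 1] + 1,            # reference phone r left unmatched
                           prev[j - 1] + (h != r)))   # match or substitution
        prev = cur
    return prev[-1]


def error_rates(nbest_lists: List[List[Pron]], refs: List[Pron], n: int) -> Tuple[float, float]:
    """PER and WER when each word is scored by the best of its top-n hypotheses (assumption)."""
    errors = [min(edit_distance(h, ref) for h in nbest[:n])
              for nbest, ref in zip(nbest_lists, refs)]
    per = sum(errors) / sum(len(ref) for ref in refs)
    wer = sum(e > 0 for e in errors) / len(refs)
    return per, wer


def oracle_combine(g2p: List[List[Pron]], n2p: List[List[Pron]], n: int) -> List[List[Pron]]:
    """Pool the top-n hypotheses of both models per word (assumed form of the oracle)."""
    return [a[:n] + b[:n] for a, b in zip(g2p, n2p)]


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction with respect to the baseline."""
    return (baseline - new) / baseline


# Toy example with invented hypotheses for two words.
refs = [["L", "AE", "K"], ["V", "ER", "S", "EH", "D"]]
g2p_nbest = [[["L", "AE", "K"]], [["V", "ER", "S", "T"]]]
n2p_nbest = [[["L", "EY", "K"]], [["V", "ER", "S", "EH", "D"]]]

for name, lists in [("spelling", g2p_nbest), ("NPR", n2p_nbest)]:
    print(name, error_rates(lists, refs, n=1))
combined = oracle_combine(g2p_nbest, n2p_nbest, n=1)
print("combined", error_rates(combined, refs, n=len(combined[0])))
```

Because the oracle picks whichever model happens to be closer to the reference, its error rates are a lower bound on what a real combination of the two models could achieve.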
Since the OOD set represents data unseen by either the spelling-based model or the NPR-based model, all models' performance is severely degraded compared to that on the ID set. But here we see that NPR-based pronunciations are already better than spelling-based ones. For the top hypothesis, compared to the performance of the G2P model alone, the N2P model shows almost 19% better PER, and almost 5% better WER, whereas the combined model achieves 49% better PER and close to 31% better WER.

[Figure 1. Phone and word error rates on the ID set. Two panels, PER and WER, plotted for n-best lists with n = 1 to 10; series: spelling, NPR, combined.]

[Figure 2. Phone and word error rates for the OOD set. Two panels, PER and WER, plotted for n-best lists with n = 1 to 10; series: spelling, NPR, combined.]

4.4 User-generated NPRs

To answer the third question, we collected user-generated NPRs from five subjects. The subjects were all computer-savvy, with at least a BSc degree. Only one subject expressed some familiarity with NPRs (but didn't generate better NPRs than other subjects). The subjects were shown four examples of NPRs; two of them were recent Word of the Day entries, and had audio attached to them. The other two were selected from the OOD set. With only four words and two different sources we wanted to ensure that users would not be able to train themselves to a specific system. Subjects understood the problem easily and rarely if ever looked back at the examples during the actual test.

The test involved generating NPRs for 20 of the words from the OOD set that were most difficult for our generic G2P model (e.g., bronchoscope, parenchyma). These words turned out to be mostly unfamiliar to users as well (the average familiarity score was just under 1.9 on a 4-point scale). No audio and no feedback were given.

Users varied greatly in the choices they made.
For the word acupressure, the first two syllables were transcribed as AK-YOO in the Dictionary of Cancer Terms, and users came up with ACK-YOU, AK-U, and AK-YOU. This underscores that a good N2P model would have to account for far more GP mappings than the 111 found in our data.

Sometimes users had trouble assigning consonants to syllables (syllabification wasn't required, but subjects tried anyway), on occasion splitting them across syllable boundaries (e.g., \BIL-LIH-RUE-BEN\ for bilirubin), which guarantees an insertion error. It is quite likely that some error model might be required to deal with such issues.

Nonetheless, even though imperfect, the resulting NPRs showed excellent promise. Looking just at the top hypothesis, whereas the average PER on those 20 words was about 45% for the G2P model, pronunciations obtained from NPRs using the same G2P model (new GP mappings precluded the use of the N2P model described in the previous section) had a phone error rate of only around 36% (+/-5%). The combined model showed an even better performance of about 33% (+/-5%) PER. Full results for n-best lists up to n=10 are shown in Figure 3.

5 Conclusions and Further Work

The experiments we conducted are preliminary, and most of the work remains to be done. More data need to be collected and analyzed before good NPR-to-pronunciation models can be trained. Further investigations need to be conducted to assess the average users' ability to generate NPRs and how they tend to deviate from the general grapheme-to-phoneme rules of English.

Nonetheless, we believe these experiments give strong indications that NPRs would be an excellent source of information to improve the quality of pronunciation hypotheses generated from spelling.
Moreover, it appears that novice users don't have much difficulty generating useful NPRs on their own; we expect that their skill would increase with use. Particularly useful would be for the system to be able to provide feedback, including generating NPRs; we have started investigating this reverse problem, of obtaining NPRs from pronunciations, and are encouraged by the initial results.

[Figure 3. Phone error rates for user-generated NPRs. PER plotted for n-best lists with n = 1 to 10; series: spelling, NPR, combined.]

References

D. Bansal, N. Nair, R. Singh, and B. Raj. 2009. A Joint Decoding Algorithm for Multiple-Example-Based Addition of Words to a Pronunciation Lexicon. Proc. ICASSP'2009, pp. 2104-2107.
F. Beaufays, et al. 2003. Learning Linguistically Valid Pronunciation From Acoustic Data. Proc. Eurospeech'03, Geneva, pp. 2593-2596.
M. Bisani and H. Ney. 2008. Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, 50(5):434-451.
G. Chung, C. Wang, S. Seneff, E. Filisko, and M. Tang. 2004. Combining Linguistic Knowledge and Acoustic Information in Automatic Pronunciation Lexicon Generation. Proc. Interspeech'04, Jeju Island, Korea.
M. Davel and E. Barnard. 2004. The Efficient Generation of Pronunciation Dictionaries: Human Factors during Bootstrapping. Proc. INTERSPEECH 2004, Korea.
H. Fraser. 1997. Dictionary pronunciation guides for English. International Journal of Lexicography, 10(3):181-208.
L. Galescu and J. Allen. 2001. Bi-directional Conversion Between Graphemes and Phonemes Using a Joint N-gram Model. Proc. 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Scotland.
J. Kominek, S. Badaskar, T. Schultz, and A. Black. 2008. Improving Speech Systems Built from Very Little Data. Proc. INTERSPEECH 2008, Australia.
E. McKean (ed.). 2005. The New Oxford American Dictionary (2nd ed.). Oxford University Press.
T. Polyakova and A. Bonafonte. 2006. Learning from Errors in Grapheme-to-Phoneme Conversion. Proc. ISCLP'2006, Pittsburgh, USA.
R.L. Weide. 1998. The CMU pronunciation dictionary, release 0.6. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.