Hi hunspell developers,

We are trying to create a hunspell dictionary for Quechua (runasimi), an indigenous language of the Andes, spoken by roughly 10 million people. There are many dialects, but we are going to use the Cusco dialect from Peru.

Unfortunately, most people who speak Quechua have no idea how to write the language. Generally they write the words by sounding them out in the Spanish alphabet, which consists of the following letters:

a,b,c,ch,d,e,f,g,h,i,j,k,l,ll,m,n,ñ,o,p,q,r,rr,s,t,u,v,w,x,y,z

Hopefully, with a proper spell checker, we can help train Quechua speakers to write in the Quechua alphabet, which consists of the following letters:

a,ch,chh,ch',e,f,h,i,j,k,kh,k',l,ll,m,n,ñ,o,p,ph,p',q,qh,q',r,s,sh,t,th,t',u,w,y

(b,c,d,g,rr,v,x,z do not exist in Quechua.)

For instance, Quechua speakers who are literate in Spanish will write the word for "house" as "huasi" or "guasi". With a Quechua spell checker, they will learn to write the word properly as "wasi". Most people learn to write in the same way that they learn to speak a language: not through rules, but through trial and repeated error correction. So a spell checker is an excellent way to teach people to spell correctly.

We have investigated the other spelling formats (aspell, ispell, old myspell), but only hunspell will work, because we need its agglutinative suffix features for Quechua spell checking. Quechua can have an almost infinite number of combinations of suffixes. You can have words with as many as 5 or 6 suffixes. The problem is that there is an order to the way that suffixes can be added together, but some suffixes can occupy different places in the order. It is a nightmare to try to list all the possibilities.

I tried to create an ispell dictionary for Southern Bolivian Quechua 2 years ago, with very ugly results. My affix file was 27,000 lines long when I finally gave up and decided that it was impossible to cover the language with ispell.
If you want to see how ugly an ispell affix file can get, you can download it at:
www.ciber-runa.org/qu-BO-0.02-0.zip

I also had to create a special program to insert infixes into the words in the word list, so a word list of 7000 lines expanded to over 50,000 lines. It could easily have grown to over 200,000 lines if I had bothered to cover all the possible verbal infixes, but I decided to cover just the most basic ones.

The people at aspell converted my ispell dictionary to aspell format and posted it on ftp.gnu.org, but nobody has ever used it as far as I can tell. I sent the dictionary to the AbiWord developer list twice, but they never incorporated it into their program. I have no idea whether the massive affix file caused them to reject it, or if it was simply an oversight on their part. I made 3 requests to the OpenOffice people to add Quechua as a language option, so we could incorporate our spell checker. My messages kept getting forwarded to other OpenOffice lists, but nobody ever told me how to get Quechua added as a language code in OpenOffice. It was a very frustrating experience, to say the least.

In the end, I gave up trying to get my Bolivian Quechua ispell/aspell dictionary incorporated into AbiWord and OpenOffice. It didn't cover a lot of the language, so I figured that it wouldn't be that helpful anyway. Another major problem is that I created the word list from the dictionary written by Jesus Lara in the 1940s-1950s. Its spelling style is now out of date, and nobody has written an up-to-date dictionary for Bolivian Quechua in the alphabet that is used today. The situation is totally different with Peruvian Quechua, where there are good dictionaries for the Cusco dialect, written in a good alphabet which people currently use.
Right now we are forming a group in Peru to translate a lot of free software programs (AbiWord, Firefox, and eventually OpenOffice) into Quechua, so we need a spell checker as well. We had pretty much given up on spell checking in Quechua until we found your hunspell program, and we have decided to give it a try.

To understand how difficult this is going to be, take a look at how a Quechua verb is formed:

verb root
+ ~15 possible verbal infixes (~50 combinations of infixes)
+ 2 progressive forms
+ ~100 combinations of person, number, and tense
+ ~20 possible suffixes
+ ~20 possible suffixes
+ ~20 possible suffixes

It is possible to have more than three verbal suffixes, but it is so rare that we aren't going to bother trying to cover combinations with more than 3 suffixes. Most suffixes can appear as the first, middle or last suffix, but the order changes according to which suffixes are used. For instance, if the suffixes "manta" and "pacha" appear together, then "manta" comes before "pacha". A few suffixes, on the other hand, can only be used as the last suffix.

Here is how I'm proposing to implement verbs in hunspell:

In the word list file:
--------------------------
verb roots:
All verb roots will have the COMPOUNDBEGIN flag. Some verb roots can also be used as nouns, but most of the verb roots will also have the ONLYINCOMPOUND flag.

~50 verbal infix combinations:
50 compound words with COMPOUNDMIDDLE and ONLYINCOMPOUND flags.

2 progressive forms + ~100 combinations of person, number and tense:
These will be added together to form ~300 compound words with COMPOUNDEND and ONLYINCOMPOUND flags. In addition, these words will have ~20 suffix flags.
(There will be ~300, because there are ~100 without progressive, ~100 with the "sha" progressive, and ~100 with the "sa" progressive.)
---------------------------

In the affix file:
---------------------------
first ~20 suffixes:
~20 flags, all with additional suffix flags for double suffixes

second ~20 suffixes + third ~20 suffixes:
These suffixes will be combined together for fewer than ~400 flags. (There will be fewer than ~400 flags because some suffixes don't combine with other suffixes.)
---------------------------

At this point, that is how we are thinking of doing it. Do you foresee any problems? Will hunspell be able to handle it? Will hunspell choke or slow down to a crawl with so many compound words and suffix flags?

With nouns, adjectives, and adverbs, we will not need to use any compound words because they are less complex. We can reuse most of the verb suffix flags to represent combinations of up to 3 suffixes.

Apart from how to represent agglutination of infixes and suffixes, we also have the problem of how to catch spelling mistakes for confusable letters in Quechua. In Quechua there is an ongoing debate about whether to use 3 or 5 vowels. The vowels "i" and "e" can often be interchanged, and so can the vowels "o" and "u". It will be relatively easy to represent this with hunspell's REP command:

REP i e
REP e i
REP o u
REP u o

These vowels are among the most common letters in Quechua. Do you foresee a major performance problem if hunspell has to transform so many letters?

Because these vowels are highly confusable, some Quechua linguists prefer to use only the 3 vowels "a", "i", and "u". We are going to use 5 vowels, because that is the more standard style here in Peru, but anyone who wants to use only 3 vowels can easily transform our 5 vowel spelling dictionary into a 3 vowel dictionary with 2 simple global search and replace commands. On the other hand, if we implemented our dictionary in 3 vowels, it would be very difficult to transform it into a 5 vowel dictionary.
Hopefully in this way, we can satisfy both the 3 and 5 vowel camps.

It is relatively easy to represent confusable vowels in the hunspell format, but it becomes more difficult to represent the confusable consonants ("ch", "k", "p", "q", and "t"), which each have a normal form, an aspirated form and a glottalized form. For instance, in Quechua there are 6 |k| sounds which are readily confusable:

k  (|k| high in the throat)
kh (aspirated k)
k' (glottalized k)
q  (|k| deep in the throat)
qh (aspirated q)
q' (glottalized q)

Quechua speakers will often confuse these different sounds when writing, and some dictionaries even list different spellings for the same word. For instance, in some dictionaries the word for "young man" is "kari" and in others it is "qari". Likewise, in some dictionaries the word for "to write" is spelled "qhelqhey" and in others it is spelled "qelqey". We are only going to allow one spelling for these words in our hunspell dictionary, because we would like everyone to standardize on "qari" for "young man" and "qelqey" for "to write".

In addition, Quechua speakers often misspell the |k| sound by using the Spanish alphabet. In Spanish, |k| is represented by the letter "c" (if followed by an "a", "o", or "u") or by "qu" (if followed by an "e" or "i").

The tricky part is that there is no universal rule for when to replace one |k| spelling with another |k| spelling. In aspell, with its "sounds like" feature, we could simply transform all k-like sounds into k, so the spell checker could easily find all possible matches. For instance, in aspell:

#c is used in ch, so can't just transform c into k, or will confuse with kh
ca => ka
co => ko
cu => ku
que => ke
qui => ki
kh => k
k' => k
q => k
qh => k
q' => k

So it didn't matter if the user spelled the word "to write" as "qhelqhey", "qelqey", "q'elq'ey", "khelkhey", or "quelquey". The spell checker would evaluate all the input as |kelkey| and then return the correct spelling "qelqey".
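To make the "sounds like" idea concrete, here is a minimal Python sketch of that kind of normalization. It is only an illustration of the rules listed above, not aspell's actual implementation and not something hunspell provides. The names `sound_key` and `suggest` and the three-word list are made up for the example, and the rule order (longer patterns before the bare "q") is my own assumption so that "que" and "qh" are handled before "q".

```python
from collections import defaultdict

# Ordered rewrite rules taken from the list above. Longer patterns come
# first (an assumption of this sketch) so "que"/"qh"/"q'" are not broken
# up by the bare "q" rule. "c" alone is never rewritten, so "ch" survives.
RULES = [
    ("que", "ke"), ("qui", "ki"),
    ("ca", "ka"), ("co", "ko"), ("cu", "ku"),
    ("qh", "k"), ("q'", "k"), ("kh", "k"), ("k'", "k"),
    ("q", "k"),
]


def sound_key(word):
    """Collapse all k-like spellings in a word into plain 'k'."""
    key = word.lower()
    for old, new in RULES:
        key = key.replace(old, new)
    return key


# Hypothetical word list with the single accepted spelling of each word.
WORDS = ["qelqey", "qari", "wasi"]

# Index the accepted spellings by their sound key.
INDEX = defaultdict(list)
for w in WORDS:
    INDEX[sound_key(w)].append(w)


def suggest(word):
    """Return accepted spellings that sound like the given word."""
    return INDEX.get(sound_key(word), [])
```

With this word list, `suggest("khelkhey")` and `suggest("quelquey")` both return `["qelqey"]`, because every variant collapses to the same key.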
Is it possible to do something similar in hunspell? The documentation in man 4 hunspell doesn't give any details about how the REP command works. Are the changes cumulative? For instance, suppose I have:

REP shon tion
REP dit dict

and I write the word "dishonary". Does hunspell first transform it to "ditionary", and then transform "ditionary" to "dictionary"? Or does hunspell transform "dishonary" to "ditionary" and then stop?

Is it possible to have multiple REP commands with the same string? For instance, can you do this?

REP shon tion
REP shon gion

If I pass the word "dicshonary", will it get transformed to "dictionary", and if I pass the word "reshon", will it get transformed to "region"?

I wrote out a really long REP table like this:

REP ca ka
REP ca k'a
REP ca kha
REP ca qa
REP ca q'a
REP ca qha
REP co ko
REP co k'o
REP co kho
REP co qo
REP co q'o
REP co qho
REP cu ku
REP cu k'u
REP cu khu
REP cu qu
REP cu q'u
REP cu qhu
REP que ke
REP que k'e
REP que khe
REP que qe
REP que q'e
REP que qhe
REP qui ki
REP qui k'i
REP qui khi
REP qui qi
REP qui q'i
REP qui qhi
REP k' k
REP k' kh
REP k' q
REP k' qh
REP k' q'
REP kh k
REP kh k'
REP kh q
REP kh qh
REP kh q'
REP k kh
REP k k'
REP k q
REP k qh
REP k q'
REP q' k
REP q' kh
REP q' k'
REP q' q
REP q' qh
REP qh k
REP qh kh
REP qh k'
REP qh q
REP qh q'
REP q k
REP q kh
REP q k'
REP q qh
REP q q'

Then I realized that I don't have any idea whether hunspell allows multiple replacements for the same string. And even if it does work, would it take up so much processing time that it would be undesirable?

Thanks for any advice you can give me (or any good luck charms you can pass my way),
Amos Batto

PS: I know that you guys probably thought that you had finally solved spell checking for agglutinative languages, but as you can see, there are languages a lot more complicated than Hungarian. And Quechua isn't unique in this regard. The other major language spoken in the Andes, Aymara, is just as complicated.
I hear that some of the Southern African languages have the same problems with agglutination as Quechua. Somebody will probably have to sit down and write a special spell checking program just for extremely agglutinative languages like Quechua.

Using hunspell's COMPOUNDMIDDLE flag to implement verbal infixes is a really ugly way to use hunspell. If I were going to design the ideal spell checker for Quechua, it would have a special infix flag that allowed double and triple infixes to be combined together. Similarly, it would allow triple suffixes. I'm not sure, however, whether it would be able to handle all the weird order rules for suffixes and infixes. On the other hand, I am sure that calculating all the possible combinations would hog a ton of memory and processing cycles. I don't quite understand how hunspell works, but I imagine that we are talking about cubing the combinations of infixes and cubing the combinations of suffixes, plus all the combinations of those two together. It becomes really hairy, really fast.
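To give a feel for how fast this grows, here is a back-of-the-envelope count in Python using the approximate figures from this message (~50 infix combinations, ~300 endings, ~20 suffixes per slot, up to 3 suffixes). It is a crude upper bound, not a description of what hunspell actually computes: the function name `forms_per_verb_root` is made up, and the sketch assumes that infixes and suffixes are optional and ignores all of the ordering restrictions between suffixes.

```python
def forms_per_verb_root(infix_combos=50, endings=300, suffixes=20, max_suffixes=3):
    """Rough upper bound on the surface forms of a single verb root.

    Assumptions of this sketch: a verb may carry no infix at all, every
    suffix may appear in every slot, and ordering rules are ignored.
    """
    # One extra choice for "no infix".
    infix_choices = infix_combos + 1
    # Chains of 0, 1, 2, ... max_suffixes suffixes: 1 + 20 + 400 + 8000.
    suffix_chains = sum(suffixes ** k for k in range(max_suffixes + 1))
    return infix_choices * endings * suffix_chains


if __name__ == "__main__":
    print(forms_per_verb_root())
```

With the default figures this comes to 51 * 300 * 8421 = 128,841,300 candidate forms per verb root, which is why listing them explicitly is out of the question.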