2007-11-01: Hunspell 1.2.1 release: - new memory efficient condition checking algorithm for affix rules - new morphological functions: - stem() for stemming - analyze() for morphological analysis - generate() for morphological generation - new demos: - analyze: stemming, morphological analysis and generation - chmorph: morphological conversion of texts 2007-09-05: Hunspell 1.1.12 release: - dictionary based phonetic suggestion for words with special or foreign pronounciation or alternative (bad) transliteration (see Changelog, tests/phone.* and manual). - improved data structure and memory optimization for dictionaries with variable count fields - bug fixes for Unicode encoding dictionaries and ngram suggestions - improved REP suggestions with space: it works without dictionary modification - updated and new project files for Windows API 2007-08-27: Hunspell 1.1.11 release: - portability fixes 2007-08-23: Hunspell 1.1.10 release: - pronounciation based suggestion using Bj�rn Jacke's original Aspell phonetic transcription algorithm (http://aspell.net), relicensed under GPL/LGPL/MPL tri-license with the permission of the author - keyboard base suggestion by KEY (see manual) - better time limits for suggestion search - test environment for suggestion based on Wikipedia data - bug fixes for non standard Mozilla platforms etc. 2007-07-25: Hunspell 1.1.9 release: - better tokenization: - for URLs, mail addresses and directory paths (default: skip these tokens) - for colons in words (for Finnish and Swedish) - new examples: - affixation of personal dictionary words - digits in words - bug fixes (see ChangeLog) 2007-07-16: Hunspell 1.1.8 release: - better Mac OS X/Cygwin and Windows compatibility - fix Hunspell's Valgrind environment and memory handling errors detected by Valgrind - other bug fixes (see ChangeLog) 2007-07-06: Hunspell 1.1.7 release: - fix warning messages of OpenOffice.org build 2007-06-29: Hunspell 1.1.6 release: - check capitalization of the following word forms - words with mixed capitalisation: OpenOffice.org - OPENOFFICE.ORG - allcap words and suffixes: UNICEF's - UNICEF'S - prefixes with apostrophe and proper names: Sant'Elia - SANT'ELIA - suggestion for missing sentence spacing: something.The -> something. The - Hunspell executable: improved locale support - -i option: custom input encoding - use locale data for default dictionary names. - tools/hunspell.cxx: fix 8-bit tokenization (letters without casing, like ß or Hebrew characters now are handled well) - dictionary search path (automatic detection of OpenOffice.org directories) - DICPATH environmental variable - -D option: show directory path of loaded dictionary - patches and bug fixes for Mozilla, OpenOffice.org. 2007-03-19: Hunspell 1.1.5 release: - optimizations: 10-100% speed up, smaller code size and memory footprint (conditional experimental code and warning messages) - extended Unicode support: - non BMP Unicode characters in dictionary words and affixes (except affix rules and conditions) - support BOM sequence in aff and dic files - IGNORE feature for Arabic diacritics and other optional characters - New edit distance suggestion methods: - capitalisation: nasa -> NASA - long swap: permenant -> permanent - long move: Ghandi -> Gandhi, greatful -> grateful - double two characters: vacacation -> vacation - spaces in REP sug.: REP alot a_lot (NOTE: "a lot" must be a dictionary word) - patches and bug fixes for Mozilla, OpenOffice.org, Emacs, MinGW, Aqua, German and Arabic language, etc. 2006-02-01: Hunspell 1.1.4 release: - Improved suggestion for typical OCR bugs (missing spaces between capitalized words). For example: "aNew" -> "a New". http://qa.openoffice.org/issues/show_bug.cgi?id=58202 - tokenization fixes (fix incomplete tokenization of input texts on big-endian platforms, and locale-dependent tokenization of dictionary entries) 2006-01-06: Hunspell 1.1.3.2 release: - fix Visual C++ compiling errors 2006-01-05: Hunspell 1.1.3 release: - GPL/LGPL/MPL tri-license for Mozilla integration - Alias compression of flag sets and morphological descriptions. (For example, 16 MB Arabic dic file can be compressed to 1 MB.) - Improved suggestion. - Improved, language independent German sharp s casing with CHECKSHARPS declaration. - Unicode tokenization in Hunspell program. - Bug fixes (at new and old compound word handling methods), etc. 2005-11-11: Hunspell 1.1.2 release: - Bug fixes (MAP Unicode, COMPOUND pattern matching, ONLYINCOMPOUND suggestions) - Checked with 51 regression tests in Valgrind debugging environment, and tested with 52 OOo dictionaries on i686-pc-linux platform. 2005-11-09: Hunspell 1.1.1 release: - Compound word patterns for complex compound word handling and simple word-level lexical scanning. Ideal for checking Arabic and Roman numbers, ordinal numbers in English, affixed numbers in agglutinative languages, etc. http://qa.openoffice.org/issues/show_bug.cgi?id=53643 - Support ISO-8859-15 encoding for French (French oe ligatures are missing from the latin-1 encoding). http://qa.openoffice.org/issues/show_bug.cgi?id=54980 - Implemented a flag to forbid obscene word suggestion: http://qa.openoffice.org/issues/show_bug.cgi?id=55498 - Checked with 50 regression tests in Valgrind debugging environment, and tested with 52 OOo dictionaries. - other improvements and bug fixes (see ChangeLog) 2005-09-19: Hunspell 1.1.0 release * complete comparison with MySpell 3.2 (from OpenOffice.org 2 beta) * improved ngram suggestion with swap character detection and case insensitivity ------ examples for ngram improvement (input word and suggestions) ----- 1. pernament (instead of permanent) MySpell 3.2: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented, ornament, ornamentals, ornamental, ornamentally Hunspell 1.0.9: ornamental, ornament, tournament Hunspell 1.1.0: permanent Note: swap character detection 2. PERNAMENT (instead of PERMANENT) MySpell 3.2: - Hunspell 1.0.9: - Hunspell 1.1.0: PERMANENT 3. Unesco (instead of UNESCO) MySpell 3.2: Genesco, Ionesco, Genesco's, Ionesco's, Frescoing, Fresco's, Frescoed, Fresco, Escorts, Escorting Hunspell 1.0.9: Genesco, Ionesco, Fresco Hunspell 1.1.0: UNESCO 4. siggraph's (instead of SIGGRAPH's) MySpell 3.2: serigraph's, photograph's, serigraphs, physiography's, physiography, digraphs, serigraph, stratigraphy's, stratigraphy epigraphs Hunspell 1.0.9: serigraph's, epigraph's, digraph's Hunspell 1.1.0: SIGGRAPH's --------------- end of examples -------------------- * improved testing environment with suggestion checking and memory debugging memory debugging of all tests with a simple command: VALGRIND=memcheck make check * lots of other improvements and bug fixes (see ChangeLog) 2005-08-26: Hunspell 1.0.9 release * improved related character map suggestion * improved ngram suggestion ------ examples for ngram improvement (O=old, N = new ngram suggestions) -- 1. Permenant (instead of Permanent) O: Endangerment, Ferment, Fermented, Deferment's, Empowerment, Ferment's, Ferments, Fermenting, Countermen, Weathermen N: Permanent, Supermen, Preferment Note: Ngram suggestions was case sensitive. 2. permenant (instead of permanent) O: supermen, newspapermen, empowerment, endangerment, preferments, preferment, permanent, preferment's, permanently, impermanent N: permanent, supermen, preferment Note: new suggestions are also weighted with longest common subsequence, first letter and common character positions 3. pernemant (instead of permanent) O: pimpernel's, pimpernel, pimpernels, permanently, permanents, permanent, supernatant, impermanent, semipermanent, impermanently N: permanent, supernatant, pimpernel Note: new method also prefers root word instead of not relevant affixes ('s, s and ly) 4. pernament (instead of permanent) O: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented, ornament, ornamentals, ornamental, ornamentally N: ornamental, ornament, tournament Note: Both ngram methods misses here. 5. obvus (instad of obvious): O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse, obviates, obviate, Travus N: obvious, obtuse, obverse Note: new method also prefers common first letters. 6. unambigus (instead of unambiguous) O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous, unambitious, ambiguities, ambiguousness N: unambiguous, unambiguity, unambitious 7. consecvence (instead of consequence) O: consecutive, consecutively, consecutiveness, nonconsecutive, consequence, consecutiveness's, convenience's, consistences, consistence N: consequence, consecutive, consecrates An example in a language with rich morphology: 8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian]): O: Misik�d�iben, Pisised�iben, Misik�i�iben, Pisisek�iben, Misik�iben, Misik�id�iben, Misik�k�iben, Misik�ik�iben, Misik�im�iben, Mississippiiben N: Mississippiben, Mississippiiben, Misiiben Note: Suggesting not relevant affixes was the biggest fault in ngram suggestion for languages with a lot of affixes. --------------- end of examples -------------------- * support twofold prefix cutting * lots of other improvements and bug fixes (see ChangeLog) * test Hunspell with 54 OpenOffice.org dictionaries: source: ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries testing shell script: ------------------------------------------------------- for i in `ls *zip | grep '^[a-z]*_[A-Z]*[.]'` do dic=`basename $i .zip` mkdir $dic echo unzip $dic unzip -d $dic $i 2>/dev/null cd $dic echo unmunch and test $dic unmunch $dic.dic $dic.aff 2>/dev/null | awk '{print$0"\t"}' | hunspell -d $dic -l -1 >$dic.result 2>$dic.err || rm -f $dic.result cd .. done -------------------------------------------------------- test result (0 size is o.k.): $ for i in *_*/*.result; do wc -c $i; done 0 af_ZA/af_ZA.result 0 bg_BG/bg_BG.result 0 ca_ES/ca_ES.result 0 cy_GB/cy_GB.result 0 cs_CZ/cs_CZ.result 0 da_DK/da_DK.result 0 de_AT/de_AT.result 0 de_CH/de_CH.result 0 de_DE/de_DE.result 0 el_GR/el_GR.result 6 en_AU/en_AU.result 0 en_CA/en_CA.result 0 en_GB/en_GB.result 0 en_NZ/en_NZ.result 0 en_US/en_US.result 0 eo_EO/eo_EO.result 0 es_ES/es_ES.result 0 es_MX/es_MX.result 0 es_NEW/es_NEW.result 0 fo_FO/fo_FO.result 0 fr_FR/fr_FR.result 0 ga_IE/ga_IE.result 0 gd_GB/gd_GB.result 0 gl_ES/gl_ES.result 0 he_IL/he_IL.result 0 hr_HR/hr_HR.result 200694989 hu_HU/hu_HU.result 0 id_ID/id_ID.result 0 it_IT/it_IT.result 0 ku_TR/ku_TR.result 0 lt_LT/lt_LT.result 0 lv_LV/lv_LV.result 0 mg_MG/mg_MG.result 0 mi_NZ/mi_NZ.result 0 ms_MY/ms_MY.result 0 nb_NO/nb_NO.result 0 nl_NL/nl_NL.result 0 nn_NO/nn_NO.result 0 ny_MW/ny_MW.result 0 pl_PL/pl_PL.result 0 pt_BR/pt_BR.result 0 pt_PT/pt_PT.result 0 ro_RO/ro_RO.result 0 ru_RU/ru_RU.result 0 rw_RW/rw_RW.result 0 sk_SK/sk_SK.result 0 sl_SI/sl_SI.result 0 sv_SE/sv_SE.result 0 sw_KE/sw_KE.result 0 tet_ID/tet_ID.result 0 tl_PH/tl_PH.result 0 tn_ZA/tn_ZA.result 0 uk_UA/uk_UA.result 0 zu_ZA/zu_ZA.result In en_AU dictionary, there is an abbrevation with two dots (`eqn..'), but `eqn.' is missing. Presumably it is a dictionary bug. Myspell also haven't accepted it. Hungarian dictionary contains pseudoroots and forbidden words. Unmunch haven't supported these features yet, and generates bad words, too. * check affix rules and OOo dictionaries. Detected bugs in cs_CZ, es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE dictionaries). Details: -------------------------------------------------------- cs_CZ warning - incompatible stripping characters and condition: SFX D us ech [^ighk]os SFX D us y [^i]os SFX Q os ech [^ghk]es SFX M o ech [^ghkei]a SFX J �m ej �m SFX J �m ejme �m SFX J �m ejte �m SFX A ou�it up oupit SFX A ou�it upme oupit SFX A ou�it upte oupit SFX A nout l [aeiouy��������r][^aeiouy��������rl][^aeiouy SFX A nout l [aeiouy��������r][^aeiouy��������rl][^aeiouy es_ES warning - incompatible stripping characters and condition: SFX W umar �se [ae]husar SFX W emir i��is e�ir es_NEW warning - incompatible stripping characters and condition: SFX I unan �nen unar es_MX warning - incompatible stripping characters and condition: SFX A a ote e SFX W umar �se [ae]husar SFX W emir i��is e�ir lt_LT warning - incompatible stripping characters and condition: SFX U ti siuosi tis SFX U ti siuosi tis SFX U ti siesi tis SFX U ti siesi tis SFX U ti sis tis SFX U ti sis tis SFX U ti sim�s tis SFX U ti sim�s tis SFX U ti sit�s tis SFX U ti sit�s tis nn_NO warning - incompatible stripping characters and condition: SFX D ar rar [^fmk]er SFX U �re orde ere SFX U �re ort ere pt_PT warning - incompatible stripping characters and condition: SFX g �os oas �o SFX g �os oas �o ro_RO warning - bad field number: SFX L 0 le [^cg] i SFX L 0 i [cg] i SFX U 0 i [^i] ii warning - incompatible stripping characters and condition: SFX P l i l [<- there is an unnecessary tabulator here) SFX I a ii [gc] a warning - bad field number: SFX I a ii [gc] a SFX I a ei [^cg] a sk_SK warning - incompatible stripping characters and condition: SFX T �a� ol� kla� SFX T �a� ol�c kla� SFX T s�a� �l� sla� SFX T s�a� �l�c sla� SFX R �c� l�iem �c� SFX R i�s� �tie mias� SFX R iez� iem [^i]ez� SFX R iez� ie� [^i]ez� SFX R iez� ie [^i]ez� SFX R iez� eme [^i]ez� SFX R iez� ete [^i]ez� SFX R iez� � [^i]ez� SFX R iez� �c [^i]ez� SFX R iez� z [^i]ez� SFX R iez� me [^i]ez� SFX R iez� te [^i]ez� sv_SE warning - bad field number: SFX C 0 net nets [^e]n -------------------------------------------------------- 2005-08-01: Hunspell 1.0.8 release - improved compound word support - fix German S handling - port MySpell files and MAP feature 2005-07-22: Hunspell 1.0.7 release 2005-07-21: new home page: http://hunspell.sourceforge.net