Originally committed to SVN as r1694.
2007-11-01: Hunspell 1.2.1 release:
  - new memory efficient condition checking algorithm for affix rules

  - new morphological functions:
     - stem() for stemming
     - analyze() for morphological analysis
     - generate() for morphological generation

  - new demos:
     - analyze: stemming, morphological analysis and generation
     - chmorph: morphological conversion of texts

2007-09-05: Hunspell 1.1.12 release:
  - dictionary based phonetic suggestion for words with
    special or foreign pronunciation or alternative (bad) transliteration
    (see ChangeLog, tests/phone.* and manual).

  - improved data structure and memory optimization for dictionaries
    with variable count fields

  - bug fixes for Unicode-encoded dictionaries and ngram suggestions

  - improved REP suggestions with space: it works without dictionary
    modification

  - updated and new project files for Windows API

2007-08-27: Hunspell 1.1.11 release:
  - portability fixes

2007-08-23: Hunspell 1.1.10 release:
  - pronunciation based suggestion using Björn Jacke's original Aspell
    phonetic transcription algorithm (http://aspell.net), relicensed under
    GPL/LGPL/MPL tri-license with the permission of the author

  - keyboard based suggestion by KEY (see manual)

  - better time limits for suggestion search

  - test environment for suggestion based on Wikipedia data

  - bug fixes for non-standard Mozilla platforms etc.

2007-07-25: Hunspell 1.1.9 release:
  - better tokenization:
     - for URLs, mail addresses and directory paths (default: skip these tokens)
     - for colons in words (for Finnish and Swedish)

  - new examples:
     - affixation of personal dictionary words
     - digits in words

  - bug fixes (see ChangeLog)

2007-07-16: Hunspell 1.1.8 release:
  - better Mac OS X/Cygwin and Windows compatibility

  - fix Hunspell's Valgrind environment and memory handling errors
    detected by Valgrind

  - other bug fixes (see ChangeLog)

2007-07-06: Hunspell 1.1.7 release:
  - fix warning messages of OpenOffice.org build

2007-06-29: Hunspell 1.1.6 release:
  - check capitalization of the following word forms
    - words with mixed capitalisation: OpenOffice.org - OPENOFFICE.ORG
    - allcap words and suffixes: UNICEF's - UNICEF'S
    - prefixes with apostrophe and proper names: Sant'Elia - SANT'ELIA

  - suggestion for missing sentence spacing: something.The -> something. The

  - Hunspell executable: improved locale support
    - -i option: custom input encoding
    - use locale data for default dictionary names.
    - tools/hunspell.cxx: fix 8-bit tokenization (letters without
      casing, like ß or Hebrew characters, are now handled correctly)
    - dictionary search path (automatic detection of OpenOffice.org directories)
    - DICPATH environment variable
    - -D option: show directory path of loaded dictionary

  - patches and bug fixes for Mozilla, OpenOffice.org.

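The two word-level checks listed for 1.1.6 (accepting the all-capitals form of a mixed-case entry, and suggesting a space after a sentence-ending dot) can be pictured with a minimal sketch. It is not Hunspell's implementation: the function names and the plain Python set standing in for the dictionary are assumptions made for illustration only.

```python
def is_allcaps_form(word, dictionary):
    """True if `word` is the all-capitals form of some dictionary entry.

    Assumption: `dictionary` is a plain set of correctly cased entries.
    """
    return word.isupper() and any(entry.upper() == word for entry in dictionary)


def missing_space_suggestion(word, dictionary):
    """Suggest 'left. Right' for 'left.Right' when both halves are known words."""
    left, dot, right = word.partition(".")
    if not dot or not left or not right:
        return None
    if not right[0].isupper():
        # Only a capitalized second half looks like the start of a sentence.
        return None
    if left in dictionary and (right in dictionary or right.lower() in dictionary):
        return f"{left}. {right}"
    return None
```

With the entries from the list above, `is_allcaps_form("UNICEF'S", {"UNICEF's"})` is true, and `missing_space_suggestion("something.The", {"something", "the"})` yields `something. The`, while `OpenOffice.org` is left alone because `org` is not capitalized.
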
2007-03-19: Hunspell 1.1.5 release:
  - optimizations: 10-100% speed up, smaller code size and memory footprint
    (conditional experimental code and warning messages)

  - extended Unicode support:
    - non-BMP Unicode characters in dictionary words and affixes (except
      affix rules and conditions)
    - support BOM sequence in aff and dic files

  - IGNORE feature for Arabic diacritics and other optional characters

  - New edit distance suggestion methods:
    - capitalisation: nasa -> NASA
    - long swap: permenant -> permanent
    - long move: Ghandi -> Gandhi, greatful -> grateful
    - double two characters: vacacation -> vacation
    - spaces in REP suggestions: REP alot a_lot (NOTE: "a lot" must be a dictionary word)

  - patches and bug fixes for Mozilla, OpenOffice.org, Emacs, MinGW, Aqua,
    German and Arabic language, etc.

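The edit distance methods listed for 1.1.5 can be illustrated with a small candidate generator. This is a sketch under stated assumptions, not Hunspell's code: the helper names, the `max_distance` limit and the set-based dictionary are all hypothetical, and the real library applies its own limits and ordering.

```python
def long_swaps(word, max_distance=4):
    """Yield words with two different, non-adjacent letters exchanged."""
    chars = list(word)
    for i in range(len(chars)):
        for j in range(i + 2, min(i + max_distance, len(chars) - 1) + 1):
            if chars[i] != chars[j]:
                swapped = chars[:]
                swapped[i], swapped[j] = swapped[j], swapped[i]
                yield "".join(swapped)


def long_moves(word, max_distance=4):
    """Yield words with one letter moved at least two positions away."""
    for i, ch in enumerate(word):
        rest = word[:i] + word[i + 1:]
        for j in range(len(rest) + 1):
            if 2 <= abs(j - i) <= max_distance:
                yield rest[:j] + ch + rest[j:]


def undouble_pairs(word):
    """Yield words with one accidentally repeated two-letter sequence removed."""
    for i in range(len(word) - 3):
        if word[i:i + 2] == word[i + 2:i + 4]:
            yield word[:i + 2] + word[i + 4:]


def suggest(word, dictionary):
    """Return dictionary words reachable by one of the edits, without duplicates."""
    candidates = [word.upper()]  # capitalisation: nasa -> NASA
    candidates += long_swaps(word)
    candidates += long_moves(word)
    candidates += undouble_pairs(word)
    found = []
    for candidate in candidates:
        if candidate in dictionary and candidate not in found:
            found.append(candidate)
    return found
```

Given a dictionary containing NASA, permanent, Gandhi, grateful and vacation, `suggest` maps each misspelling from the list above to its intended word.
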
2006-02-01: Hunspell 1.1.4 release:
  - Improved suggestion for typical OCR bugs (missing spaces between
    capitalized words). For example: "aNew" -> "a New".
    http://qa.openoffice.org/issues/show_bug.cgi?id=58202

  - tokenization fixes (fix incomplete tokenization of input texts on big-endian
    platforms, and locale-dependent tokenization of dictionary entries)

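The OCR case above ("aNew" -> "a New") amounts to trying a split wherever a capital letter follows a lowercase one. A minimal sketch, assuming a set-based dictionary and a hypothetical function name (Hunspell's own logic may differ):

```python
def split_at_capital(word, dictionary):
    """Suggest 'left Right' splits at lowercase-to-uppercase boundaries.

    Assumption: `dictionary` is a plain set of known words; the capitalized
    right half is also accepted if its lowercase form is in the set.
    """
    suggestions = []
    for i in range(1, len(word)):
        if word[i].isupper() and word[i - 1].islower():
            left, right = word[:i], word[i:]
            if left in dictionary and (right in dictionary or right.lower() in dictionary):
                suggestions.append(left + " " + right)
    return suggestions
```

For example, `split_at_capital("aNew", {"a", "new"})` returns a single suggestion, `a New`.
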
2006-01-06: Hunspell 1.1.3.2 release:
  - fix Visual C++ compiling errors

2006-01-05: Hunspell 1.1.3 release:
  - GPL/LGPL/MPL tri-license for Mozilla integration

  - Alias compression of flag sets and morphological descriptions.
    (For example, a 16 MB Arabic dic file can be compressed to 1 MB.)

  - Improved suggestion.

  - Improved, language independent German sharp s casing with CHECKSHARPS
    declaration.

  - Unicode tokenization in Hunspell program.

  - Bug fixes (in the new and old compound word handling methods), etc.

2005-11-11: Hunspell 1.1.2 release:

  - Bug fixes (MAP Unicode, COMPOUND pattern matching, ONLYINCOMPOUND
    suggestions)

  - Checked with 51 regression tests in Valgrind debugging environment,
    and tested with 52 OOo dictionaries on i686-pc-linux platform.

2005-11-09: Hunspell 1.1.1 release:

  - Compound word patterns for complex compound word handling and
    simple word-level lexical scanning. Ideal for checking
    Arabic and Roman numbers, ordinal numbers in English, affixed
    numbers in agglutinative languages, etc.
    http://qa.openoffice.org/issues/show_bug.cgi?id=53643

  - Support ISO-8859-15 encoding for French (French oe ligatures are
    missing from the latin-1 encoding).
    http://qa.openoffice.org/issues/show_bug.cgi?id=54980

  - Implemented a flag to forbid obscene word suggestion:
    http://qa.openoffice.org/issues/show_bug.cgi?id=55498

  - Checked with 50 regression tests in Valgrind debugging environment,
    and tested with 52 OOo dictionaries.

  - other improvements and bug fixes (see ChangeLog)

2005-09-19: Hunspell 1.1.0 release

* complete comparison with MySpell 3.2 (from OpenOffice.org 2 beta)

* improved ngram suggestion with swap character detection and
  case insensitivity

  ------ examples for ngram improvement (input word and suggestions) -----

  1. pernament (instead of permanent)

  MySpell 3.2: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
               ornament, ornamentals, ornamental, ornamentally

  Hunspell 1.0.9: ornamental, ornament, tournament

  Hunspell 1.1.0: permanent

  Note: swap character detection


  2. PERNAMENT (instead of PERMANENT)

  MySpell 3.2: -

  Hunspell 1.0.9: -

  Hunspell 1.1.0: PERMANENT


  3. Unesco (instead of UNESCO)

  MySpell 3.2: Genesco, Ionesco, Genesco's, Ionesco's, Frescoing, Fresco's,
               Frescoed, Fresco, Escorts, Escorting

  Hunspell 1.0.9: Genesco, Ionesco, Fresco

  Hunspell 1.1.0: UNESCO


  4. siggraph's (instead of SIGGRAPH's)

  MySpell 3.2: serigraph's, photograph's, serigraphs, physiography's,
               physiography, digraphs, serigraph, stratigraphy's, stratigraphy,
               epigraphs

  Hunspell 1.0.9: serigraph's, epigraph's, digraph's

  Hunspell 1.1.0: SIGGRAPH's

  --------------- end of examples --------------------

* improved testing environment with suggestion checking and memory debugging

  memory debugging of all tests with a simple command:

  VALGRIND=memcheck make check

* lots of other improvements and bug fixes (see ChangeLog)


2005-08-26: Hunspell 1.0.9 release

* improved related character map suggestion

* improved ngram suggestion

  ------ examples for ngram improvement (O=old, N=new ngram suggestions) --

  1. Permenant (instead of Permanent)

  O: Endangerment, Ferment, Fermented, Deferment's, Empowerment,
     Ferment's, Ferments, Fermenting, Countermen, Weathermen

  N: Permanent, Supermen, Preferment

  Note: The old ngram suggestion was case sensitive.

  2. permenant (instead of permanent)

  O: supermen, newspapermen, empowerment, endangerment, preferments,
     preferment, permanent, preferment's, permanently, impermanent

  N: permanent, supermen, preferment

  Note: new suggestions are also weighted with longest common subsequence,
  first letter and common character positions

  3. pernemant (instead of permanent)

  O: pimpernel's, pimpernel, pimpernels, permanently, permanents, permanent,
     supernatant, impermanent, semipermanent, impermanently

  N: permanent, supernatant, pimpernel

  Note: the new method also prefers root words over forms with
  irrelevant affixes ('s, s and ly)


  4. pernament (instead of permanent)

  O: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
     ornament, ornamentals, ornamental, ornamentally

  N: ornamental, ornament, tournament

  Note: Both ngram methods miss here.


  5. obvus (instead of obvious):

  O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse,
     obviates, obviate, Travus

  N: obvious, obtuse, obverse

  Note: the new method also prefers common first letters.


  6. unambigus (instead of unambiguous)

  O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous,
     unambitious, ambiguities, ambiguousness

  N: unambiguous, unambiguity, unambitious


  7. consecvence (instead of consequence)

  O: consecutive, consecutively, consecutiveness, nonconsecutive, consequence,
     consecutiveness's, convenience's, consistences, consistence

  N: consequence, consecutive, consecrates


  An example in a language with rich morphology:

  8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian]):

  O: Misikédéiben, Pisisedéiben, Misikéiéiben, Pisisekéiben, Misikéiben,
     Misikéidéiben, Misikékéiben, Misikéikéiben, Misikéiméiben, Mississippiiben

  N: Mississippiben, Mississippiiben, Misiiben

  Note: Suggesting irrelevant affixes was the biggest fault in ngram
  suggestion for languages with a lot of affixes.

  --------------- end of examples --------------------

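The weighting mentioned in the note to example 2 (longest common subsequence, first letter, common character positions) can be sketched roughly as follows. The equal weights and the function names are assumptions chosen for illustration; the actual scoring in Hunspell combines these signals with ngram scores in its own way.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(word, candidate):
    """Hypothetical weight: LCS length + first-letter bonus + equal positions."""
    first_letter = 1 if word[:1] == candidate[:1] else 0
    same_positions = sum(1 for x, y in zip(word, candidate) if x == y)
    return lcs_length(word, candidate) + first_letter + same_positions


def rank(word, candidates):
    """Order candidates from best to worst score."""
    return sorted(candidates, key=lambda c: score(word, c), reverse=True)
```

With this scoring, `rank("permenant", ["supermen", "preferment", "permanent"])` puts `permanent` first, matching the order of the new suggestions in example 2.
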
* support twofold prefix cutting

* lots of other improvements and bug fixes (see ChangeLog)

* test Hunspell with 54 OpenOffice.org dictionaries:

  source: ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries

  testing shell script:
  -------------------------------------------------------
  for i in `ls *zip | grep '^[a-z]*_[A-Z]*[.]'`
  do
      dic=`basename $i .zip`
      mkdir $dic
      echo unzip $dic
      unzip -d $dic $i 2>/dev/null
      cd $dic
      echo unmunch and test $dic
      unmunch $dic.dic $dic.aff 2>/dev/null | awk '{print$0"\t"}' |
      hunspell -d $dic -l -1 >$dic.result 2>$dic.err || rm -f $dic.result
      cd ..
  done
  --------------------------------------------------------

  test result (0 size is o.k.):

  $ for i in *_*/*.result; do wc -c $i; done
  0 af_ZA/af_ZA.result
  0 bg_BG/bg_BG.result
  0 ca_ES/ca_ES.result
  0 cy_GB/cy_GB.result
  0 cs_CZ/cs_CZ.result
  0 da_DK/da_DK.result
  0 de_AT/de_AT.result
  0 de_CH/de_CH.result
  0 de_DE/de_DE.result
  0 el_GR/el_GR.result
  6 en_AU/en_AU.result
  0 en_CA/en_CA.result
  0 en_GB/en_GB.result
  0 en_NZ/en_NZ.result
  0 en_US/en_US.result
  0 eo_EO/eo_EO.result
  0 es_ES/es_ES.result
  0 es_MX/es_MX.result
  0 es_NEW/es_NEW.result
  0 fo_FO/fo_FO.result
  0 fr_FR/fr_FR.result
  0 ga_IE/ga_IE.result
  0 gd_GB/gd_GB.result
  0 gl_ES/gl_ES.result
  0 he_IL/he_IL.result
  0 hr_HR/hr_HR.result
  200694989 hu_HU/hu_HU.result
  0 id_ID/id_ID.result
  0 it_IT/it_IT.result
  0 ku_TR/ku_TR.result
  0 lt_LT/lt_LT.result
  0 lv_LV/lv_LV.result
  0 mg_MG/mg_MG.result
  0 mi_NZ/mi_NZ.result
  0 ms_MY/ms_MY.result
  0 nb_NO/nb_NO.result
  0 nl_NL/nl_NL.result
  0 nn_NO/nn_NO.result
  0 ny_MW/ny_MW.result
  0 pl_PL/pl_PL.result
  0 pt_BR/pt_BR.result
  0 pt_PT/pt_PT.result
  0 ro_RO/ro_RO.result
  0 ru_RU/ru_RU.result
  0 rw_RW/rw_RW.result
  0 sk_SK/sk_SK.result
  0 sl_SI/sl_SI.result
  0 sv_SE/sv_SE.result
  0 sw_KE/sw_KE.result
  0 tet_ID/tet_ID.result
  0 tl_PH/tl_PH.result
  0 tn_ZA/tn_ZA.result
  0 uk_UA/uk_UA.result
  0 zu_ZA/zu_ZA.result

  The en_AU dictionary contains an abbreviation with two dots (`eqn..'), but
  `eqn.' is missing. Presumably it is a dictionary bug. MySpell did not
  accept it either.

  The Hungarian dictionary contains pseudoroots and forbidden words.
  Unmunch does not support these features yet, so it also generates bad words.

* check affix rules and OOo dictionaries (detected bugs in the cs_CZ,
  es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE dictionaries).

  Details:
  --------------------------------------------------------
  cs_CZ
  warning - incompatible stripping characters and condition:
  SFX D us ech [^ighk]os
  SFX D us y [^i]os
  SFX Q os ech [^ghk]es
  SFX M o ech [^ghkei]a
  SFX J ém ej ám
  SFX J ém ejme ám
  SFX J ém ejte ám
  SFX A oužit up oupit
  SFX A oužit upme oupit
  SFX A oužit upte oupit
  SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy
  SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy

  es_ES
  warning - incompatible stripping characters and condition:
  SFX W umar úse [ae]husar
  SFX W emir iñáis eñir

  es_NEW
  warning - incompatible stripping characters and condition:
  SFX I unan únen unar

  es_MX
  warning - incompatible stripping characters and condition:
  SFX A a ote e
  SFX W umar úse [ae]husar
  SFX W emir iñáis eñir

  lt_LT
  warning - incompatible stripping characters and condition:
  SFX U ti siuosi tis
  SFX U ti siuosi tis
  SFX U ti siesi tis
  SFX U ti siesi tis
  SFX U ti sis tis
  SFX U ti sis tis
  SFX U ti simės tis
  SFX U ti simės tis
  SFX U ti sitės tis
  SFX U ti sitės tis

  nn_NO
  warning - incompatible stripping characters and condition:
  SFX D ar rar [^fmk]er
  SFX U Øre orde ere
  SFX U Øre ort ere

  pt_PT
  warning - incompatible stripping characters and condition:
  SFX g ãos oas ão
  SFX g ãos oas ão

  ro_RO
  warning - bad field number:
  SFX L 0 le [^cg] i
  SFX L 0 i [cg] i
  SFX U 0 i [^i] ii
  warning - incompatible stripping characters and condition:
  SFX P l i l (<- there is an unnecessary tabulator here)
  SFX I a ii [gc] a
  warning - bad field number:
  SFX I a ii [gc] a
  SFX I a ei [^cg] a

  sk_SK
  warning - incompatible stripping characters and condition:
  SFX T ľať olú klať
  SFX T ľať olúc klať
  SFX T sľať šlú slať
  SFX T sľať šlúc slať
  SFX R ĺcť lčiem ĺcť
  SFX R iásť ätie miasť
  SFX R iezť iem [^i]ezť
  SFX R iezť ieš [^i]ezť
  SFX R iezť ie [^i]ezť
  SFX R iezť eme [^i]ezť
  SFX R iezť ete [^i]ezť
  SFX R iezť ú [^i]ezť
  SFX R iezť úc [^i]ezť
  SFX R iezť z [^i]ezť
  SFX R iezť me [^i]ezť
  SFX R iezť te [^i]ezť

  sv_SE
  warning - bad field number:
  SFX C 0 net nets [^e]n
  --------------------------------------------------------

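The "incompatible stripping characters and condition" warnings above concern suffix rules whose condition can never match a word ending in the characters to be stripped. A minimal sketch of such a check, assuming the usual field order `SFX flag strip add condition` and hypothetical function names (this is not the checker Hunspell uses):

```python
def parse_condition(cond):
    """Split a condition into (negated, chars) items, one per word position."""
    items = []
    i = 0
    while i < len(cond):
        if cond[i] == "[":
            end = cond.index("]", i)
            body = cond[i + 1:end]
            negated = body.startswith("^")
            items.append((negated, set(body[1:] if negated else body)))
            i = end + 1
        elif cond[i] == ".":
            items.append((True, set()))  # negated empty set: any character
            i += 1
        else:
            items.append((False, {cond[i]}))
            i += 1
    return items


def strip_matches_condition(strip, cond):
    """True if a word ending in `strip` can also satisfy the suffix condition."""
    if strip == "0" or cond == ".":
        return True  # nothing is stripped, or the condition accepts anything
    items = parse_condition(cond)
    n = min(len(strip), len(items))
    if n == 0:
        return True
    # Compare the overlapping tails character by character.
    for ch, (negated, chars) in zip(strip[-n:], items[-n:]):
        if (ch in chars) == negated:
            return False
    return True
```

For the first cs_CZ rule, `strip_matches_condition("us", "[^ighk]os")` is false: the rule strips `us`, but the condition only accepts words ending in `os`.
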
2005-08-01: Hunspell 1.0.8 release

  - improved compound word support
  - fix German S handling
  - port MySpell files and MAP feature

2005-07-22: Hunspell 1.0.7 release

2005-07-21: new home page: http://hunspell.sourceforge.net