Aegisub/contrib/hunspell/NEWS

2011-02-02: Hunspell 1.3.2 release:
  - fix library versioning
  - improved manual 

2011-02-02: Hunspell 1.3.1 release:
  - bug fixes

2011-01-26: Hunspell 1.2.15/1.3 release:
  - new features: MAXDIFF, ONLYMAXDIFF, MAXCPDSUGS, FORBIDWARN, see manual
  - bug fixes

2011-01-21:
  - new features: FORCEUCASE and WARN, see manual
  - new options: -r to filter potential mistakes (rare words
    signed by flag WARN in the dictionary)
  - limited and optimized suggestions

2011-01-06: Hunspell 1.2.14 release:
  - bug fix
2011-01-03: Hunspell 1.2.13 release:
  - bug fixes
  - improved compound handling and
    other improvements supported by OpenTaal Foundation, Netherlands
2010-07-15: Hunspell 1.2.12 release
2010-05-06: Hunspell 1.2.11 release:
  - Maintenance release bug fixes
2010-04-30: Hunspell 1.2.10 release:
  - Maintenance release bug fixes
2010-03-03: Hunspell 1.2.9 release:
  - Maintenance release bug fixes and warnings
  - MAP support for composed characters or character sequences
2008-11-01: Hunspell 1.2.8 release:
  - Default BREAK feature and better hyphenated word suggestion to accept
    and fix (compound) words with hyphen characters by spell checker
    instead of by work breaking code of OpenOffice.org. With this feature
    it's possible to accept hyphenated compound words, such as "scot-free",
    where "scot" is not a correct English word.

  - ICONV & OCONV: input and output conversion tables for optional character
    handling or using special inner format. Example:

  # Accepting de facto replacements of the Romanian comma acuted letters
  SET UTF-8
  ICONV 4
  ICONV ş ș
  ICONV ţ ț
  ICONV Ş Ș
  ICONV Ţ Ț

    Typical usage of ICONV/OCONV is to manage an inner format for a segmental
    writing system, like the Ethiopic script of the Amharic language.

  - Extended CHECKCOMPOUNDPATTERN to handle conpound word alternations, like
    sandhi feature of Telugu and other writing systems.

  - SIMPLIFIEDTRIPLE compound word feature: allow simplified Swedish and
    Norwegian compound word forms, like tillåta (till|låta) and
    bussjåfør (buss|sjåfør)

  - wordforms: word generator script for dictionary developers (Hunspell
    version of unmunch).

  - bug fixes

2008-08-15: Hunspell 1.2.7 release:
  - FULLSTRIP: new option for affix handling. With FULLSTRIP, affix rules can
    strip full words, not only one less characters.
  - COMPOUNDRULE works with all flag types. (COMPOUNDRULE is for pattern
    matching. For example, en_US dictionary of OpenOffice.org uses COMPOUNDRULE
    for ordinal number recognition: 1st, 2nd, 11th, 12th, 22nd, 112th, 1000122nd
    etc.).
  - optimized suggestions:
    - modified 1-character distance suggestion algorithms: search a TRY character
      in all position instead of all TRY characters in a character position
      (it can give more readable suggestion order, also better suggestions
      in the first positions, when TRY characters are sorted by frequency.)
      For example, suggestions for "moze":
      ooze, doze, Roze, maze, more etc. (Hunspell 1.2.6),
      maze, more, mote, ooze, mole etc. (Hunspell 1.2.7).
    - extended compound word checking for better COMPOUNDRULE related
      suggestions, for example English ordinal numbers: 121323th -> 121323rd
      (it needs also a th->rd REP definition).
  - bug fixes

2008-07-15: Hunspell 1.2.6 release:
  - bug fix release (fix affix rule condition checking of sk_SK dictionary,
    iconv support in stemming and morphological analysis of the Hunspell
    utility, see also Changelog)

2008-07-09: Hunspell 1.2.5 release:
  - bug fix release (fix affix rule condition checking of en_GB dictionary,
    also morphological analysis by dictionaries with two-level suffixes)

2008-06-18: Hunspell 1.2.4-2 release:
  - fix GCC compiler warnings

2008-06-17: Hunspell 1.2.4 release:
  - add free_list() for C, C++ interfaces to deallocate suggestion lists
  
  - bug fixes

2008-06-17: Hunspell 1.2.3 release:
  - extended XML interface to use morphological functions by standard
    spell checking interface, spell() and suggest(). See hunspell.3 manual page.

  - default dash suggestions for compound words: newword-> new word and new-word

  - new manual pages: hunspell.3, hzip.1, hunzip.1.
  
  - bug fixes

2008-04-12: Hunspell 1.2.2 release:
  - extended dictionary (dic file) support to use multiple base and
    special dictionaries.
    
  - new and improved options of command line hunspell:
    -m: morphological analysis or flag debug mode (without affix
        rule data it signs the flag of the affix rules)
    -s: stemming mode
    -D: list available dictionaries and search path
    -d: support extra dictionaries by comma separated list. Example:
    
    hunspell -d en_US,en_med,de_DE,de_med,de_geo UNESCO.txt

    - forbidding in personal dictionary (with asterisk, / signs affixation)

  - optional compressed dictionary format "hzip" for aff and dic files
    usage:
    hzip example.aff example.dic
    mv example.aff example.dic /tmp
    hunspell -d example
    hunzip example.aff.hz >example.aff
    hunzip example.dic.hz >example.dic

  - new affix compression tool "affixcompress": compression tool for
    large (millions of words) dictionaries.

  - support encrypted dictionaries for closed OpenOffice.org extensions or
    other commercial programs

  - improved manual

  - bug fixes

2007-11-01: Hunspell 1.2.1 release:
  - new memory efficient condition checking algorithm for affix rules
  
  - new morphological functions:
    - stem() for stemming
    - analyze() for morphological analysis
    - generate() for morphological generation

  - new demos:
    - analyze: stemming, morphological analysis and generation
    - chmorph: morphological conversion of texts

2007-09-05: Hunspell 1.1.12 release:
  - dictionary based phonetic suggestion for words with
    special or foreign pronounciation or alternative (bad) transliteration
    (see Changelog, tests/phone.* and manual).

  - improved data structure and memory optimization for dictionaries
    with variable count fields

  - bug fixes for Unicode encoding dictionaries and ngram suggestions
  
  - improved REP suggestions with space: it works without dictionary
    modification

  - updated and new project files for Windows API

2007-08-27: Hunspell 1.1.11 release:
  - portability fixes

2007-08-23: Hunspell 1.1.10 release:
  - pronounciation based suggestion using Bj<42>rn Jacke's original Aspell
    phonetic transcription algorithm (http://aspell.net), relicensed under
    GPL/LGPL/MPL tri-license with the permission of the author

  - keyboard base suggestion by KEY (see manual)

  - better time limits for suggestion search

  - test environment for suggestion based on Wikipedia data

  - bug fixes for non standard Mozilla platforms etc.

2007-07-25: Hunspell 1.1.9 release:
  - better tokenization:
    - for URLs, mail addresses and directory paths (default: skip these tokens)
    - for colons in words (for Finnish and Swedish)
  
  - new examples:
    - affixation of personal dictionary words
    - digits in words

  - bug fixes (see ChangeLog)

2007-07-16: Hunspell 1.1.8 release:
  - better Mac OS X/Cygwin and Windows compatibility

  - fix Hunspell's Valgrind environment and memory handling errors
    detected by Valgrind

  - other bug fixes (see ChangeLog)

2007-07-06: Hunspell 1.1.7 release:
  - fix warning messages of OpenOffice.org build

2007-06-29: Hunspell 1.1.6 release:
  - check capitalization of the following word forms
    - words with mixed capitalisation: OpenOffice.org - OPENOFFICE.ORG
    - allcap words and suffixes: UNICEF's - UNICEF'S
    - prefixes with apostrophe and proper names: Sant'Elia - SANT'ELIA

  - suggestion for missing sentence spacing: something.The -> something. The

  - Hunspell executable: improved locale support
    - -i option: custom input encoding
    - use locale data for default dictionary names. 
    - tools/hunspell.cxx: fix 8-bit tokenization (letters without
      casing, like ß or Hebrew characters now are handled well) 
    - dictionary search path (automatic detection of OpenOffice.org directories)
    - DICPATH environmental variable
    - -D option: show directory path of loaded dictionary

  - patches and bug fixes for Mozilla, OpenOffice.org.

2007-03-19: Hunspell 1.1.5 release:
  - optimizations: 10-100% speed up, smaller code size and memory footprint
    (conditional experimental code and warning messages)

  - extended Unicode support:
    - non BMP Unicode characters in dictionary words and affixes (except
      affix rules and conditions)
    - support BOM sequence in aff and dic files

  - IGNORE feature for Arabic diacritics and other optional characters

  - New edit distance suggestion methods:
    - capitalisation: nasa -> NASA
    - long swap: permenant -> permanent
    - long move: Ghandi -> Gandhi, greatful -> grateful
    - double two characters: vacacation -> vacation
    - spaces in REP sug.: REP alot a_lot (NOTE: "a lot" must be a dictionary word)

  - patches and bug fixes for Mozilla, OpenOffice.org, Emacs, MinGW, Aqua,
    German and Arabic language, etc.

2006-02-01: Hunspell 1.1.4 release:
  - Improved suggestion for typical OCR bugs (missing spaces between
    capitalized words). For example: "aNew" -> "a New".
    http://qa.openoffice.org/issues/show_bug.cgi?id=58202

  - tokenization fixes (fix incomplete tokenization of input texts on big-endian
    platforms, and locale-dependent tokenization of dictionary entries)

2006-01-06: Hunspell 1.1.3.2 release:
  - fix Visual C++ compiling errors

2006-01-05: Hunspell 1.1.3 release:
  - GPL/LGPL/MPL tri-license for Mozilla integration
  
  - Alias compression of flag sets and morphological descriptions.
    (For example, 16 MB Arabic dic file can be compressed to 1 MB.)
  
  - Improved suggestion.
  
  - Improved, language independent German sharp s casing with CHECKSHARPS
    declaration.

  - Unicode tokenization in Hunspell program.
  
  - Bug fixes (at new and old compound word handling methods), etc.

2005-11-11: Hunspell 1.1.2 release:

  - Bug fixes (MAP Unicode, COMPOUND pattern matching, ONLYINCOMPOUND
    suggestions)

  - Checked with 51 regression tests in Valgrind debugging environment,
    and tested with 52 OOo dictionaries on i686-pc-linux platform.

2005-11-09: Hunspell 1.1.1 release:

  - Compound word patterns for complex compound word handling and
    simple word-level lexical scanning. Ideal for checking
    Arabic and Roman numbers, ordinal numbers in English, affixed
    numbers in agglutinative languages, etc.
    http://qa.openoffice.org/issues/show_bug.cgi?id=53643

  - Support ISO-8859-15 encoding for French (French oe ligatures are
    missing from the latin-1 encoding).
    http://qa.openoffice.org/issues/show_bug.cgi?id=54980
    
  - Implemented a flag to forbid obscene word suggestion:
    http://qa.openoffice.org/issues/show_bug.cgi?id=55498

  - Checked with 50 regression tests in Valgrind debugging environment,
    and tested with 52 OOo dictionaries.

  - other improvements and bug fixes (see ChangeLog)

2005-09-19: Hunspell 1.1.0 release

* complete comparison with MySpell 3.2 (from OpenOffice.org 2 beta)

* improved ngram suggestion with swap character detection and
  case insensitivity

------ examples for ngram improvement (input word and suggestions) -----

1. pernament (instead of permanent)

MySpell 3.2: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
        ornament, ornamentals, ornamental, ornamentally

Hunspell 1.0.9: ornamental, ornament, tournament

Hunspell 1.1.0: permanent

Note: swap character detection


2. PERNAMENT (instead of PERMANENT)

MySpell 3.2: -

Hunspell 1.0.9: -

Hunspell 1.1.0: PERMANENT


3. Unesco (instead of UNESCO)

MySpell 3.2: Genesco, Ionesco, Genesco's, Ionesco's, Frescoing, Fresco's,
             Frescoed, Fresco, Escorts, Escorting

Hunspell 1.0.9: Genesco, Ionesco, Fresco

Hunspell 1.1.0: UNESCO


4. siggraph's (instead of SIGGRAPH's)

MySpell 3.2: serigraph's, photograph's, serigraphs, physiography's,
             physiography, digraphs, serigraph, stratigraphy's, stratigraphy
             epigraphs

Hunspell 1.0.9: serigraph's, epigraph's, digraph's

Hunspell 1.1.0: SIGGRAPH's

--------------- end of examples --------------------

* improved testing environment with suggestion checking and memory debugging

  memory debugging of all tests with a simple command:
  
  VALGRIND=memcheck make check

* lots of other improvements and bug fixes (see ChangeLog)


2005-08-26: Hunspell 1.0.9 release

* improved related character map suggestion

* improved ngram suggestion

------ examples for ngram improvement (O=old, N = new ngram suggestions) --

1. Permenant (instead of Permanent)

O: Endangerment, Ferment, Fermented, Deferment's, Empowerment,
        Ferment's, Ferments, Fermenting, Countermen, Weathermen

N: Permanent, Supermen, Preferment

Note: Ngram suggestions was case sensitive.

2. permenant (instead of permanent) 

O: supermen, newspapermen, empowerment, endangerment, preferments,
        preferment, permanent, preferment's, permanently, impermanent

N: permanent, supermen, preferment

Note: new suggestions are also weighted with longest common subsequence,
first letter and common character positions

3. pernemant (instead of permanent) 

O: pimpernel's, pimpernel, pimpernels, permanently, permanents, permanent,
        supernatant, impermanent, semipermanent, impermanently

N: permanent, supernatant, pimpernel

Note: new method also prefers root word instead of not
relevant affixes ('s, s and ly)


4. pernament (instead of permanent)

O: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
        ornament, ornamentals, ornamental, ornamentally

N: ornamental, ornament, tournament

Note: Both ngram methods misses here.


5. obvus (instad of obvious):

O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse,
        obviates, obviate, Travus

N: obvious, obtuse, obverse

Note: new method also prefers common first letters.


6. unambigus (instead of unambiguous) 

O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous,
        unambitious, ambiguities, ambiguousness

N: unambiguous, unambiguity, unambitious


7. consecvence (instead of consequence)

O: consecutive, consecutively, consecutiveness, nonconsecutive, consequence,
        consecutiveness's, convenience's, consistences, consistence

N: consequence, consecutive, consecrates


An example in a language with rich morphology:

8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian]):

O: Misik<69>d<EFBFBD>iben, Pisised<65>iben, Misik<69>i<EFBFBD>iben, Pisisek<65>iben, Misik<69>iben,
        Misik<69>id<69>iben, Misik<69>k<EFBFBD>iben, Misik<69>ik<69>iben, Misik<69>im<69>iben, Mississippiiben

N: Mississippiben, Mississippiiben, Misiiben

Note: Suggesting not relevant affixes was the biggest fault in ngram
   suggestion for languages with a lot of affixes.

--------------- end of examples --------------------

* support twofold prefix cutting

* lots of other improvements and bug fixes (see ChangeLog)

* test Hunspell with 54 OpenOffice.org dictionaries:

source: ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries

testing shell script:
-------------------------------------------------------
for i in `ls *zip | grep '^[a-z]*_[A-Z]*[.]'`
do
	dic=`basename $i .zip`
	mkdir $dic
	echo unzip $dic
	unzip -d $dic $i 2>/dev/null
	cd $dic
	echo unmunch and test $dic
	unmunch $dic.dic $dic.aff 2>/dev/null | awk '{print$0"\t"}' |
	hunspell -d $dic -l -1 >$dic.result 2>$dic.err || rm -f $dic.result
	cd ..
done
--------------------------------------------------------

test result (0 size is o.k.):

$ for i in *_*/*.result; do wc -c $i; done 
0 af_ZA/af_ZA.result
0 bg_BG/bg_BG.result
0 ca_ES/ca_ES.result
0 cy_GB/cy_GB.result
0 cs_CZ/cs_CZ.result
0 da_DK/da_DK.result
0 de_AT/de_AT.result
0 de_CH/de_CH.result
0 de_DE/de_DE.result
0 el_GR/el_GR.result
6 en_AU/en_AU.result
0 en_CA/en_CA.result
0 en_GB/en_GB.result
0 en_NZ/en_NZ.result
0 en_US/en_US.result
0 eo_EO/eo_EO.result
0 es_ES/es_ES.result
0 es_MX/es_MX.result
0 es_NEW/es_NEW.result
0 fo_FO/fo_FO.result
0 fr_FR/fr_FR.result
0 ga_IE/ga_IE.result
0 gd_GB/gd_GB.result
0 gl_ES/gl_ES.result
0 he_IL/he_IL.result
0 hr_HR/hr_HR.result
200694989 hu_HU/hu_HU.result
0 id_ID/id_ID.result
0 it_IT/it_IT.result
0 ku_TR/ku_TR.result
0 lt_LT/lt_LT.result
0 lv_LV/lv_LV.result
0 mg_MG/mg_MG.result
0 mi_NZ/mi_NZ.result
0 ms_MY/ms_MY.result
0 nb_NO/nb_NO.result
0 nl_NL/nl_NL.result
0 nn_NO/nn_NO.result
0 ny_MW/ny_MW.result
0 pl_PL/pl_PL.result
0 pt_BR/pt_BR.result
0 pt_PT/pt_PT.result
0 ro_RO/ro_RO.result
0 ru_RU/ru_RU.result
0 rw_RW/rw_RW.result
0 sk_SK/sk_SK.result
0 sl_SI/sl_SI.result
0 sv_SE/sv_SE.result
0 sw_KE/sw_KE.result
0 tet_ID/tet_ID.result
0 tl_PH/tl_PH.result
0 tn_ZA/tn_ZA.result
0 uk_UA/uk_UA.result
0 zu_ZA/zu_ZA.result

In en_AU dictionary, there is an abbrevation with two dots (`eqn..'), but
`eqn.' is missing. Presumably it is a dictionary bug. Myspell also
haven't accepted it.

Hungarian dictionary contains pseudoroots and forbidden words.
Unmunch haven't supported these features yet, and generates bad words, too.

* check affix rules and OOo dictionaries. Detected bugs in cs_CZ,
es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE dictionaries).

Details:
--------------------------------------------------------
cs_CZ
warning - incompatible stripping characters and condition:
SFX D   us          ech        [^ighk]os
SFX D   us          y          [^i]os
SFX Q   os          ech        [^ghk]es
SFX M   o           ech        [^ghkei]a
SFX J   <20>m          ej         <20>m
SFX J   <20>m          ejme       <20>m
SFX J   <20>m          ejte       <20>m
SFX A   ou<6F>it       up         oupit
SFX A   ou<6F>it       upme       oupit
SFX A   ou<6F>it       upte       oupit
SFX A   nout        l          [aeiouy<75><79><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>r][^aeiouy<75><79><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>rl][^aeiouy
SFX A   nout        l          [aeiouy<75><79><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>r][^aeiouy<75><79><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>rl][^aeiouy

es_ES
warning - incompatible stripping characters and condition:
SFX W umar <20>se [ae]husar
SFX W emir i<><69>is e<>ir

es_NEW
warning - incompatible stripping characters and condition:
SFX I unan <20>nen unar

es_MX
warning - incompatible stripping characters and condition:
SFX A a ote e
SFX W umar <20>se [ae]husar
SFX W emir i<><69>is e<>ir

lt_LT
warning - incompatible stripping characters and condition:
SFX U ti      siuosi          tis       
SFX U ti      siuosi          tis       
SFX U ti      siesi           tis       
SFX U ti      siesi           tis       
SFX U ti      sis             tis       
SFX U ti      sis             tis       
SFX U ti      sim<69>s           tis       
SFX U ti      sim<69>s           tis       
SFX U ti      sit<69>s           tis       
SFX U ti      sit<69>s           tis       

nn_NO
warning - incompatible stripping characters and condition:
SFX D   ar  rar  [^fmk]er
SFX U   <20>re  orde  ere
SFX U   <20>re  ort  ere

pt_PT
warning - incompatible stripping characters and condition:
SFX g   <20>os        oas        <20>o
SFX g   <20>os        oas        <20>o

ro_RO
warning - bad field number:
SFX L   0          le         [^cg] i
SFX L   0          i          [cg] i
SFX U   0          i          [^i] ii
warning - incompatible stripping characters and condition:
SFX P   l          i          l	[<- there is an unnecessary tabulator here)
SFX I   a          ii         [gc] a
warning - bad field number:
SFX I   a          ii         [gc] a
SFX I   a          ei         [^cg] a

sk_SK
warning - incompatible stripping characters and condition:
SFX T   <20>a<EFBFBD>         ol<6F>        kla<6C>
SFX T   <20>a<EFBFBD>         ol<6F>c       kla<6C>
SFX T   s<>a<EFBFBD>        <20>l<EFBFBD>        sla<6C>
SFX T   s<>a<EFBFBD>        <20>l<EFBFBD>c       sla<6C>
SFX R   <20>c<EFBFBD>         l<>iem      <20>c<EFBFBD>
SFX R   i<>s<EFBFBD>        <20>tie       mias<61>
SFX R   iez<65>        iem        [^i]ez<65>
SFX R   iez<65>        ie<69>        [^i]ez<65>
SFX R   iez<65>        ie         [^i]ez<65>
SFX R   iez<65>        eme        [^i]ez<65>
SFX R   iez<65>        ete        [^i]ez<65>
SFX R   iez<65>        <20>          [^i]ez<65>
SFX R   iez<65>        <20>c         [^i]ez<65>
SFX R   iez<65>        z          [^i]ez<65>
SFX R   iez<65>        me         [^i]ez<65>
SFX R   iez<65>        te         [^i]ez<65>

sv_SE
warning - bad field number:
SFX  C  0  net  nets [^e]n
--------------------------------------------------------

2005-08-01: Hunspell 1.0.8 release

- improved compound word support
- fix German S handling
- port MySpell files and MAP feature

2005-07-22: Hunspell 1.0.7 release

2005-07-21: new home page: http://hunspell.sourceforge.net