* Copyright (C) 2004-2006, International Business Machines * Corporation and others. All Rights Reserved. * * file name: changes.txt * encoding: US-ASCII * tab size: 8 (not used) * indentation:4 * * created on: 2004may06 * created by: Markus W. Scherer * * change log for Unicode updates ---------------------------------------------------------------------------- *** Unicode 5.0 update *** related Jitterbugs 5084 RFE: Update to Unicode 5.0 *** data files & enums & parser code * file preparation - ucdstrip: DerivedCoreProperties.txt DerivedNormalizationProps.txt NormalizationTest.txt PropList.txt Scripts.txt GraphemeBreakProperty.txt SentenceBreakProperty.txt WordBreakProperty.txt - ucdstrip and ucdmerge: EastAsianWidth.txt LineBreak.txt * my ucd2unidata.txt (needs to be updated each time with UCD and file version numbers) copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\ copy 5.0.0\ucd\Blocks.txt ..\unidata\ copy 5.0.0\ucd\CaseFolding.txt ..\unidata\ copy 5.0.0\ucd\DerivedAge.txt ..\unidata\ copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\ copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\ copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\ copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\ copy 5.0.0\ucd\UnicodeData.txt ..\unidata\ ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt * update FractionalUCA.txt and UCARules.txt with new canonical closure * genpname - run preparse.pl + make sure that data.h is writable + perl preparse.pl \cvs\oss\icu > out.txt * uchar.h & uscript.h & uprops.h & uprops.c & genprops - new block & script values + script values already added in ICU 3.6 because all of ISO 15924 is now covered * build Unicode data source code for hardcoding core data C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data ICU data make path is \cvs\oss\icu\source\data\ ICU root path is \cvs\oss\icu Information: cannot find "ucmlocal.mk". Not building user-additional converter files. [etc.] Creating data file for Unicode Character Properties Creating data file for Unicode Case Mapping Properties Creating data file for Unicode BiDi/Shaping Properties Creating data file for Unicode Normalization Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l" Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp" - copy the .c source files to C:\cvs\oss\icu\source\common and rebuild the common library *** Unicode version numbers - makedata.mak - uchar.h - configure.in *** LayoutEngine script information * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates ScriptRunData.cpp, which is no longer needed.) The generated files have a current copyright date and "@draft" statement. * copy the above files into /source/layout, replacing the old files. Add new default entries to the indicClassTables array in /source/layout/IndicClassTables.cpp and the complexTable array in /source/layoutex/ParagraphLayout.cpp. (This step should be automated...) * rebuild the layout and layoutex libraries. ---------------------------------------------------------------------------- *** Unicode 4.1 update *** related Jitterbugs 4332 RFE: Update to Unicode 4.1 4157 RBBI, TR29 4.1 updates *** data files & enums & parser code * file preparation - ucdstrip: DerivedCoreProperties.txt DerivedNormalizationProps.txt NormalizationTest.txt GraphemeBreakProperty.txt SentenceBreakProperty.txt WordBreakProperty.txt - ucdstrip and ucdmerge: EastAsianWidth.txt LineBreak.txt * add new files to the repository GraphemeBreakProperty.txt SentenceBreakProperty.txt WordBreakProperty.txt * update FractionalUCA.txt and UCARules.txt with new canonical closure * genpname - handle new enumerated properties in sub read_uchar - run preparse.pl * uchar.h & uscript.h & uprops.h & uprops.c & genprops - new binary properties + Pattern_Syntax + Pattern_White_Space - new enumerated properties + Grapheme_Cluster_Break + Sentence_Break + Word_Break - new block & script & line break values * gencase - case-ignorable changes see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk *** Unicode version numbers - makedata.mak - uchar.h - configure.in *** tests - verify that u_charMirror() round-trips - test all new properties and some new values of old properties *** other code * hardcoded Unihan range end/limit - Unihan range end moves from 9FA5 to 9FBB search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive) + do not modify BOCU/BOCSU code because that would change the encoding and break binary compatibility! + similarly, do not change the GB 18030 range data (ucnvmbcs.c), NamePrepProfile.txt + ignore trietest.c: test data is arbitrary + ignore tstnorm.cpp: test optimization, not important + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF + do change line_th.txt and word_th.txt by replacing hardcoded ranges with the new property values + do change gennames.c source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5, * case mappings - compare new special casing context conditions with previous ones see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods * genpname - consider storing only the short name if it is the same as the long name *** other reviews - UAX #29 changes (grapheme/word/sentence breaks) - UAX #14 changes (line breaks) - Pattern_Syntax & Pattern_White_Space ---------------------------------------------------------------------------- *** Unicode 4.0.1 update *** related Jitterbugs 3170 RFE: Update to Unicode 4.0.1 3171 Add new Unicode 4.0.1 properties 3520 use Unicode 4.0.1 updates for break iteration *** data files & enums & parser code * file preparation - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt * file fixes - fix UnicodeData.txt general categories of Ethiopic digits Nd->No according to PRI #26 http://www.unicode.org/review/resolved-pri.html#pri26 - undone again because no corrigendum in sight; instead modified tests to not check consistency on this for Unicode 4.0.1 * ucdterms.txt - update from http://www.unicode.org/copyright.html formatted for plain text * uchar.h & uprops.h & uprops.c & genprops - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed - add U_LB_INSEPARABLE due to a spelling fix + put short name comment only on line with new constant for genpname perl script parser - new binary properties + STerm + Variation_Selector * genpname - fix genpname perl script so that it doesn't choke on more than 2 names per property value - perl script: correctly calculate the maximum number of fields per row * uscript.h - new script code Hrkt=Katakana_Or_Hiragana * gennorm.c track changes in DerivedNormalizationProps.txt - "FNC" -> "FC_NFKC" - single field "NFD_NO" -> two fields "NFD_QC; N" etc. * genprops/props2.c track changes in DerivedNumericValues.txt - changed from 3 columns to 2, dropping the numeric type + assume that the type is always numeric for Han characters, and that only those are added in addition to what UnicodeData.txt lists *** Unicode version numbers - makedata.mak - uchar.h - configure.in *** tests - update test of default bidi classes according to PRI #28 /tsutil/cucdtst/TestUnicodeData http://www.unicode.org/review/resolved-pri.html#pri28 - bidi tests: change exemplar character for ES depending on Unicode version - change hardcoded expected property values where they change *** other code * name matching - read UCD.html * scripts - use new Hrkt=Katakana_Or_Hiragana * ZWJ & ZWNJ - are now part of combining character sequences - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ