Support numbers of Chinese and other language systems when sorting in Dolphin

Firestar-Reimu · December 31, 2023, 9:12am

The default sort order in Dolphin for Chinese is by pinyin (the romanization of Chinese characters), but this puts Chinese numerals out of their natural order. The natural order is (from top to bottom: Chinese, pinyin, Arabic numerals):

一、二、三、四、五、六、七、八、九、十
yi er san si wu liu qi ba jiu shi
1 2 3 4 5 6 7 8 9 10

Sorted alphabetically by pinyin, it becomes:

八、二、九、六、七、三、十、四、五、一
ba er jiu liu qi san shi si wu yi
8 2 9 6 7 3 10 4 5 1

This is frustrating when there are lots of files named with Chinese numerals, like 第一章 (chapter 1).
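To make the mismatch concrete, here is a minimal C++ sketch that sorts the ten numerals once by their toneless pinyin and once by their numeric value. The `Numeral` struct, the hand-written table, and both function names are made up for illustration; this is not how Dolphin, QCollator, or ICU implement collation, and a real pinyin collation also takes tones into account.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: the numeral (UTF-8), its toneless pinyin, its value.
struct Numeral {
    std::string hanzi;
    std::string pinyin;
    int value;
};

// Hand-written illustration table for 一..十 (not taken from ICU or Dolphin).
std::vector<Numeral> numerals() {
    return {
        {"一", "yi", 1},  {"二", "er", 2},  {"三", "san", 3}, {"四", "si", 4},
        {"五", "wu", 5},  {"六", "liu", 6}, {"七", "qi", 7},  {"八", "ba", 8},
        {"九", "jiu", 9}, {"十", "shi", 10},
    };
}

// Values in the order produced by sorting on the romanization,
// which approximates what a pinyin-based sort does.
std::vector<int> valuesInPinyinOrder() {
    std::vector<Numeral> v = numerals();
    std::sort(v.begin(), v.end(), [](const Numeral& a, const Numeral& b) {
        return a.pinyin < b.pinyin;
    });
    std::vector<int> out;
    for (const Numeral& n : v) out.push_back(n.value);
    return out;
}

// Values in the order a reader expects: sorted by what the numeral means.
std::vector<int> valuesInNumericOrder() {
    std::vector<Numeral> v = numerals();
    std::sort(v.begin(), v.end(), [](const Numeral& a, const Numeral& b) {
        return a.value < b.value;
    });
    std::vector<int> out;
    for (const Numeral& n : v) out.push_back(n.value);
    return out;
}
```

Sorting on the pinyin key puts 八 (8) first and 一 (1) last, while sorting on the value gives 1 through 10.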

jinliu · December 31, 2023, 12:21pm

I guess Dolphin does the sorting with QCollator from Qt, which in turn uses ICU. So you would need to change ICU for that.

And part of the problem is that "一" (U+4E00, CJK UNIFIED IDEOGRAPH-4E00) is not classified as a decimal digit (general category Nd) in Unicode, even though it has a numeric value of 1. So ICU probably doesn't see it as a number when sorting.

jinliu · January 1, 2024, 3:15pm

You probably need to change this function to set DIGIT_TAG on all characters that have a numeric value:

https://github.com/unicode-org/icu/blob/1a60a038e14f0c56f50052c03fe76c4933cda339/icu4c/source/i18n/collationdatabuilder.cpp#L1273

void
CollationDataBuilder::setDigitTags(UErrorCode &errorCode) {
    UnicodeSet digits(UNICODE_STRING_SIMPLE("[:Nd:]"), errorCode);
    if(U_FAILURE(errorCode)) { return; }
    UnicodeSetIterator iter(digits);
    while(iter.next()) {
        U_ASSERT(!iter.isString());
        UChar32 c = iter.getCodepoint();
        uint32_t ce32 = utrie2_get32(trie, c);
        if(ce32 != Collation::FALLBACK_CE32 && ce32 != Collation::UNASSIGNED_CE32) {
            int32_t index = addCE32(ce32, errorCode);
            if(U_FAILURE(errorCode)) { return; }
            // ... (excerpt truncated)

But I’m not sure if ICU would accept such a change.
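As a rough idea of the ordering such a change would be aiming for, here is a stand-alone C++ sketch that sorts names like 第一章 by the value of the Chinese number they contain. It does not touch ICU and is not a patch for `setDigitTags`; `splitUtf8`, `chineseNumberIn`, and `sortChapters` are hypothetical helpers, and the parser only handles 1 to 99. One caveat it illustrates: Chinese numbers are not positional (十 is 10, 二十一 is 21), so tagging each character as a digit may not be enough on its own for compound numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split a UTF-8 string into one std::string per code point.
// Assumes the input is valid UTF-8.
std::vector<std::string> splitUtf8(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}

// Value of the first run of Chinese numerals in `name` (1..99 only),
// or -1 if the name contains none. Illustrative, not a full parser.
int chineseNumberIn(const std::string& name) {
    static const std::map<std::string, int> digit = {
        {"一", 1}, {"二", 2}, {"三", 3}, {"四", 4}, {"五", 5},
        {"六", 6}, {"七", 7}, {"八", 8}, {"九", 9},
    };
    int total = 0;     // tens accumulated so far
    int pending = 0;   // last unit digit seen, not yet added
    bool seen = false;
    for (const std::string& ch : splitUtf8(name)) {
        auto it = digit.find(ch);
        if (it != digit.end()) {
            pending = it->second;
            seen = true;
        } else if (ch == "十") {
            // A bare 十 means 10; 二十 means 2 * 10.
            total += (pending == 0 ? 1 : pending) * 10;
            pending = 0;
            seen = true;
        } else if (seen) {
            break;  // the numeral run has ended
        }
    }
    return seen ? total + pending : -1;
}

// Sort names by the embedded Chinese number, falling back to byte order.
std::vector<std::string> sortChapters(std::vector<std::string> names) {
    std::sort(names.begin(), names.end(),
              [](const std::string& a, const std::string& b) {
                  int na = chineseNumberIn(a), nb = chineseNumberIn(b);
                  if (na != nb) return na < nb;
                  return a < b;
              });
    return names;
}
```

For example, `sortChapters({"第十章", "第二章", "第一章", "第十一章"})` returns the chapters in the order 1, 2, 10, 11.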