DEV Community

algieba

[Development Tips] How to Get the Number of Strokes in Chinese Characters?

Background

While developing a simple divination script, I ran into an interesting problem. If only a few specific Chinese characters are involved, we can hard-code a dictionary in the script. But what if we want the stroke count for any Chinese character?

pypinyin library

from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Please enter a Chinese character:")
strokes = get_strokes_count(character)
print("Character '{}' stroke count: {}".format(character, strokes))

I tried it and found that the value returned is just the number of pinyin results in the normal style for that character, which has nothing to do with its stroke count.

[Image: pypinyin returns the wrong result]

Unihan Database

The Unihan database is a Chinese character database maintained by the Unicode Consortium, which seems quite reliable and also provides online tools.

In its online query tool, Unihan Database Lookup, I found that the query results contain the kTotalStrokes field, which is the stroke count data we need.
As Unicode's official database, the current version fully covers the basic needs of Chinese character queries.

Nice! One step closer to success!
[Image: Unihan Database Lookup result]

Getting Stroke Information from Unihan Database

I initially planned to send query requests directly to the lookup tool, but it was too slow because the server is hosted outside China. The database file itself is not large, so I downloaded it directly instead.

Unihan

After extracting the archive, there are several files.

[Image: files in the Unihan archive]

According to the lookup results, the kTotalStrokes field is listed under IRG Sources, so the file we need is Unihan_IRGSources.txt. Extract this file.
I tested a regex on regex101 that captures the Unicode code point and the stroke count from each line, then saved the pairs separately for querying.
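The download and extraction can also be scripted. The following is only a minimal sketch, not the procedure used above: the URL is an assumption about where unicode.org publishes Unihan.zip and should be checked before use, and `download_unihan` and `extract_member` are hypothetical helper names.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumed location of the archive on unicode.org; verify before relying on it.
UNIHAN_URL = "https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip"


def download_unihan(dest: Path, url: str = UNIHAN_URL) -> Path:
    """Download the archive once and reuse the local copy afterwards."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


def extract_member(zip_path: Path, member: str, out_dir: Path) -> Path:
    """Extract a single file from the archive and return its path."""
    with zipfile.ZipFile(zip_path) as zf:
        return Path(zf.extract(member, path=out_dir))


# Example usage (performs a network request on the first run):
# zip_path = download_unihan(Path("Stroke/Unihan.zip"))
# extract_member(zip_path, "Unihan_IRGSources.txt", Path("Stroke"))
```

Keeping a local copy avoids repeated slow requests to a remote server.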
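To make the extraction idea concrete, here is a small sketch that runs an anchored pattern over a few sample lines. The sample lines are illustrative stand-ins for the tab-separated format of the Unihan files (code point, field name, value), and `parse_strokes` is a hypothetical helper name.

```python
import re

# Illustrative lines in the tab-separated Unihan format.
sample = (
    "# comment lines start with '#'\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+4E2D\tkRSUnicode\t2.3\n"
    "U+4E2D\tkTotalStrokes\t4\n"
)

# Group 1: the code point, group 2: the first stroke count on the line.
PATTERN = re.compile(r"^(U\+[0-9A-F]+)\tkTotalStrokes\t(\d+)")


def parse_strokes(text: str) -> dict:
    """Return {code point: stroke count} for every kTotalStrokes line."""
    table = {}
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            table[match.group(1)] = int(match.group(2))
    return table


print(parse_strokes(sample))  # {'U+4E00': 1, 'U+4E2D': 4}
```

Comment lines and lines for other fields simply do not match, so they are skipped without any extra checks.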

Coding

  • Extracting Stroke Information
import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")
# Data lines look like "U+4E00<TAB>kTotalStrokes<TAB>1".
# When a line lists several values, this captures the first one.
pattern = re.compile(r"^(U\+[0-9A-F]+)\tkTotalStrokes\t(\d+)")
stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        result = pattern.match(line.strip())
        if result is None:
            # Comment line or a different field
            continue
        unicode_key, unicode_stroke = result.groups()
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

# Save the mapping so later lookups do not need to re-parse the Unihan file
with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)

The result is exported to JSON for easy access.

  • Writing the Acquisition Function
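import json

# `output` is the JSON file written in the previous step
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Unihan keys are the code point in uppercase hex, e.g. "U+4E2D"
    unicode = f"U+{ord(char):04X}"
    return int(unicode2stroke[unicode])

test_char = ""  # put a single Chinese character here
if test_char:
    print(get_character_stroke_count(char=test_char))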

Note that the lookup key is the character's Unicode code point written in hexadecimal, so the character must be converted to that form first. For example, 中 becomes U+4E2D.
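The conversion can be checked in isolation. This is a minimal sketch under stated assumptions: `to_unihan_key` and `stroke_count_or_none` are hypothetical helper names, and `demo_table` is a tiny hand-written stand-in for the exported JSON, not real output from the script.

```python
def to_unihan_key(char: str) -> str:
    """Format one character as a Unihan key such as 'U+4E2D'."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # Uppercase hex, padded to at least four digits
    return f"U+{ord(char):04X}"


# Hand-written stand-in for the exported JSON (values are stored as strings).
demo_table = {"U+4E00": "1", "U+4E2D": "4"}


def stroke_count_or_none(char: str, table: dict):
    """Return the stroke count, or None if the character is not in the table."""
    value = table.get(to_unihan_key(char))
    return int(value) if value is not None else None


print(to_unihan_key("中"))                     # U+4E2D
print(stroke_count_or_none("中", demo_table))  # 4
print(stroke_count_or_none("A", demo_table))   # None
```

Returning None for unknown characters avoids a KeyError when the input is not a Chinese character.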

Success! The expected result is achieved!
