[Development Tips] How to Get the Number of Strokes in Chinese Characters?

Background

While developing a simple divination script, I ran into an interesting problem. If only a handful of specific Chinese characters are involved, we can hard-code a dictionary in the script, but what if we want the stroke count for any Chinese character?

pypinyin library

from pypinyin import pinyin, Style

def get_strokes_count(chinese_character):
    pinyin_list = pinyin(chinese_character, style=Style.NORMAL)
    strokes_count = len(pinyin_list[0])
    return strokes_count

character = input("Please enter a Chinese character: ")
strokes = get_strokes_count(character)
print("Character '{}' stroke count: {}".format(character, strokes))

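I tried it and found that the result is not the stroke count at all: pinyin() returns one inner list of readings per character, so len(pinyin_list[0]) is just the number of readings returned in the normal pinyin style for that character.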

Unihan Database

The Unihan database is a Chinese character database maintained by the Unicode Consortium, which seems quite reliable and also provides online tools.

In its online query tool, Unihan Database Lookup, I found that the query results contain the kTotalStrokes field, which is the stroke count data we need.
As Unicode's official database, the current version fully meets the basic needs of Chinese character lookups.

Nice! One step closer to success!

Getting Stroke Information from Unihan Database

I initially planned to send query requests directly to the online lookup tool, but it was too slow, and the server is hosted outside China. The database file itself turned out not to be large, so I downloaded it directly.

Unihan

After unpacking the archive, there are several files.
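Rather than opening each file by hand, one way to see which file carries a given field is to scan the archive. The following is only a sketch: the helper name files_containing_field is hypothetical, and it assumes the archive members are UTF-8 text with tab-separated fields.

```python
import io
import zipfile


def files_containing_field(archive, field):
    """Return names of archive members with at least one data line for `field`."""
    needle = f"\t{field}\t"
    hits = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            with zf.open(name) as raw:
                # Assumption: the data files are UTF-8 text
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    if not line.startswith("#") and needle in line:
                        hits.append(name)
                        break
    return hits


# Usage (assumes the downloaded archive is saved as Unihan.zip):
# print(files_containing_field("Unihan.zip", "kTotalStrokes"))
```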

According to the lookup results, we need the kTotalStrokes field, which lives in the IRG Sources file (Unihan_IRGSources.txt). Extract this file.
I tested a regex on regex101 to extract the Unicode code point and the stroke count, then saved them to a separate file for later lookups.
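To make the extraction step concrete, here is a small sketch that runs a regex over a few sample lines. The sample lines are hand-written to mimic the tab-separated layout (an assumption, not copied from the real file), and the pattern is a stricter variant rather than the exact one used in this article; if a line listed several counts, this variant would keep the first one.

```python
import re

# Hand-written sample lines (an assumption) in the tab-separated layout
SAMPLE = "\n".join([
    "# comment line",
    "U+4E00\tkRSUnicode\t1.0",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E8C\tkTotalStrokes\t2",
    "U+4E09\tkTotalStrokes\t3",
])

# Code point, the literal field name, then the first stroke count
PATTERN = re.compile(r"^(U\+[0-9A-F]+)\tkTotalStrokes\t(\d+)")


def parse_strokes(text):
    """Map 'U+XXXX' keys to integer stroke counts for matching lines."""
    strokes = {}
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            key, count = match.groups()
            strokes[key] = int(count)
    return strokes


print(parse_strokes(SAMPLE))  # {'U+4E00': 1, 'U+4E8C': 2, 'U+4E09': 3}
```

Comment lines and lines for other fields simply fail to match and are skipped.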

Coding

Extracting Stroke Information

import json
import re
from pathlib import Path

file = Path("Stroke/Unihan_IRGSources.txt")
output = Path("Stroke/unicode2stroke.json")

# Matches lines such as "U+4E00<TAB>kTotalStrokes<TAB>1".
# If a line lists several counts, the greedy ".*" captures the last one.
pattern = re.compile(r"(U\+.*)\skTotalStrokes.*\s(\d+)")

stroke_dict = dict()
with open(file, mode="r", encoding="utf-8") as f:
    for line in f:
        match = pattern.search(line.strip())
        if match is None:
            # Comment lines and other fields are skipped
            continue
        unicode_key, unicode_stroke = match.groups()
        print(f"{unicode_key}: {unicode_stroke}")
        stroke_dict[unicode_key] = unicode_stroke

# Values are stored as strings; convert with int() when reading them back
with open(file=output, mode="w", encoding="utf-8") as f:
    json.dump(stroke_dict, f, ensure_ascii=False, indent=4)

The result is exported to JSON for easy access.

Writing the Acquisition Function

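import json
from pathlib import Path

# Same JSON file written in the extraction step
output = Path("Stroke/unicode2stroke.json")
with open(output, encoding="utf-8") as f:
    unicode2stroke = json.load(f)

def get_character_stroke_count(char: str) -> int:
    # Build the "U+XXXX" key used in the Unihan file (uppercase hex code point)
    unicode_key = f"U+{ord(char):04X}"
    return int(unicode2stroke[unicode_key])

test_char = "阿"
print(get_character_stroke_count(char=test_char))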

Note that the keys are in the form U+XXXX, so the character must first be converted to its code point in uppercase hexadecimal before the lookup.
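The key conversion can be checked on its own, without the full data file. The sketch below uses hypothetical helper names (to_unihan_key, lookup_strokes) and a tiny stand-in table instead of the real JSON export. It also returns None for characters that have no entry, which is one possible way to avoid a KeyError.

```python
def to_unihan_key(char):
    """Format a single character as an uppercase 'U+XXXX' key."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(char):04X}"


def lookup_strokes(char, table):
    """Return the stroke count for `char`, or None if the table has no entry."""
    value = table.get(to_unihan_key(char))
    return None if value is None else int(value)


# Tiny stand-in for the JSON mapping (values are strings, as in the export)
demo_table = {"U+4E00": "1", "U+4E8C": "2"}

print(to_unihan_key("一"))               # U+4E00
print(to_unihan_key("\U00020000"))       # U+20000 (outside the Basic Multilingual Plane)
print(lookup_strokes("二", demo_table))  # 2
print(lookup_strokes("A", demo_table))   # None, no entry for U+0041
```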

Success! The expected result is achieved!
