UTF-8 Validation

#python #algorithms

Another LeetCode challenge, this time on UTF-8 validation.

PROBLEM
Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters). See the link above for more details.

SOLUTION
This was quite a challenge! Let's look at the steps to solve it.

Loop through the array.
Get the 8-bit binary representation of each integer with the built-in format function, which takes two arguments: the value and a format specification such as '08b'.
Inspect the first byte of each character to determine how long the character is. It must be 1 to 4 bytes long to be valid UTF-8.
Check that the two most significant bits of each following byte are 10, as the rules below require.
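
The steps above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact solution submitted: the function name valid_utf8 is hypothetical, and masking with 0xFF assumes, as the LeetCode problem statement says, that only the lowest 8 bits of each integer carry data.

```python
def valid_utf8(data):
    """Return True if data (a list of ints) forms a valid UTF-8 byte sequence."""
    remaining = 0  # continuation bytes still expected for the current character
    for num in data:
        # Keep only the lowest 8 bits (assumption from the problem statement)
        # and render them as an 8-character string of 0s and 1s.
        bits = format(num & 0xFF, "08b")
        if remaining == 0:
            # Count the leading 1s of the first byte to get the character length.
            ones = len(bits) - len(bits.lstrip("1"))
            if ones == 0:
                continue  # 1-byte character: 0xxxxxxx
            if ones == 1 or ones > 4:
                # A stray continuation byte, or a character longer than 4 bytes.
                return False
            remaining = ones - 1
        else:
            # Every following byte must start with the bits 10.
            if not bits.startswith("10"):
                return False
            remaining -= 1
    # The sequence is valid only if the last character was completed.
    return remaining == 0


print(valid_utf8([197, 130, 1]))  # 2-byte character followed by a 1-byte character
print(valid_utf8([235, 140, 4]))  # third byte does not start with 10
```

Tracking a single counter of expected continuation bytes keeps the loop to one pass over the array and avoids slicing the list.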

Rules
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n + 1)th bit is 0, followed by n - 1 bytes whose two most significant bits are 10.

This is how the UTF-8 encoding works:

   Number of Bytes  |  UTF-8 Octet Sequence
                    |  (binary)
   -----------------+-------------------------------------
          1         |  0xxxxxxx
          2         |  110xxxxx 10xxxxxx
          3         |  1110xxxx 10xxxxxx 10xxxxxx
          4         |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.
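
The table can also be read as a set of bit patterns to compare against after shifting away the x bits. The sketch below is an alternative to the string-formatting approach, not the method described in the steps above. The helper names lead_byte_length and is_continuation are hypothetical, and returning 0 for an invalid first byte is an arbitrary convention chosen for this example.

```python
def lead_byte_length(byte):
    """Return the character length (1 to 4) announced by a first byte, or 0 if invalid."""
    byte &= 0xFF  # assumption: only the lowest 8 bits are meaningful
    if byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if byte >> 3 == 0b11110:
        return 4  # 11110xxx
    return 0  # 10xxxxxx or 11111xxx cannot start a character


def is_continuation(byte):
    """Return True if the byte has the form 10xxxxxx."""
    return (byte & 0xFF) >> 6 == 0b10


print(lead_byte_length(0b11101011))  # first byte of a 3-byte character
print(is_continuation(0b10000010))   # a well-formed continuation byte
```

Shifting right by the number of x bits leaves only the fixed prefix from each table row, so every row becomes a single integer comparison.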
