PHP mb_ord() Function

What is the PHP mb_ord() Function?

If you want to get the Unicode code point (also written codepoint, or called code position) of the first character of a string, use the mb_ord() function.

Why do we use multi-byte (mb_) functions?

One byte (8 consecutive bits) can hold at most 256 distinct values, so a single-byte encoding can’t express more than 256 characters. What happens if we want to write multiple languages in one document? In this case, 256 characters are not enough, which is why we need a multi-byte encoding system.

Most of us use the UTF-8 character encoding in our web applications, and it supports a multi-byte character set. A UTF-8 character consists of 1 to 4 bytes, which is enough to encode all 1,112,064 valid Unicode code points. UTF-8 is also compatible with ASCII, which is a single-byte character encoding: every ASCII character keeps the same single byte in UTF-8. PHP’s multi-byte functions (those starting with mb_) operate correctly on multi-byte strings.
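
To make the byte counts concrete, here is a small sketch that compares strlen(), which counts bytes, with mb_strlen(), which counts characters. It assumes the file is saved as UTF-8 and the mbstring extension is enabled; the sample characters are only illustrations.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$samples = ["W", "é", "い", "😀"];

foreach ($samples as $char) {
    // strlen() counts bytes; mb_strlen() counts characters.
    echo $char . ": " . strlen($char) . " byte(s), "
        . mb_strlen($char, "UTF-8") . " character(s)\n";
}
```

Each sample is one character, but the byte count grows from 1 to 4 as the code point gets larger.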

Syntax:

mb_ord(string, encoding)

Parameters:

The function has 1 required parameter and 1 optional parameter-

string (Required): The string whose first character’s Unicode code point we want to find. Check example 1.

encoding (Optional): The character encoding of the string. It tells the function how to interpret the bytes of the string. If you omit this parameter or pass NULL, the internal character encoding is used. You can check it with mb_internal_encoding(); by default it follows the “default_charset” setting in the php.ini file. Check example 2.
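
As a rough illustration of the optional parameter, the sketch below prints the current internal encoding and then calls mb_ord() with and without an explicit encoding. It assumes a default PHP configuration, where the internal encoding is UTF-8; the variable names are made up for this example.

```php
<?php
// Show which encoding mb_ord() falls back to when none is given.
echo "Internal encoding: " . mb_internal_encoding() . "\n";

$explicit = mb_ord("Web", "UTF-8"); // encoding stated explicitly
$implicit = mb_ord("Web");          // falls back to the internal encoding

echo "Explicit: " . $explicit . "\n";
echo "Implicit: " . $implicit . "\n";
```

Stating the encoding explicitly keeps the result independent of the server configuration.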

Return Values:

The function returns-

  • The Unicode code point of the first character – on success.
  • FALSE – on failure.
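
Because 0 is itself a valid code point, a failed call is best detected with a strict comparison against FALSE. The helper below is a hypothetical wrapper, not part of PHP. It assumes UTF-8 input by default, and it checks for an empty string up front because empty input is not handled the same way in every PHP version.

```php
<?php
// Hypothetical helper (not part of PHP): returns the first code point,
// or null when the input is empty or not valid in the given encoding.
function firstCodePoint(string $text, string $encoding = "UTF-8"): ?int
{
    if ($text === "") {
        return null; // avoid calling mb_ord() on an empty string
    }
    $codePoint = mb_ord($text, $encoding);
    // Strict comparison: 0 is a valid code point, false signals failure.
    return $codePoint === false ? null : $codePoint;
}

var_dump(firstCodePoint("Web"));  // int(87)
var_dump(firstCodePoint(""));     // NULL
var_dump(firstCodePoint("\xFF")); // NULL, the byte 0xFF is not valid UTF-8
```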

Examples:

Example 1: Simple mb_ord() function-

<?php
echo "The Unicode code point of the first character of Web is: " . mb_ord("Web", "UTF-8");
?>

Output:

The Unicode code point of the first character of Web is: 87

Explanation:

The function returns 87, the code point of “W”, which is the first character of the string “Web”.

Example 2: Using mb_ord() without an explicit encoding-

<?php
echo "The Unicode code point of the first character of Web is (Without explicit encoding): " . mb_ord("Web");
?>

Output:

The Unicode code point of the first character of Web is (Without explicit encoding): 87

Explanation:

The call to mb_ord() doesn’t specify an encoding, yet it still returns 87, the code point of the character “W”. This happens because the function falls back to the internal encoding, which is UTF-8 here.

Example 3: Finding the code point of each character of a multi-byte string-

<?php
$arr = mb_str_split("いらっしゃいませ");
for($i=0; $i<count($arr); $i++){
    echo "The Unicode code point of the character " . $arr[$i]. " is: " . mb_ord($arr[$i]) . "<br />";
}
?>

Output:

The Unicode code point of the character い is: 12356
The Unicode code point of the character ら is: 12425
The Unicode code point of the character っ is: 12387
The Unicode code point of the character し is: 12375
The Unicode code point of the character ゃ is: 12419
The Unicode code point of the character い is: 12356
The Unicode code point of the character ま is: 12414
The Unicode code point of the character せ is: 12379

Explanation:

The mb_str_split() function splits the string into separate characters and creates an array. So, each element contains one character, and the mb_ord() function returns the Unicode code point of each character.
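
Code points are usually written in hexadecimal with a U+ prefix, so 12356 is U+3044. The following sketch prints both forms. It assumes a UTF-8 source file and PHP 7.4 or later for mb_str_split(); the helper name toUPlus is made up for this illustration.

```php
<?php
// Hypothetical helper: format a code point in the conventional U+XXXX form.
function toUPlus(int $codePoint): string
{
    return sprintf("U+%04X", $codePoint);
}

foreach (mb_str_split("いら", 1, "UTF-8") as $char) {
    $codePoint = mb_ord($char, "UTF-8");
    echo $char . " = " . $codePoint . " = " . toUPlus($codePoint) . "\n";
}
```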

Example 4: Finding missing digits-

<?php
$arr = mb_str_split("০১৩৫৭৯");
for($i=0; $i<count($arr); $i++){
    $numbers[] = mb_ord($arr[$i], "UTF-8");
}
$missingNumbers = "";
for ($i=2534; $i<=2543; $i++){
    if(!in_array($i, $numbers)){
        $missingNumbers .= mb_chr($i) . " ";
    }
}
echo $missingNumbers;
?>

Output:

২ ৪ ৬ ৮

Explanation:

Line 2: The mb_str_split() function splits the numeric string “০১৩৫৭৯”, which consists of Bangla digits, into separate digits and creates an array from them.

Line 3: The loop stores the code points of the digits in the $numbers array.

Line 7: Code points 2534 to 2543 are the 10 digits of the Bangla script (“০১২৩৪৫৬৭৮৯”), which are equivalent to the English digits (0,1,2,3,4,5,6,7,8,9).

Line 8: If a code point in this range is not found among the code points stored in the array, we add its character to the variable $missingNumbers in line 9. We use the mb_chr() function to convert the code point back to a digit character.

Example 5: Extract all multi-byte characters from a string-

<?php
$string ="かさた1ب 2ت 3ث 4ج";
echo "String is: " . $string . "<br />";
$arr = mb_str_split($string, 1, "UTF-8");
$multiByteChars = "";
// In UTF-8, only code points 0 to 127 fit in a single byte.
foreach ($arr as $char) {
    if (mb_ord($char, "UTF-8") > 127) {
        $multiByteChars .= $char;
    }
}
echo "Only Multi-byte characters are: " . $multiByteChars;
?>

Output:

String is: かさた1ب 2ت 3ث 4ج
Only Multi-byte characters are: かさたبتثج

Explanation:

Line 4: The mb_str_split() function splits the string into separate characters and creates an array from them.

Line 8: In the UTF-8 character encoding, only code points 0 to 127 (the ASCII range) are stored as a single byte; every code point above 127 takes 2 to 4 bytes. Here, we keep the characters whose code point is greater than 127.

Line 9: Each array element is already a single character, so we append it directly; there is no need to convert the code point back with mb_chr().

Example 6: Passing the wrong encoding for the same string-

<?php
echo "The Unicode code point of the first character (こ) read as UTF-8 is: " . mb_ord("こんにちは世界", "UTF-8");
echo "<br />";
echo "The Unicode code point of the first character (こ) read as SJIS is: " . mb_ord("こんにちは世界", "SJIS");
?>

Output:

The Unicode code point of the first character (こ) read as UTF-8 is: 12371
The Unicode code point of the first character (こ) read as SJIS is: 32314

Explanation:

The string is stored as UTF-8, so the first call returns 12371, the real code point of “こ”. The second call tells mb_ord() to read the same UTF-8 bytes as SJIS. The bytes are misinterpreted as a different character, so the function returns 32314, which is not the code point of “こ”.

Practical Usages of mb_ord() Function:

The mb_ord() function has many uses. A few of them are-

  • You can find characters that are missing from a known range of code points, such as the digits of a script. Check example 4.
  • UTF-8 characters range from 1 byte to 4 bytes. You can separate characters by their code point range, for example to keep only the multi-byte characters of a string or to discard them. Check example 5.
  • You can sort characters by their code points. For many scripts this matches alphabetical order, but not for every language.
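
As a sketch of the sorting idea, the snippet below orders the characters of a string by code point with usort(). Note the assumption: code point order matches alphabetical order only for some scripts, so a locale-aware collator is the better tool for real text. The function name sortByCodePoint is hypothetical.

```php
<?php
// Hypothetical helper: sort the characters of a UTF-8 string by code point.
function sortByCodePoint(string $text): string
{
    $chars = mb_str_split($text, 1, "UTF-8"); // requires PHP 7.4+
    usort($chars, function (string $a, string $b): int {
        return mb_ord($a, "UTF-8") <=> mb_ord($b, "UTF-8");
    });
    return implode("", $chars);
}

echo sortByCodePoint("৩১২০"); // ০১২৩
```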

Notes on mb_ord() Function:

The opposite of the mb_ord() function is the mb_chr() function, which converts a code point back to a character.
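
A quick way to see the relationship is a round trip: mb_chr() applied to the result of mb_ord() gives back the original character. This sketch assumes UTF-8 input; the variable names are only illustrative.

```php
<?php
$char = "ま";
$codePoint = mb_ord($char, "UTF-8");      // 12414
$roundTrip = mb_chr($codePoint, "UTF-8"); // back to the original character

echo $codePoint . "\n";
var_dump($roundTrip === $char); // bool(true)
```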

Caution:

If the encoding you pass doesn’t match the actual encoding of the string, the bytes are misinterpreted and you’ll get a wrong code point. For example, a Japanese string stored as “UTF-8” gives a wrong result when it is read as “SJIS”. Check example 6.
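
The Unicode code point of a character is the same whatever encoding stores it; only the bytes differ. The sketch below, which assumes a UTF-8 source file and an mbstring build with SJIS support, converts the character to real SJIS bytes first and then reads it with the matching encoding name.

```php
<?php
$utf8 = "こ";                                        // stored as UTF-8 bytes
$sjis = mb_convert_encoding($utf8, "SJIS", "UTF-8"); // same character as SJIS bytes

echo mb_ord($utf8, "UTF-8") . "\n"; // 12371
echo mb_ord($sjis, "SJIS") . "\n";  // 12371, because the encoding matches the bytes
echo bin2hex($utf8) . " vs " . bin2hex($sjis) . "\n"; // the byte sequences differ
```

When the declared encoding matches the bytes, both calls agree on the code point.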

PHP Version Support:

PHP 7 (from version 7.2.0), PHP 8. The mb_str_split() function used in the examples requires PHP 7.4 or later.

Summary: PHP mb_ord() Function

mb_ord() is a useful multi-byte string function in PHP that returns the Unicode code point of the first character of a string.

Reference:

https://www.php.net/manual/en/function.mb-ord.php