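PHP's str_word_count() function counts words, but it is of no use for Chinese text: it treats a word as a locale-dependent run of alphabetic characters, and Chinese is written without spaces between words.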

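The workaround is to count the Chinese characters (hanzi) yourself, based on how they are encoded, and to add the English word count on top when a combined total is needed. For GB2312-encoded text, use the following functions: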

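<?php
// GB2312 hanzi use lead bytes 0xB0-0xF7 and trail bytes 0xA1-0xFE.
// Both patterns match raw bytes (no /u modifier).
define('GB2312_CHINESE_PATTERN', '/[\xb0-\xf7][\xa1-\xfe]/');
// Lead bytes 0xA1-0xA3: punctuation, numbering symbols and full-width ASCII forms.
define('GB2312_SYMBOL_PATTERN', '/[\xa1-\xa3][\xa1-\xfe]/');

// Count Chinese characters only.
function str_gb2312_chinese_word_count($str = "") {
    // Strip symbols first so their bytes cannot pair up into a false hanzi match.
    $str = preg_replace(GB2312_SYMBOL_PATTERN, "", $str);
    return preg_match_all(GB2312_CHINESE_PATTERN, $str, $matches);
}

// Count Chinese characters plus English words.
function str_gb2312_mix_word_count($str = "") {
    // Replace with a space (not "") so English words on either side do not merge.
    $str = preg_replace(GB2312_SYMBOL_PATTERN, " ", $str);
    $english = preg_replace(GB2312_CHINESE_PATTERN, " ", $str);
    return str_gb2312_chinese_word_count($str) + str_word_count($english);
}
?>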

For UTF-8 encoded text, use the following functions:

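<?php
// CJK Unified Ideographs (U+4E00-U+9FFF) and CJK Compatibility Ideographs (U+F900-U+FAFF).
define('UTF8_CHINESE_PATTERN', '/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u');
// Halfwidth and Fullwidth Forms (U+FF00-U+FFEF) and General Punctuation (U+2000-U+206F).
define('UTF8_SYMBOL_PATTERN', '/[\x{ff00}-\x{ffef}\x{2000}-\x{206f}]/u');

// Count Chinese characters only.
function str_utf8_chinese_word_count($str = "") {
    $str = preg_replace(UTF8_SYMBOL_PATTERN, "", $str);
    return preg_match_all(UTF8_CHINESE_PATTERN, $str, $matches);
}

// Count Chinese characters plus English words.
function str_utf8_mix_word_count($str = "") {
    // Replace with a space (not "") so English words on either side do not merge.
    $str = preg_replace(UTF8_SYMBOL_PATTERN, " ", $str);
    $english = preg_replace(UTF8_CHINESE_PATTERN, " ", $str);
    return str_utf8_chinese_word_count($str) + str_word_count($english);
}
?>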

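The two versions above do the same job and differ only in the character encoding they target; pick the one that matches your page encoding. Each provides two functions: one counts Chinese characters only, the other counts Chinese characters plus English words. Neither counts Chinese or English punctuation.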
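To make the difference between the two counts concrete, here is a minimal self-contained sketch. It is not the code above: it assumes valid UTF-8 input, uses the PCRE `\p{Han}` script property instead of explicit ranges (so it may match slightly more characters), and counts English words as runs of ASCII letters rather than calling str_word_count(). All function names are hypothetical.

```php
<?php
// Hypothetical helpers for illustration; they are not the functions above.

// Count hanzi via the Unicode Han script property (input must be valid UTF-8).
function han_count(string $str): int
{
    return (int) preg_match_all('/\p{Han}/u', $str);
}

// Count English words as runs of ASCII letters, allowing an inner ' or -.
function english_word_count(string $str): int
{
    return (int) preg_match_all("/[A-Za-z]+(?:['-][A-Za-z]+)*/", $str);
}

// Hanzi count plus English word count; punctuation contributes nothing.
function mixed_count(string $str): int
{
    return han_count($str) + english_word_count($str);
}

$text = "Hello, 世界! It's PHP.";
echo han_count($text), "\n";   // 2  (世, 界)
echo mixed_count($text), "\n"; // 5  (2 hanzi + Hello, It's, PHP)
```

Matching ASCII letters explicitly keeps the English count independent of the server locale, at the cost of ignoring accented letters.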

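Note: the Chinese punctuation must be removed first, otherwise the count is wrong. In GB2312, for example, the two punctuation marks "：‘" are encoded as the bytes a3 ba a1 ae. The middle two bytes, ba a1, happen to be the GB2312 code for the character "骸", so they would be matched as a Chinese character and inflate the count.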
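The same kind of misalignment can also hit the symbol pattern itself, because it is applied to raw bytes: "啊阿" is b0 a1 b0 a2 in GB2312, and the middle bytes a1 b0 look like a symbol. One way to sidestep this, sketched here under the assumption that the input is well-formed GB2312, is to consume the text one double-byte character at a time and classify each character by its lead byte. The function name is hypothetical and this is an alternative, not the method used above.

```php
<?php
// Hypothetical helper: alignment-safe hanzi count for well-formed GB2312 text.
function gb2312_hanzi_count_aligned(string $str): int
{
    // Any GB2312 double-byte character has lead and trail bytes in 0xA1-0xFE.
    // ASCII bytes (< 0x80) never match, so matches stay aligned on character boundaries.
    preg_match_all('/[\xa1-\xfe][\xa1-\xfe]/', $str, $matches);

    $count = 0;
    foreach ($matches[0] as $pair) {
        $lead = ord($pair[0]);
        // Lead bytes 0xB0-0xF7 are hanzi; 0xA1-0xA9 are symbols and other scripts.
        if ($lead >= 0xB0 && $lead <= 0xF7) {
            $count++;
        }
    }
    return $count;
}

// "啊阿" (b0 a1 b0 a2) followed by the punctuation "：‘" (a3 ba a1 ae).
echo gb2312_hanzi_count_aligned("\xb0\xa1\xb0\xa2\xa3\xba\xa1\xae"), "\n"; // 2
```

Because every two-byte character is consumed as a unit, neither punctuation nor neighboring hanzi can combine into a false match.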

For a GB2312 page, call the GB2312 function defined above like this:

<?php
// $text holds the GB2312-encoded string to count.
$word_count = str_gb2312_mix_word_count($text);
?>
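If the mbstring extension is available, another option is to convert GB2312 input to UTF-8 once and count it with a single UTF-8 routine. The following is only a sketch under that assumption: the helper name is hypothetical, 'EUC-CN' is the mbstring name for the GB2312 byte encoding, and the pattern reuses the same code point ranges as UTF8_CHINESE_PATTERN.

```php
<?php
// Hypothetical helper, not part of the original functions.
// Assumes the mbstring extension is loaded.
function gb2312_to_utf8_hanzi_count(string $gb2312Text): int
{
    // Convert the GB2312 (EUC-CN) bytes to UTF-8.
    $utf8 = mb_convert_encoding($gb2312Text, 'UTF-8', 'EUC-CN');
    // Count hanzi per code point; same ranges as UTF8_CHINESE_PATTERN.
    return (int) preg_match_all('/[\x{4e00}-\x{9fff}\x{f900}-\x{faff}]/u', $utf8);
}

// "\xb0\xa1\xb0\xa2" is "啊阿" in GB2312.
echo gb2312_to_utf8_hanzi_count("\xb0\xa1\xb0\xa2"), "\n"; // 2
```

With the /u modifier the pattern is matched per code point rather than per byte, so the byte-pairing problem described earlier does not arise.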

In a WordPress theme or plugin (UTF-8), call the UTF-8 function defined above like this:

<?php
// $content holds the post HTML; strip the tags before counting.
$word_count = str_utf8_mix_word_count(strip_tags($content));
?>
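One caveat with passing strip_tags() output straight to the counter: strip_tags() leaves HTML entities in place, so "&nbsp;" would add "nbsp" to the English word count, and it removes tags without inserting whitespace, so words in adjacent paragraphs can fuse. A possible preprocessing step is sketched below; the helper name is hypothetical and the space-before-tag trick is a simplification, not a full HTML parser.

```php
<?php
// Hypothetical helper: turn post HTML into plain text suitable for counting.
function plain_text_for_count(string $html): string
{
    // Put a space before each tag so words in adjacent blocks do not fuse.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Decode entities such as &nbsp; so their names are not counted as words.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$plain = plain_text_for_count('<p>Hello&nbsp;world</p><p>你好</p>');
// $plain can now be passed to a counting function instead of strip_tags($content).
```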
