PHP multibyte character dumbness
I got bit this week by something that I have somehow managed to avoid through sheer dumb luck until now: PHP is clueless when it comes to handling UTF-8 strings.
I couldn’t figure out why a string kept coming out with gibberish on the end. The app was truncating the string because it was too long, and the spot where it got truncated was a single quote mark. My first guess was that it was one of those goofy curly quote marks, and that it had been incorrectly encoded somewhere along the way and was not proper UTF-8. Once I had determined it was not an encoding problem, one of the guys at work reminded me that UTF-8 characters can be multibyte, and that perhaps it had something to do with that. After a little research, sure enough, the substr function I was using is one of the dumb functions: it counts bytes, not characters, so it was chopping that multibyte character in half and leaving only the gibberish.
In PHP versions before 6, the core string functions generally disregard the fact that characters can be multibyte and just assume 1 byte = 1 character. In my research I ran across a great page with a rundown of the problematic string functions and their level of risk when dealing with UTF-8: Handling UTF-8 with PHP [Web Application Component Toolkit].
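To make the "1 byte = 1 character" assumption concrete, here is a minimal sketch of the failure. The sample string and variable names are made up for illustration, and it assumes the source text is UTF-8, where the curly apostrophe (U+2019) takes three bytes.

```php
<?php
// "It’s" with a curly apostrophe (U+2019), which is 3 bytes in UTF-8: E2 80 99.
// Written as byte escapes so the example does not depend on the file's encoding.
$text = "It\xE2\x80\x99s";

// strlen() counts bytes, not characters: 4 characters, but 6 bytes.
$byteLength = strlen($text);

// substr() cuts at a byte offset, so this keeps "It" plus only the
// first byte of the three-byte apostrophe.
$chopped = substr($text, 0, 3);

// preg_match() with the /u modifier fails on malformed UTF-8, which makes it
// a quick validity check that needs no extra extensions.
$stillValid = (preg_match('//u', $chopped) === 1);

echo $byteLength, "\n";        // 6
echo bin2hex($chopped), "\n";  // 4974e2
var_dump($stillValid);         // bool(false)
```

That dangling E2 byte is the gibberish: on its own it is not a valid UTF-8 character, so whatever displays the string has to guess.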
My particular situation was resolved by not chopping in the middle of words and instead finding the preceding whitespace and chopping there. If you don’t have the luxury of tweaking your logic, I highly recommend taking a look at the Multibyte String Library, or procrastinating until PHP6 is out :)
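Here is a rough sketch of the chop-at-whitespace idea. It is not the actual code from the app: the function name truncateAtWhitespace and the sample string are hypothetical, and the limit is assumed to be in bytes. It works with the plain byte-based functions because an ASCII space can never appear inside a multibyte UTF-8 character, so cutting at a space never splits one.

```php
<?php
// Hypothetical helper: truncate $text to at most $maxBytes bytes,
// cutting only at a space so no multibyte character is split.
function truncateAtWhitespace($text, $maxBytes)
{
    if (strlen($text) <= $maxBytes) {
        return $text; // Already short enough.
    }

    // Look one byte past the limit so a space sitting exactly at the
    // boundary still counts as a valid cut point.
    $window = substr($text, 0, $maxBytes + 1);
    $lastSpace = strrpos($window, ' ');

    if ($lastSpace === false) {
        // No space to fall back on; this sketch simply returns nothing
        // rather than risk cutting through a character.
        return '';
    }

    return rtrim(substr($text, 0, $lastSpace));
}

// "Don’t stop me now", with the curly apostrophe written as UTF-8 bytes.
$title = "Don\xE2\x80\x99t stop me now";

$short = truncateAtWhitespace($title, 10);
echo $short, "\n"; // Don’t

// If the Multibyte String extension is loaded, mb_substr() counts
// characters instead of bytes, so it can cut mid-word safely.
if (function_exists('mb_substr')) {
    echo mb_substr($title, 0, 4, 'UTF-8'), "\n"; // Don’
}
```

The trade-off is that the whitespace approach can come back shorter than the limit (or empty, if the first word alone is too long), while mb_substr gives you an exact character count but needs the extension installed.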






