A Character Conversion Script
Published Saturday December 10th, 2005

I’ve incorporated GeSHi, a generic syntax highlighter class for PHP, into my weblog, so now I can easily insert code into entries and have it highlighted! Here’s my first entry to use this new ability.

Here’s the scenario behind this Javascript. You’ve got a form in which users are inputting UTF-8 encoded characters, but the database table in which you are storing the user-submitted data has its character encoding set to ISO-8859-1 (also known as latin1, the Western European character set). You don’t want to filter the text on the server side (too complex, too lazy, whatever), you don’t want to convert the database to UTF-8 (pain in the ass?), or you simply don’t have access to the server side of things. Here is a Javascript solution that works across all the popular, modern browsers.

Why did I create this script? Simply because the described scenario came up at work. Why am I sharing this with the world? When it came to writing a script to replace single characters, I had a difficult time finding an existing one, and there wasn’t much documentation on the functions used. What exactly does this Javascript do? It converts Unicode characters to ISO Latin-1 characters based on their character code (charcode). It acts as a workaround for converting between character sets (encodings, or charsets) by substituting characters that are standard in both Unicode and ISO Latin-1.

Before I continue, it should be noted that this Javascript will work with both Javascript 1.2, where the functions used work with ISO-Latin-1 (ISO 8859-1) values, and Javascript 1.3, where the functions work with Unicode values. (Citation) This is because the first 256 Unicode character codes (0 through 255) are identical to those of ISO-8859-1. This means that the straight quote ( " ) has the same character code, 34, in Unicode as it does in ISO-8859-1, but the curly quote ( “ ), character code 8220, has no counterpart because it does not exist in the ISO-8859-1 charset.
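
To make that overlap concrete, here is a minimal sketch. The helper name isLatin1 is my own invention for illustration and is not part of the script; it just checks whether a character’s code falls in the 0 to 255 range that Unicode shares with ISO-8859-1.

```javascript
// Hypothetical helper (not part of the original script): true when the
// first character of ch has a code in the 0-255 range shared by
// Unicode and ISO-8859-1.
function isLatin1(ch) {
  return ch.charCodeAt(0) <= 255;
}

console.log('"'.charCodeAt(0));      // 34, the straight double quote
console.log('\u201C'.charCodeAt(0)); // 8220, the left curly double quote
console.log(isLatin1('"'));          // true
console.log(isLatin1('\u201C'));     // false
```

Anything for which a check like this comes back false is a candidate for a replacement rule.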

The Javascript functions used are charCodeAt() and String.fromCharCode(). These related functions may also be of use but are not used in this script: substr(), substring(), charAt(), replace(). Additionally, see this page for further documentation on Javascript string functions. If you’re interested in converting or replacing chunks of text, look into the replace() function.
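
As a rough illustration of how the two functions relate, and of how replace() can handle a chunk of text in one call, consider the sketch below. The sample strings and variable names are made up for the example.

```javascript
// charCodeAt() gives the numeric code of the character at an index;
// String.fromCharCode() turns one or more codes back into a string.
var code = 'A'.charCodeAt(0);                 // 65
var letter = String.fromCharCode(code);       // 'A'
var tm = String.fromCharCode(40, 84, 77, 41); // '(TM)'

// replace() with a global regular expression swaps every occurrence at once.
var sample = 'Wait\u2026 what\u2026';
var cleaned = sample.replace(/\u2026/g, '...'); // 'Wait... what...'

console.log(code, letter, tm, cleaned);
```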


An Example

Here is the Javascript in action. Type any of the following common Unicode characters that are not supported in ISO-8859-1 and see them converted into an ISO-8859-1 equivalent: “ ” ‘ ’ ™ …

Input:


Output:



The Code

Download this Javascript: convert-utf8-input-to-latin1-20051209.js

This first bit of HTML sets up the forms used in the demonstration above. The first form, ‘iamaform’, calls the charCleanUp() function when the ‘Convert’ submit button is clicked. If that function returns true, the form is then sent on to the server; if it returns false, the submission is cancelled.

  <script type="text/javascript" src="convert-utf8-input-to-latin1-20051209.js"></script>
  <form name="iamaform" onSubmit="return charCleanUp();">
  Input:
  <textarea name="iamatextarea" cols="50" rows="10"></textarea>
  <input type="submit" value="Convert"/>
  </form>

  <form name="iamanotherform">
  Output:
  <textarea name="iamanothertextarea" cols="50" rows="10"></textarea>
  </form>

The charCleanUp() function does two things. It sets the second textarea in the second form to the value returned by the charConversion() function, to which we pass the value from the first form’s textarea. The function then returns false to avoid the form being submitted to the server.

  function charCleanUp(){
    document.iamanotherform.iamanothertextarea.value = charConversion(document.iamaform.iamatextarea.value);
    return false;
  }

What you’d really want to do in a real-world scenario is not even have the second textarea. Instead, you would want to take the value from a form, convert the characters, and then pass it on to the server. This version of charCleanUp() does just that:

  function charCleanUp(){
    document.iamaform.iamatextarea.value = charConversion(document.iamaform.iamatextarea.value);
    return true;
  }

The function takes the value of the textarea named ‘iamatextarea’ from the form named ‘iamaform’ and passes it to the charConversion() function. The returned value is then saved back into the same textarea in the same form.

The charConversion() function looks at each character individually in the string that is passed to it. If an undesirable character code is found, we replace it with something that fits our needs more appropriately. See the comments in the code itself for further details.

  function charConversion(intext){
    var outtext = '';

    // Loop through the text one character at a time.
    for (var i = 0; i < intext.length; i++) {
      var code = intext.charCodeAt(i);

      // Set up single char to single char conversions to be made
      if (code == 8220){code = 34;} // curly double quote, open, to "
      if (code == 8221){code = 34;} // curly double quote, close, to "
      if (code == 8216){code = 39;} // curly single quote, open, to '
      if (code == 8217){code = 39;} // curly single quote, close, to '
      if (code == 8211){code = 45;} // en-dash to -

      // Set up and handle single char to multiple char replacements
      if (code == 8212){ // em-dash to --
        code = '';
        outtext = outtext + String.fromCharCode(45,45);
      }
      if (code == 8482){ // TM symbol to (TM)
        code = '';
        outtext = outtext + String.fromCharCode(40,84,77,41);
      }
      if (code == 8230){ // ellipsis to three periods
        code = '';
        outtext = outtext + String.fromCharCode(46,46,46);
      }

      // Handles all single char to single char replacements.
      // Strict comparison, so a character code of 0 is not mistaken for ''.
      if (code !== ''){
        outtext = outtext + String.fromCharCode(code);
      }
    }
    return outtext;
  }
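
If the list of rules grows, the same character-by-character idea could also be written with a lookup table instead of a chain of if statements. This is only a sketch of an alternative, not the script offered for download; the names replacements and charConversionTable are hypothetical.

```javascript
// Hypothetical lookup table: Unicode character code -> Latin-1 safe text.
var replacements = {
  8220: '"',    // left curly double quote
  8221: '"',    // right curly double quote
  8216: "'",    // left curly single quote
  8217: "'",    // right curly single quote
  8211: '-',    // en-dash
  8212: '--',   // em-dash
  8482: '(TM)', // trademark symbol
  8230: '...'   // ellipsis
};

function charConversionTable(intext) {
  var outtext = '';
  for (var i = 0; i < intext.length; i++) {
    var code = intext.charCodeAt(i);
    // Use the replacement when one is listed; otherwise keep the character.
    if (replacements.hasOwnProperty(code)) {
      outtext += replacements[code];
    } else {
      outtext += intext.charAt(i);
    }
  }
  return outtext;
}

console.log(charConversionTable('\u201CHi\u201D \u2014 it\u2019s\u2026 \u2122'));
// prints: "Hi" -- it's... (TM)
```

Adding a rule then means adding one line to the table rather than another if block.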

To figure out which characters have which character codes, this simple form and Javascript function will come in handy.

Charcode Lookup:


The HTML for this is:

  <form name="formB" onSubmit="charLookUp(formB.fieldA.value);return false;">
  Charcode Lookup:
  <input type="text" name="fieldA" maxlength="1" size="2"/>
  <input type="submit" value="Get Charcode"/>
  </form>

The charLookUp() function simply displays, in an alert() box, the character code for the first character of the string passed to it from the form. It is useful for debugging and for adding more rules to the charConversion() function.

  function charLookUp(str){
    alert(str.charCodeAt(0));
  }
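
For checking a whole string rather than one character at a time, something along these lines might help when hunting for characters that still need a rule. The function name findNonLatin1 is hypothetical, and the sample text is made up.

```javascript
// Hypothetical debugging helper: returns the characters in str whose
// code is above 255 (outside ISO-8859-1), paired with that code.
function findNonLatin1(str) {
  var found = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code > 255) {
      found.push({ character: str.charAt(i), code: code });
    }
  }
  return found;
}

// The e-acute (233) is valid Latin-1 and is skipped; the curly quotes
// (8220, 8221) and the en-dash (8211) are reported.
console.log(findNonLatin1('caf\u00E9 \u201Cquoted\u201D \u2013 done'));
```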


Conclusion

Well then. With all that said, if anyone else in the future has to do this type of thing, most of the work has already been done! No piecing things together from across the Internet.

Isn’t the syntax highlighting ability neat? I think so. I’m wondering if there would be a point to incorporating it into the commenting functionality of this weblog.

That’s it for this one.
Posted by The fatty @ 20:10, December 11, 2005
why do i even attempt to read this, lol. bravo at any rate marco!