A Character Conversion Script
Published Saturday December 10th, 2005
I've incorporated GeSHi, a generic syntax highlighter class for PHP, into my weblog, so now I can easily insert code into entries and have it highlighted! So here's my first entry to utilize this new ability.
Here's the scenario behind this JavaScript. You've got a form in which users are inputting UTF-8 encoded characters, but the database table in which you are storing the user-submitted data has its character encoding set to ISO-8859-1 (also known as latin1, the Western European character set). You don't want to filter the text on the server side (too complex, too lazy, whatever), don't want to convert the database to UTF-8 (pain in the ass?), or you simply don't have access to the server side of things. Here is a JavaScript solution that works across all the popular, modern browsers.
Why did I create this script? Simply because the described scenario came up at work. Why am I sharing this with the world? When it came to replacing single characters, I had a difficult time finding a script that does it, and there wasn't much documentation on the functions used. What exactly does this JavaScript do? It converts Unicode characters to ISO Latin-1 characters based on their character code (charcode), and works around the mismatch between character sets (charsets) by substituting characters that are standard in both Unicode and ISO Latin-1.
Before I continue, it should be noted that this JavaScript will work with both JavaScript 1.2, where the functions used work with ISO-Latin-1 (ISO 8859-1) values, and JavaScript 1.3, where they work with Unicode values. (Citation) This is because Unicode and Latin-1 share the same charcodes for the lower values: the first 256 Unicode charcodes (0 through 255) are the Latin-1 characters. This means that the straight quote ( " ) has the same character code in Unicode as it does in ISO-8859-1, but the curly quote ( “ ) does not, because it does not exist in the ISO-8859-1 charset.
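To see the overlap for yourself, a quick check like the following compares a few character codes. It is only a sketch: existsInLatin1() is a hypothetical helper name, not part of the script, and it assumes that "fits in Latin-1" simply means a charcode of 255 or lower.

```javascript
// Hypothetical helper: a charcode of 255 or lower is shared by
// Unicode and ISO-8859-1; anything higher exists only in Unicode.
function existsInLatin1(code) {
  return code >= 0 && code <= 255;
}

var straightQuoteCode = '"'.charCodeAt(0);    // straight double quote
var eAcuteCode = '\u00E9'.charCodeAt(0);      // e with acute accent
var curlyQuoteCode = '\u201C'.charCodeAt(0);  // curly double quote, open

console.log(straightQuoteCode, existsInLatin1(straightQuoteCode));
console.log(eAcuteCode, existsInLatin1(eAcuteCode));
console.log(curlyQuoteCode, existsInLatin1(curlyQuoteCode));
```

The first two characters pass the check and the curly quote does not, which is why it needs a replacement rule.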
The JavaScript functions used are charCodeAt() and String.fromCharCode(). These related functions may also be of use, but are not used in this script: substr(), substring(), charAt(), replace(). Additionally, see this page for further documentation on JavaScript string functions. If you're interested in converting or replacing chunks of text, look into the replace() function.
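For comparison, here is roughly what a replace()-based approach could look like. This is an alternative sketch, not the method used in this script; the replacements table and the replaceWithTable() name are made up for illustration.

```javascript
// Hypothetical lookup table: Unicode character -> Latin-1 friendly text.
var replacements = {
  '\u201C': '"',    // curly double quote, open
  '\u201D': '"',    // curly double quote, close
  '\u2018': "'",    // curly single quote, open
  '\u2019': "'",    // curly single quote, close
  '\u2013': '-',    // en dash
  '\u2014': '--',   // em dash
  '\u2122': '(TM)', // trademark symbol
  '\u2026': '...'   // ellipsis
};

// Replace every character listed in the table in one pass.
function replaceWithTable(text) {
  var pattern = /[\u2018\u2019\u201C\u201D\u2013\u2014\u2122\u2026]/g;
  return text.replace(pattern, function (match) {
    return replacements[match];
  });
}

console.log(replaceWithTable('\u201CHi\u201D \u2014 ok\u2026'));
```

The table keeps all the rules in one place, at the cost of having to keep the regular expression in sync with it.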
An Example
Here is the JavaScript in action. Type any of these common Unicode characters that are not supported in ISO-8859-1 (curly quotes, en dash, em dash, trademark symbol, ellipsis) and see them converted into an ISO-8859-1 equivalent:
The Code
Download this JavaScript: convert-utf8-input-to-latin1-20051209.js
This first bit of HTML sets up the forms used in the demonstration above. The first form, iamaform, calls the charCleanUp() function when the Convert submit button is clicked. The onSubmit handler returns whatever charCleanUp() returns, so the form is sent on to the server only if that function returns true.
- <script type="text/javascript" src="convert-utf8-input-to-latin1-20051209.js"></script>
- <form name="iamaform" onSubmit="return charCleanUp();">
- Input:
- <textarea name="iamatextarea" cols="50" rows="10"></textarea>
- <input type="submit" value="Convert"/>
- </form>
- <form name="iamanotherform">
- Output:
- <textarea name="iamanothertextarea" cols="50" rows="10"></textarea>
- </form>
The charCleanUp() function does two things. It passes the value of the first form's textarea to the charConversion() function and puts the returned value into the textarea of the second form. It then returns false so that the form is not submitted to the server.
- function charCleanUp(){
- document.iamanotherform.iamanothertextarea.value = charConversion(document.iamaform.iamatextarea.value);
- return false;
- }
What you'd really want in a real-world scenario is to not even have the second textarea. Instead, you would take the value from a form, convert the characters, and then pass it on to the server. This version of charCleanUp() does just that:
- function charCleanUp(){
- document.iamaform.iamatextarea.value = charConversion(document.iamaform.iamatextarea.value);
- return true;
- }
The function takes the value of the textarea named iamatextarea from the form named iamaform and passes it to the charConversion() function. The returned value is then saved back into the same textarea in the same form, and the function returns true so the form goes on to the server.
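If a form has more than one field to clean up, the same idea can be wrapped in a loop. The following is only a sketch: cleanUpFields() and its parameters are hypothetical names, and the converter is passed in as an argument so that a function like charConversion() can be plugged in.

```javascript
// Hypothetical helper: run a converter over several named fields of a form.
// `form` is anything whose named properties have a `value`, such as
// document.iamaform; `convert` is a function like charConversion().
function cleanUpFields(form, fieldNames, convert) {
  for (var i = 0; i < fieldNames.length; i++) {
    var field = form[fieldNames[i]];
    if (field && typeof field.value === 'string') {
      field.value = convert(field.value);
    }
  }
  return true; // let the form submission continue
}

// Stand-in objects so the sketch also runs outside a browser.
var fakeForm = { iamatextarea: { value: 'a\u2014b' } };
cleanUpFields(fakeForm, ['iamatextarea', 'missingfield'], function (text) {
  return text.replace(/\u2014/g, '--'); // stand-in converter: em dash to --
});
console.log(fakeForm.iamatextarea.value);
```

In a page it might be called as onSubmit="return cleanUpFields(this, ['iamatextarea'], charConversion);".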
The charConversion() function looks at each character individually in the string that is passed to it. If an undesirable character code is found, we replace it with something that fits our needs more appropriately. See the comments in the code itself for further details.
- function charConversion(intext){
- var outtext = '';
- // Loop through the text one character at a time.
- for (var i = 0; i < intext.length; i++) {
- var code = intext.charCodeAt(i);
- // Single char to single char conversions
- if (code == 8220){code = 34;} // curly double quote, open, to "
- if (code == 8221){code = 34;} // curly double quote, close, to "
- if (code == 8216){code = 39;} // curly single quote, open, to '
- if (code == 8217){code = 39;} // curly single quote, close (or curly apostrophe), to '
- if (code == 8211){code = 45;} // en dash to -
- // Single char to multiple char replacements: append the replacement
- // right away and blank out code so the character is skipped below.
- if (code == 8212){ // em dash to --
- code = '';
- outtext = outtext + String.fromCharCode(45,45);
- }
- if (code == 8482){ // TM symbol to (TM)
- code = '';
- outtext = outtext + String.fromCharCode(40,84,77,41);
- }
- if (code == 8230){ // ellipsis to three periods
- code = '';
- outtext = outtext + String.fromCharCode(46,46,46);
- }
- // Append every character that was not already handled above.
- if (code !== ''){
- outtext = outtext + String.fromCharCode(code);
- }
- }
- return outtext;
- }
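Any character above 255 that has no rule passes through the conversion unchanged, so it can be handy to check what is left over. The following is a sketch with a hypothetical helper, findUnconverted(), that lists the distinct charcodes above 255 still present in a string.

```javascript
// Hypothetical helper: list the distinct character codes above 255
// that are still present in a string, in order of first appearance.
function findUnconverted(text) {
  var seen = {};
  var leftovers = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code > 255 && !seen[code]) {
      seen[code] = true;
      leftovers.push(code);
    }
  }
  return leftovers;
}

// Euro sign (8364) and bullet (8226) are outside ISO-8859-1;
// the e with acute accent (233) is inside it.
var sample = 'Price: 5\u20AC \u2022 caf\u00E9 \u2022 done';
console.log(findUnconverted(sample)); // [ 8364, 8226 ]
```

Each code it reports is a candidate for another rule in the conversion function.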
To figure out which characters have which character code, this simple form and JavaScript function will come in handy.
The HTML for this is:
The charLookUp() function simply shows, in an alert() box, the character code of the first character in the string passed to it from a form. It is useful for debugging and for adding more rules to the charConversion() function.
- function charLookUp(str){
- alert(str.charCodeAt(0));
- }
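When looking up more than one character at a time, a variant along these lines may save some clicking. It is a sketch only; charLookUpAll() is a hypothetical name, and it returns the result as text instead of opening an alert() box.

```javascript
// Hypothetical variant of charLookUp(): returns one "character = code"
// line per character instead of alerting only the first one.
function charLookUpAll(str) {
  var lines = [];
  for (var i = 0; i < str.length; i++) {
    lines.push(str.charAt(i) + ' = ' + str.charCodeAt(i));
  }
  return lines.join('\n');
}

console.log(charLookUpAll('\u201CHi\u201D'));
```

The returned text can be passed to alert() or dropped into a textarea, whichever is more convenient.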
Conclusion
Well then. With all that said, if anyone else in the future has to do this type of thing, most of the work has already been done! No piecing things together from across the Internet.
Isn't the syntax highlighting ability neat? I think so. I'm wondering if there would be a point to incorporating it into the commenting functionality of this weblog.
That's it for this one.
Posted by The fatty @ 20:10, December 11, 2005
why do i even attempt to read this, lol. bravo at any rate marco!