Brief

ATutor used to use the latin1 character set and the ISO-8859-1 encoding in its database structures and most of its page headings.  Problems arose when multilingual content was used in ATutor.  A common symptom was characters being displayed as a series of black-diamond question marks, like ������.  The goal of this conversion was to migrate all of ATutor from latin1 to UTF-8.
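
One common way such black-diamond characters appear is when bytes written in latin1 are decoded as if they were UTF-8.  The following is a minimal sketch of that effect in Python (ATutor itself is written in PHP; the sample word is only an illustration):

```python
# Text stored as latin1 (ISO-8859-1) bytes, as in a pre-1.6 database.
stored_bytes = "été".encode("iso-8859-1")   # b'\xe9t\xe9'

# If those bytes are decoded as UTF-8, each lone 0xE9 byte is invalid
# and is replaced by U+FFFD, the "black diamond question mark".
shown = stored_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in works.
correct = stored_bytes.decode("iso-8859-1")

print(ascii(shown))        # '\ufffdt\ufffd'
print(correct == "été")    # True
```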

This major change took place in ATutor 1.6.

Problems that were solved

There were two main issues with this major change.  One was to adapt all the string functions to UTF-8, which meant most of the ATutor code had to be modified; the other was to convert existing content from its original encoding(s) to UTF-8 without losing any content during the upgrade (this only affects existing ATutor installations).  The second issue raises a crucial question: which encoding to assume during the conversion, since some installations may have several courses with different default languages.
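
To illustrate why byte-oriented string functions need adjusting for UTF-8, here is a small sketch in Python (not ATutor's PHP code) contrasting character counts with byte counts, and showing how cutting a string at a byte offset can split a multi-byte character:

```python
text = "naïve café"
utf8_bytes = text.encode("utf-8")

char_count = len(text)        # counts characters
byte_count = len(utf8_bytes)  # counts bytes; "ï" and "é" take two bytes each

# Cutting at a byte offset can split a multi-byte character in half.
cut_bytes = utf8_bytes[:3]    # b'na\xc3': only the first byte of "ï"
try:
    cut_bytes.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

# Cutting at a character offset keeps every character whole.
cut_chars = text[:3]

print(char_count, byte_count, byte_cut_is_valid)   # 10 12 False
```

A function that measures or slices by bytes gives the right answer for latin1, where every character is one byte, but not for UTF-8.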

More in-depth technical problems are addressed here: http://www.phpwact.org/php/i18n/utf-8

Solutions

  1. Make use of the mbstring library in PHP, and replace most of the affected string functions with their mbstring equivalents.  However, not all PHP versions have this library enabled by default.  Note: the PHP mbstring library was said to be unstable until recent versions.
  2. Make our own multi-byte string library.
  3. When an ATutor installation contains more than one language, provide a feasible mechanism to convert the content from different encodings to UTF-8.

Converting content for ATutor 1.6 (Using the installation/upgrade page)

If you are not upgrading from previous ATutor versions, then no conversion is needed. 

If you are using only UTF-8 language packs in your pre-1.6 ATutor installation, you may skip this step by choosing 'Option 3' below.  It is likely that all your content is already encoded in UTF-8. 

When upgrading ATutor from 1.5.5 (or earlier) to 1.6, several issues arise if your ATutor content uses more than one encoding.  The database tables then contain text in multiple languages, and we need a way to know which content belongs to which encoding.  There is no apparent way to do so short of examining the characters in the database one by one, and the computational cost of that method is too high.  A better approach is to make use of the "Primary Course Language" in the course's "Properties".  This relies on the assumption that most courses use mainly one language for their content.  For instance, a course that teaches French would have "French" as its "Primary Course Language", and would have only French or English in its course content. 

To handle these issues, we have provided three options for the user to convert content to UTF-8 in 'Step 2' of the upgrade.

Option 1, Convert all content

  • Converts all content in ATutor from one encoding to UTF-8.
  • Use this when you have just one non-UTF-8 language in your database, for instance an ATutor installation that uses English only, or one that uses Thai only.
  • Some ATutor installations have several languages installed (e.g. English, French, German, Thai) while all the content is in one language (e.g. English); this is also the option for them.  You will then be converting from that one language's encoding (e.g. English, iso-8859-1) to UTF-8.

Option 2, Convert content by courses

  • Detects the 'Primary Course Language' of each course and converts the content to UTF-8 based on the character set associated with that language.
  • Use this when different courses use different languages, for instance an ATutor installation that has two courses, one teaching English and another teaching Thai.  In this case, all the content in the English course is encoded in latin1 characters, and the content in the Thai course is encoded in Thai TIS-620 characters.  This option allows ATutor to detect each course's Primary Course Language and convert each course's content to UTF-8 independently. 
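
The per-course idea can be sketched as follows.  This is an illustration in Python, not ATutor's actual upgrade code: the language-to-charset table, the `Course` type and the `convert_course` helper are all hypothetical names made up for the example.

```python
from typing import Dict, NamedTuple

# Hypothetical mapping from a course's primary language code to the legacy
# character set its content is assumed to be stored in.
LANGUAGE_CHARSETS: Dict[str, str] = {
    "en": "iso-8859-1",
    "th": "tis-620",
    "zh": "big5",
}


class Course(NamedTuple):
    title: str
    primary_language: str
    content: bytes  # raw bytes as stored before the upgrade


def convert_course(course: Course, fallback: str = "iso-8859-1") -> bytes:
    """Re-encode one course's content to UTF-8 using its primary language."""
    charset = LANGUAGE_CHARSETS.get(course.primary_language, fallback)
    # errors="replace" keeps going on bytes that do not fit the assumed
    # charset, marking them with U+FFFD instead of aborting.
    return course.content.decode(charset, errors="replace").encode("utf-8")


courses = [
    Course("English 101", "en", "café".encode("iso-8859-1")),
    Course("Thai 101", "th", "สวัสดี".encode("tis-620")),
]
converted = {c.title: convert_course(c) for c in courses}
```

Each course is decoded with its own charset, so the Thai bytes are never misread as latin1.  Content written in a language other than the course's primary one would still be converted incorrectly, which is the trade-off of this assumption.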

Option 3, Skip conversion

  • Will not convert the database content to UTF-8, because the data is assumed to be in UTF-8 already.
  • Will modify the database structure.  This involves changing the charset and collation of each table column to UTF-8.
  • Use this when you are certain that your ATutor is UTF-8 compatible, which also means that your ATutor contains no non-UTF-8 characters. 
  • This is generally the right option if you use only UTF-8 language packs in your ATutor and all its content was entered under those UTF-8 language packs.
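
If you want a rough check before choosing this option, one approach is to test whether sample content decodes cleanly as UTF-8.  The sketch below is in Python and is not part of the ATutor upgrade; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "café".encode("utf-8")
latin1_sample = "café".encode("iso-8859-1")

print(is_valid_utf8(utf8_sample))    # True
print(is_valid_utf8(latin1_sample))  # False
print(is_valid_utf8(b"plain ASCII")) # True
```

Pure ASCII text passes this check in either case, so the check is only informative for content that contains accented or non-Latin characters.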

FAQ

Q: English is the only language I use in ATutor.  Which option should I use?
A: Option 1, and make sure the encoding it shows is ISO-8859-1.

Q: I have courses in 3 different languages (Greek, Italian, English), but when I get to 'Step 2', the 'Primary Course Language' shows only "iso-8859-1".  Why is that?
A: You have to set the "Primary Course Language" of each course according to its content.  It can be set in the course's "Properties".

Q: I chose "BIG5" as my Option 2 encoding.  After the upgrade was completed I reinstalled all my UTF-8 language packs, but some characters are missing or look wrong.  Why is that?
A: Either those characters were stored in some other encoding, or the PHP mbstring library doesn't support them.  I would suggest retyping those characters manually.

Q: Why doesn't importing/inserting text directly through mysql (or through a csv file) work with ATutor 1.6?
A: The client side and the server side have to use the same encoding.  For instance, XAMPP's mysql uses 'Latin1' as its default server-side charset; if you open phpMyAdmin and choose UTF-8 there, data will be sent out as UTF-8 and received as Latin1.  You will have to fix one of the connection encodings, or just add the line mysql_query('SET NAMES utf8', $db); in vitals.inc.php right after the connection is made.  (MySQL names this character set utf8, without a hyphen.)  Warning: changing the vitals file may affect an upgraded system.
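
The effect of such a client/server mismatch can be reproduced without a database.  The sketch below is in Python and only simulates the byte handling, using Python's latin-1 codec as a stand-in for the server's Latin1 charset:

```python
original = "é"

# The client sends the text as UTF-8: two bytes, 0xC3 0xA9.
sent = original.encode("utf-8")

# The server believes the connection is Latin1, so it reads each byte
# as a separate character and stores two characters instead of one.
received = sent.decode("latin-1")

# Reversing the wrong step recovers the text, as long as nothing else
# has altered the stored value in the meantime.
recovered = received.encode("latin-1").decode("utf-8")

print(len(original), len(received))  # 1 2
print(recovered == original)         # True
```

Here `received` is the two-character string "Ã©", the typical look of UTF-8 text that has been read as Latin1.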

Q: I found a bug after upgrading: my mailbox's messages are all corrupted.  It seems the messages were converted to some other language...
A: This is a flaw, not a bug.  Your inbox messages can contain any of the 32 languages, and our system cannot tell which encoding to convert your mailbox from unless we check each message from each user line by line, character by character.  For instance, you may have Japanese, Greek, Chinese, and English mail in your inbox.  There is no way the system can identify and convert the messages accordingly without introducing more complexity into the current conversion algorithm.  We have decided to use the System's Default Language (in the Admin's System Preferences) for this type of conversion.  As a result, some characters will inevitably be converted incorrectly. 

Converting content for ATutor 1.6 (Manually, not using the installation/upgrade page)

This is only a brief outline of the conversion steps; please refer to the 1.6 documentation for full details.  These steps are most likely carried out automatically during the upgrade.

Things you have to know before carrying on:

  • The default MySQL encoding/charset of your ATutor database.  (If you are not sure, you can check the charset in your mysql status.  The default should be 'latin1'.)
  • If you are already using UTF-8 language packs in your ATutor, and all the data (including content, forums, polls, files, glossary, etc.) was entered in UTF-8, then you do not have to carry out this conversion.
  • If you are using Windows, you will have to get a copy of the GnuWin32 ICONV tool; if you are using Unix, ICONV should already be included.
  • All language packs need to be reinstalled; please refer to http://www.atutor.ca/atutor/translate/index.php for the most up-to-date ATutor UTF-8 language packs.

Steps

  1. Back up your database.
  2. This step will create a file called "atutor_iso.sql" that contains all of your ATutor content in the defined character set.
    Perform a mysqldump from the command prompt (mysqldump is a separate program, not a statement typed at the mysql> prompt):
    > mysqldump --opt --default-character-set=<charset> -u <user> -p <database_name> > atutor_iso.sql
    <charset> is the character set of your database; the default should be latin1 (ISO-8859-1)
    <user> is the ATutor database user
    <database_name> is your ATutor database name
    For example, on my ATutor database, it would be:
    mysqldump --opt --default-character-set=latin1 -u harris -p atutor > atutor_iso.sql
  3. Alter the charset of all tables to utf8, and their collation to utf8_general_ci or utf8_unicode_ci.
  4. This step will convert all the content to the UTF-8 character set.
    > iconv -c -f <encoding> -t utf-8 atutor_iso.sql > atutor_utf8.sql
    -c means iconv will ignore errors and carry on.  The drawback is that content that cannot be converted to UTF-8 will be left out; check the iconv manual for more options.
    <encoding> is your character set encoding; if you are using latin1, the encoding is latin1 or ISO-8859-1.  For other character sets, please refer to your mysql manual.
    For example, on my ATutor database, it would be:
    iconv -c -f latin1 -t utf-8 ../../../mysql/bin/atutor_iso.sql > atutor_utf8.sql
    Note: You need to specify the path to your files.
    If no errors are printed, then you have successfully converted all your content to UTF-8.
  5. This step will import utf-8 content back to your ATutor database.
    > mysql -u <user> -p <database_name> < atutor_utf8.sql
    <database_name> is your ATutor database name
    For example, on my ATutor, it would be:
    mysql -u harris -p atutor < atutor_utf8.sql
    One problem is that the language packs will also be converted.  You will just have to re-import the language packs after the conversion.
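
For readers without iconv at hand, the re-encoding in step 4 can be approximated with a short script.  This is a sketch in Python under stated assumptions, not a replacement for the documented procedure: `convert_dump` is a hypothetical helper, the sample SQL line and table name are made up, and the whole file is read into memory, which may not suit very large dumps.

```python
import tempfile
from pathlib import Path


def convert_dump(src: Path, dst: Path, encoding: str,
                 drop_invalid: bool = True) -> None:
    """Re-encode a dump file from `encoding` to UTF-8.

    With drop_invalid=True, bytes that cannot be decoded are silently
    dropped, similar in spirit to `iconv -c`; otherwise an error is raised.
    """
    errors = "ignore" if drop_invalid else "strict"
    text = src.read_bytes().decode(encoding, errors=errors)
    dst.write_bytes(text.encode("utf-8"))


# Small self-contained demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "atutor_iso.sql"
    dst = Path(tmp) / "atutor_utf8.sql"
    src.write_bytes("INSERT INTO t VALUES ('café');\n".encode("latin-1"))
    convert_dump(src, dst, "latin-1")
    result = dst.read_bytes().decode("utf-8")
```

As with `-c`, dropping invalid bytes loses content silently, so running once with `drop_invalid=False` first shows whether anything would be lost.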
