Notepad2 ― Encoding Tutorial

The following document is an overview of the enhanced encoding support introduced with Notepad2 version 4.0. Please note that this information does not apply to earlier versions of Notepad2.

  1. Default ANSI Mode
  2. Recoding a File
  3. Using Encoding Tags
  4. Setting Default Options
  5. Encoding Conversion
  6. Possible Loss of Data
  7. Editing MS-DOS (OEM) Files
  8. Outlook

Default ANSI Mode

Let's assume you have received a simple html file that displays some letters from the Greek alphabet. This file was created on a Macintosh computer and saved in the Mac (Greek) code page. Opening this file with Notepad2 (or any other text editor) on an English Windows system usually results in incorrect display.

The status bar indicates that Notepad2 is working in ANSI mode, which means that the displayed text is interpreted as if it were encoded in the system default ANSI code page. Plain text files that Notepad2 does not automatically detect as UTF-8 or another Unicode format are usually encoded in the system default ANSI code page, which is why ANSI mode is the recommended default setting.

Recoding a File

In the case of the html file transferred from the Mac, however, the file contents need to be reinterpreted as being encoded in the Mac (Greek) encoding. To achieve this, Notepad2 provides the Recode command (F8), which reloads the file from disk and reinterprets its contents in the desired encoding.

The file is now displayed properly.
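
The effect of recoding can be reproduced outside Notepad2. The following is a minimal Python sketch, not Notepad2's own code. It assumes that Python's "mac_greek" codec corresponds to the Mac (Greek) code page and that the system default ANSI code page is Windows-1252 ("cp1252"); both names are illustrative assumptions.

```python
# Bytes as the Macintosh editor would have written them to disk.
greek_text = "αβγ"
raw_bytes = greek_text.encode("mac_greek")

# ANSI mode: the bytes are read as Windows-1252 (assumed system default).
ansi_view = raw_bytes.decode("cp1252", errors="replace")

# Recode: the very same bytes are reinterpreted as Mac (Greek).
# Nothing on disk changes, only the interpretation.
recoded_view = raw_bytes.decode("mac_greek")

print(ansi_view)     # wrong characters instead of Greek letters
print(recoded_view)  # the original Greek letters
```

Recoding never converts the data; it only changes the table used to map bytes to characters.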

If the file has already been modified before recoding it, you'll first be asked whether to save your changes. This is usually no problem with simple changes consisting only of ASCII characters (such as adding or modifying an encoding tag, as explained below), as most encodings use the same code points for the 7-bit ASCII character set. However, if complex changes have already been made before realizing that the file was loaded with the wrong encoding, it's recommended to copy the modifications to the clipboard, recode the file without saving, and then reinsert the changes from the clipboard.

Using Encoding Tags

To have the sample html file loaded with the correct encoding the next time it is opened, an html content-type meta tag can be added, specifying Mac (Greek) as the file encoding.

Notepad2 is able to parse encoding tags and automatically perform the correct encoding conversion (or rather, reinterpretation) upon loading a file. This feature is based on the basic Emacs file variable support introduced with Notepad2 version 3.0 (I added the complicated, initially error-prone file variable parsing precisely so that encoding tags could be supported at a later stage).

Encoding tag parsing is performed according to the following rules: the first (and, if nothing is found, also the last) 512 bytes of a file are scanned for statements determining the file encoding, such as "coding: ibm437" or "encoding=latin2". The statement with the highest priority is encoding, as this is usually found at the top of XML files. If no such expression is present, charset and finally coding instructions are searched for.

Encoding tag parsing works with any file type, not just html, as illustrated by the following sample comment at the top of a CGI script:

# mode: perl;
# coding: koi8-r;
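
The scanning rules can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not Notepad2's actual parser: the function name find_encoding_tag is hypothetical, and the accepted separators (":" or "=") and optional quotes are guesses based on the examples above.

```python
import re

SCAN_BYTES = 512  # window size taken from the rules above


def find_encoding_tag(data: bytes):
    """Return the encoding identifier named by an encoding tag, or None."""

    def scan(window: bytes):
        text = window.decode("ascii", errors="replace")
        # Keywords in priority order: encoding, then charset, then coding.
        for keyword in ("encoding", "charset", "coding"):
            pattern = (
                r"(?<![A-Za-z])" + keyword
                + r"\s*[:=]\s*[\"']?([A-Za-z0-9_.-]+)"
            )
            match = re.search(pattern, text)
            if match:
                return match.group(1).lower()
        return None

    # Scan the first 512 bytes; only if nothing is found, scan the last 512.
    return scan(data[:SCAN_BYTES]) or scan(data[-SCAN_BYTES:])


print(find_encoding_tag(b"# mode: perl;\n# coding: koi8-r;\n"))
```

The look-behind in the pattern keeps the keyword "coding" from matching inside the word "encoding".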

The encoding identifiers (i.e. the part of the encoding tag actually specifying the name of the encoding) recognized by Notepad2 are mostly taken from the list of identifiers used by Internet Explorer, narrowed a bit to the most common forms. To insert an encoding identifier for the currently selected file encoding, press Ctrl+F8 (this command is not available for UTF-16 and some other encodings that do not share the 7-bit ASCII character set).

Encoding tags are ignored if the file contains a UTF-16 little-endian or big-endian byte order mark, or a UTF-8 signature, all of which unambiguously determine the file encoding. Encoding tags are also ignored if the file is manually recoded by the user (F8). Additionally, Notepad2 can be launched with the --e command line switch to specify a source encoding:

notepad2.exe --e x-mac-greek encoding-test.html
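
The signature check that takes precedence over encoding tags can be sketched like this. It is a simplified Python illustration, not Notepad2's implementation; the function names detect_signature and tags_are_honored are hypothetical.

```python
import codecs


def detect_signature(data: bytes):
    """Return an encoding name if the data starts with a Unicode signature."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def tags_are_honored(data: bytes, manually_recoded: bool = False) -> bool:
    """Encoding tags count only without a signature and without manual recoding."""
    return detect_signature(data) is None and not manually_recoded


print(tags_are_honored(b"# coding: koi8-r;\n"))
print(tags_are_honored(codecs.BOM_UTF8 + b"# coding: koi8-r;\n"))
```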

Encoding tag parsing can be disabled permanently in File, Encoding, Default, or temporarily bypassed when reloading the file with Alt+F8 (the latter also affecting other file variables, like syntax scheme, tab-width, etc.).

Setting Default Options

This dialog box (File, Encoding, Default) provides a way to predefine the encoding for non-Unicode and non-UTF-8 files, namely by setting the desired encoding as the default. As already mentioned above, the recommendation is to use either ANSI or UTF-8 as the default. This way, no encoding conversion needs to be performed when loading non-UTF-8 files.

The "Open 7-bit ASCII files in UTF-8 mode" setting can be used to have ASCII files always loaded as UTF-8. Normally, this only happens if UTF-8 is set as the default. This option can be enabled temporarily by pressing Shift+F8, which will reload the file from disk and open it in UTF-8 mode (unlike explicit recoding as UTF-8, this action will only open valid UTF-8 files in UTF-8 mode, and otherwise fall back to ANSI or the default encoding, respectively).
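
The "only if valid, otherwise fall back" behavior of Shift+F8 can be illustrated with a short Python sketch. The function name open_as_utf8_if_valid is hypothetical, and "cp1252" merely stands in for the ANSI or default encoding; Notepad2's own validity check may differ in detail.

```python
def open_as_utf8_if_valid(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding used): UTF-8 if the bytes are valid UTF-8."""
    try:
        # Pure 7-bit ASCII is always valid UTF-8, so it ends up here as well.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back instead of forcing the interpretation.
        return data.decode(fallback, errors="replace"), fallback


print(open_as_utf8_if_valid(b"plain ascii"))
print(open_as_utf8_if_valid(b"caf\xe9"))
```

Explicit recoding as UTF-8 would instead force the UTF-8 interpretation even for the second input.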

The Windows XP encoding detection routines sometimes deliver wrong results for ANSI files, reporting the format as Unicode without a byte order mark, which results in garbled display. This problem has been fixed in more recent versions of Windows. Enable "Skip automatic Unicode detection" if you want to rely solely on Unicode byte order marks for Unicode recognition.

Encoding Conversion

Now you may want to convert the sample Mac (Greek) html file to a different encoding, e.g. Greek (ISO). That's very easy: just hit F9 and make your choice.

Don't forget to update the file encoding tag, as otherwise the Greek (ISO) code points are still interpreted as Mac (Greek), and won't display properly after reloading. As mentioned above, press Ctrl+F8 to insert a common encoding identifier for the currently selected encoding.
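
Unlike recoding, conversion changes the bytes that are written to disk. The following Python sketch shows both steps, converting the text and updating the tag. It is not Notepad2's code: the function name convert is hypothetical, Python's "mac_greek" and "iso8859_7" codecs are assumed to match Mac (Greek) and Greek (ISO), and "iso-8859-7" is assumed as the identifier for the new tag.

```python
import re


def convert(data: bytes, source: str, target: str, target_tag: str) -> bytes:
    """Decode with the source encoding, update the charset tag, re-encode."""
    text = data.decode(source)
    # Replace the identifier of the first charset statement, if any.
    text = re.sub(
        r"(charset\s*=\s*)[A-Za-z0-9_.-]+",
        lambda m: m.group(1) + target_tag,
        text,
        count=1,
    )
    return text.encode(target)


original = "<meta charset=x-mac-greek>\nαβγ".encode("mac_greek")
converted = convert(original, "mac_greek", "iso8859_7", "iso-8859-7")
print(converted.decode("iso8859_7"))
```

If the tag were left unchanged, a later load would reinterpret the new Greek (ISO) bytes as Mac (Greek) again.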

Possible Loss of Data

Internally, text is stored in memory as UTF-8, except in ANSI mode. That's why, for most encodings, characters that the file encoding cannot represent can still be entered. In this case (e.g. when adding some Spanish accents or German umlauts to a Greek (ISO) file), a warning about a potential loss of data will be issued when saving the file.

The Windows API code page conversion routines usually replace unsupported characters with question marks, which appear the next time the file is reopened. Before closing the file and thus discarding the original, loss-free text from memory, try saving again with another encoding, and test if it supports all the necessary characters. Alternatively (and preferably) switch to UTF-8 or UTF-16 to make sure no characters are lost.
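
The same kind of loss can be demonstrated in Python, whose "replace" error handler also substitutes question marks. This only mimics the behavior described above and does not call the Windows API; the function name unrepresentable is hypothetical and "iso8859_7" is assumed to correspond to Greek (ISO).

```python
def unrepresentable(text: str, encoding: str):
    """Return the characters the encoding cannot represent, in first-seen order."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


text = "αβγ señor Müller"

# Check before saving: which characters would be lost in Greek (ISO)?
print(unrepresentable(text, "iso8859_7"))

# Saving anyway replaces each of them with a question mark.
lossy = text.encode("iso8859_7", errors="replace")
print(lossy.decode("iso8859_7"))

# UTF-8 can represent every character, so nothing is lost.
print(unrepresentable(text, "utf-8"))
```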

Editing MS-DOS (OEM) Files

Editing MS-DOS (OEM) text files integrates with enhanced encoding support. By default, OEM files are loaded in ANSI mode.

The hotkey Ctrl+Shift+O recodes the currently loaded file in the system default OEM code page. It's no longer necessary to convert text back to OEM when saving the file. Note: a "coding:oem;" tag can be used to indicate a file is encoded in the system default OEM code page.

The analogous hotkey Ctrl+Shift+A recodes the currently loaded file in the system default ANSI code page. As of Notepad2 version 4.2.25, the hotkey Ctrl+Shift+F recodes the currently loaded file in the code page set as default by the user (File, Encoding, Default).
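
The difference between the ANSI and OEM interpretations of the same file can be illustrated in Python. The sketch assumes code page 437 as the system default OEM code page and Windows-1252 as the ANSI code page; the actual code pages depend on the Windows installation.

```python
# Box-drawing characters and umlauts are typical of MS-DOS text files.
line = "┌─┐ Größe"
raw = line.encode("cp437")  # bytes as an MS-DOS program would write them

# Default load in ANSI mode: wrong characters are shown.
ansi_view = raw.decode("cp1252", errors="replace")

# Recoding as OEM (Ctrl+Shift+O) reinterprets the same bytes correctly.
oem_view = raw.decode("cp437")

# Saving in OEM mode writes the original byte values back unchanged.
saved = oem_view.encode("cp437")

print(ansi_view)
print(oem_view)
```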

Outlook

At the current development stage, Notepad2 supports a handpicked list of more than 60 different encodings, provided that the appropriate code page conversion tables are installed on your system. I may add support for more encodings in the future.

Notepad2 also supports some encodings for right-to-left languages, such as Arabic or Hebrew. However, true bidirectional text editing is not possible with Notepad2, so far.


© Florian Balmer 1996-2014
Page last modified: June 29, 2014