Tomcat and accented/Cyrillic/Chinese characters (charset problems)

The symptom is easy to spot: you build a page with a form, and the accented characters come back as a pair of “strange” characters. This is just one example of a charset mismatch, and every web developer should know how to handle it.

When you serve a page to a browser from a JSP, you’re actually sending a stream of bytes and the browser interprets those bytes as characters which will be shown to the user.

But a byte only ranges from 0 to 255, and those values are not enough to represent all the characters used in the world. If you have only worked with Western languages, you may never have run into the problem. But as soon as you start working with the Cyrillic alphabet, or with Chinese, problems arise.

The hidden ingredient here is the charset. A charset is a way of mapping bytes to characters. Some charsets map one byte to one character; others map one or more bytes to one character.

So, when you send a sequence of bytes to a browser, you should tell it which charset to use to interpret the sequence and convert it into readable text.

Most of the time the default charset (when nothing is specified) is ISO-8859-1, which only covers Western languages. This charset does not contain Cyrillic characters, for example, so you cannot produce a stream of bytes to be interpreted as ISO-8859-1 and expect it to carry Russian text.

Working with JSP makes the problem harder to understand, since Java internally represents characters with two bytes each (UTF-16 code units), using part of the Unicode standard. So it seems easy to create a JSP with Cyrillic characters… but then the browser does not display them correctly.

What is missing is an explicit declaration of the charset to use when Tomcat takes the strings generated by the JSP and turns them into a stream of bytes to send to the browser.

The assumed charset is ISO-8859-1, but a Cyrillic character has no mapping in that charset, so Tomcat (Java) has to replace it with a question mark.
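To see where the question mark comes from, here is a small sketch in plain Java, outside Tomcat. The class and method names are mine, chosen only for illustration; the sketch relies on the fact that String.getBytes(Charset) substitutes the charset's replacement byte for characters it cannot map.

```java
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {

    // Encode the text as ISO-8859-1, then decode the same bytes as ISO-8859-1.
    // Characters the charset cannot represent do not survive the trip.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00E8 is "è": it exists in ISO-8859-1, so it comes back unchanged.
        System.out.println(roundTripLatin1("\u00E8").equals("\u00E8")); // true
        // \u042F is the Cyrillic "Я": it has no mapping and becomes "?".
        System.out.println(roundTripLatin1("\u042F"));                  // ?
    }
}
```

The characters are written as Unicode escapes so that the example does not depend on the encoding of the source file itself.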

To make the whole thing work, you can declare the charset in every JSP so that it is passed on to Tomcat. Given the charset, Tomcat will produce the correct stream of bytes for your text and send it to the browser along with the chosen charset.

The browser can then render the text correctly, since it has both the bytes and the charset needed to map them to characters.

How? By adding an explicit content type:

<%@page contentType="text/html;charset=UTF-8"%>

UTF-8 is one of the most widely adopted charsets and can represent every character defined by Unicode. It is a variable-length encoding, so a character may take one, two, three or four bytes.
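The variable length is easy to check with a few lines of Java. This is only a sketch; the class name and the sample characters are my own choice.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1 (plain ASCII letter)
        System.out.println(utf8Length("\u00E8"));       // 2 (è)
        System.out.println(utf8Length("\u4E2D"));       // 3 (the Chinese character U+4E2D)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, written as a surrogate pair)
    }
}
```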

Without going into the technical details, the variable length explains why, when there are charset problems, some characters show up as two “strange” characters. If you send out an “è” (e with grave accent) encoded as UTF-8, it produces two bytes. But if you don’t tell the receiver which charset you are using, the receiver assumes ISO-8859-1 and maps the two bytes to two separate (wrong) characters.
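The “è” case can be reproduced in a few lines. The following is a minimal sketch with hypothetical names: it encodes the text as UTF-8 and then, on purpose, decodes the bytes with the wrong charset, which is what a receiver assuming ISO-8859-1 does.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wrong = decodeUtf8BytesAsLatin1("\u00E8"); // "è" is the bytes C3 A8 in UTF-8
        System.out.println(wrong.length());                 // 2
        // Each byte became its own character: U+00C3 (Ã) and U+00A8 (¨).
        System.out.println(wrong.equals("\u00C3\u00A8"));   // true
    }
}
```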

This problem is not limited to web servers and browsers; it can show up in any communication channel that exchanges streams of bytes to be interpreted as characters. A less obvious example is the communication between your code and a database.

Getting data back

I talked about sending data to a browser. But what happens when the browser sends some data back to the server, as when a user submits a form?

First, the data sent is text, and it is converted to a stream of bytes. Hence we need a charset to decode that data correctly.

Fortunately, the HTTP protocol defines a content type (and charset) for the data sent by a browser too. Incredibly, no browser sends the charset.

Browsers simply encode the text using the same charset as the page. So if you serve a page in UTF-8 (the most widely used standard), they will send the form data back in UTF-8.

But your server cannot “remember” or know the original page encoding, and since the browser does not send it back, the server assumes ISO-8859-1.

We can see this clearly in the Tomcat source code, where parameters are extracted from a POST request.

String enc = getCharacterEncoding(); // no encoding is passed on by the browser, so it will be null
if (enc == null) parameters.setEncoding(DEFAULT_CHARACTER_ENCODING);

Searching the code we find that DEFAULT_CHARACTER_ENCODING is ISO-8859-1.

public static final String DEFAULT_CHARACTER_ENCODING="ISO-8859-1";

When the browser sends back UTF-8 encoded data, Tomcat will decode it in the wrong way.

“But it always worked for me”. Yes, if you only use English. All the characters used in English (no accents) have the same encoding in ISO-8859-1 and UTF-8. It’s a little piece of magic that avoids compatibility issues with old systems.
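A quick way to convince yourself is to compare the bytes produced by the two charsets. Again a sketch with names of my own choosing, not something taken from Tomcat:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatDemo {

    // True when the text encodes to exactly the same bytes in both charsets.
    static boolean sameBytes(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("Hello, Tomcat")); // true: plain English text
        System.out.println(sameBytes("perch\u00E8"));   // false: the final "è" differs
    }
}
```

This is why a form that only ever receives unaccented English text never shows the problem, whichever charset the server assumes.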

How to make Tomcat decode UTF-8

Since the browser doesn’t send back the encoding, we must force it with a filter. Note that filters work at the web application level, so we have to add the filter to each web application affected by this problem. The served pages must be UTF-8 too, otherwise the browser may send back data that is not UTF-8 encoded. Being consistent pays off.

Creating a filter is very simple. I’ll use the Servlet 2.x syntax; with version 3.x you can use an annotation to install the filter automatically (but since the filter order is extremely important in this case, it’s better to add it explicitly to web.xml).

import java.io.IOException;
import javax.servlet.*;

public class MyCharacterEncodingFilter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) throws ServletException {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        // Must run before anything reads the request parameters, otherwise it has no effect.
        request.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }
}

and in web.xml, before every other filter:

<filter>
    <description>Force the request character encoding to UTF-8</description>
    <filter-name>CharacterEncodingFilter</filter-name>
    <filter-class>MyCharacterEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>CharacterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
    <dispatcher>REQUEST</dispatcher>
</filter-mapping>

Conclusions

Working with all the world’s writing systems in mind is not easy, and most of these problems should probably have been solved at a lower level, without requiring tens of thousands of coders to add a filter to their web applications.

More about charsets

More about charsets can be read (must be read) here.
