Common problems with i18n and servlets/jsps

by Anton Tagunov

Hello, everybody! As a Java developer I have often hit problems outputting national characters from servlets/jsps and getting back form parameters that users enter in national characters. Let me share some of my experience in this field.

I'm sorry about this document being a mixture of a tutorial for newcomers to the jsp/servlet/i18n world (as it was originally designed) and a thorough investigation of the problems that I have found in this field, written for experts. Maybe later on these parts will be separated.

Terminology

Let us assume that you're going to implement a web page that has national characters in it. This can be Cyrillic, Japanese, Chinese or whatever. First you should decide what character encoding you're going to serve your page in. Sometimes a character encoding is also called a character set or just a charset. For example, the following character encodings could be used to output text in the named languages:

Language    character encoding (charset) IANA name
Chinese     Big5
Japanese    Shift_JIS
Russian     KOI8-R
Russian     windows-1251

Please mail me other encodings in common use for Chinese, Japanese and other languages; I am personally an expert only on Russian.

Setting character set for a web page (general)

The user's browser that receives your page should know the charset it is in. This is best done by issuing a proper Content-Type HTTP header. For example:

Content-Type: text/html; charset=Big5

(The alternative approach, specifying content type via

<meta http-equiv="Content-Type" content="text/html; charset=Big5">

is known to be less reliable and is not discussed in this article.)

The character encoding (charset) in the Content-Type header should be given as the IANA preferred name for the character encoding, as listed in the IANA registry.

For JDKs 1.3+ (and maybe earlier, please correct me) the IANA charset names may be used anywhere in the JDK where a function accepts a character encoding name as a parameter.

(They can actually be used with the String, java.io.InputStreamReader and java.io.OutputStreamWriter classes, which are capable of performing transformations between the internal 16-bit Unicode character representation of Java and a number of external encodings. See the brute force section for an example.)
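As an illustration, here is a small self-contained sketch (the class and method names are mine, not from any library) showing OutputStreamWriter and InputStreamReader performing such conversions with IANA charset names:

```java
import java.io.*;

public class EncodingDemo {

    /** Converts a Java string to bytes in the given encoding via OutputStreamWriter. */
    static byte[] toBytes(String s, String charsetName) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(buf, charsetName);
        w.write(s);
        w.close();
        return buf.toByteArray();
    }

    /** Converts bytes back to a Java string via InputStreamReader. */
    static String fromBytes(byte[] bytes, String charsetName) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName);
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "a\u0430"; // latin 'a' followed by cyrillic 'a'
        byte[] utf8 = toBytes(s, "UTF-8"); // gives 0x61 0xD0 0xB0
        // encoding names are not case sensitive:
        System.out.println(fromBytes(utf8, "utf-8").equals(s)); // prints "true"
    }
}
```

Note that decoding with the lowercase name "utf-8" round-trips the string just as well, illustrating the case-insensitivity mentioned below.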

(It looks like JDKs as early as JDK 1.1 did not fully support the IANA standardized preferred character encoding names, see my notes on this here.)

The character encoding names are not case sensitive.

Since character encoding information is conveyed in HTTP headers, it is often useful to view them. Here's a short note on how to do that.

Setting character set when writing a servlet

What you have to do if you're developing a servlet is

response.setContentType("text/html; charset=Big5"); /* this is the IANA name of the character set */
Writer out = response.getWriter(); /* we get a writer properly set up to convert from the internal Unicode string representation of Java to Big5 */

As you know, Java internally keeps characters in Unicode (each char is a 16-bit UTF-16 code unit), which gives different code ranges to Cyrillic, Japanese, Chinese characters and so on. In our example, if we out.write(s) a string s that contains Chinese characters, these characters will be properly converted to the bytes representing them in the Big5 character encoding. If in contrast we try to out.write() some characters not present in the Big5 encoding (like Cyrillic chars), these characters will be output as question marks (???).
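This question-mark substitution can be observed directly with String.getBytes(): a sketch, assuming the Big5 charset is available in your JRE (it is shipped with the standard extended charsets):

```java
public class ReplacementDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // the cyrillic capital letter U (\u0423) has no representation in Big5,
        // so the encoder substitutes its replacement byte, a question mark
        byte[] bytes = "\u0423".getBytes("Big5");
        System.out.println((char) (bytes[0] & 0xFF)); // prints "?"
    }
}
```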

There's a very important statement in the servlet specification about calling response.setContentType() that we'll need in our discussion later on: "The setContentType or setLocale method must be called before getWriter for the charset to affect the construction of the writer."

By the way, the spec says that there is one more way to set the character encoding of the web page generated by a servlet: calling response.setLocale(). I find that it does not work on Tomcat or Weblogic, do not find it useful anyway, and advise everybody against using it. (See the details here.)

Setting charset when writing a jsp page

jsp technology implies that every jsp page is translated into a servlet. The jsp page may have a contentType attribute in the page directive:

<%@ page contentType="text/html; charset=Big5" %>

This is a fragment of the servlet that Tomcat 4.0.1 generates from a jsp page with such directive:

JspWriter out = null;
  ...
  response.setContentType("text/html; charset=Big5");
    ...
    out = pageContext.getOut();

If the page directive does not have a contentType attribute, then the default character encoding is used:

response.setContentType("text/html;charset=ISO-8859-1");

So, by the contentType attribute of the page directive you control the Content-Type HTTP header and the setup of the writer, just as you would by calling response.setContentType() yourself in a servlet.

The contentType attribute of the page directive also tells the servlet engine what character set the source .jsp file is written in. So, if you have a

<%@ page contentType="text/html; charset=Big5" %>

directive this implies that you have to save your source .jsp file in the Big5 character encoding. And if you have a

<%@ page contentType="text/html; charset=utf-8" %>

this implies that you have to author your pages in utf-8.

Setting character set dynamically

Selecting a character encoding (alias character set) dynamically at run time is very straightforward if you're writing a servlet, and very subtle if you're writing a jsp.

There is no problem when writing a servlet: you just compute the charset value on the fly and then call response.setContentType():

import javax.servlet.http.*;
import javax.servlet.*;
import java.io.*;

public class S1 extends HttpServlet{

   protected void doGet(HttpServletRequest request,
     HttpServletResponse response) throws ServletException, IOException{
     String charset = "windows-1251";
     response.setContentType("text/html; charset=" + charset );
     Writer out = response.getWriter();
     out.write("\u0423\u0440\u0430!");
     out.close();
   }
}

When writing a jsp the situation gets much worse. You can still code a jsp page that does the same thing as the servlet above, like this:

<%@ page buffer="16kb" %>
<%
      String charset = "windows-1251";
      response.setContentType( "text/html; charset=" + charset ); %>
<%="\u0423\u0440\u0430!"%>

But this code is extremely servlet engine dependent:

Tomcat 3.3             ok
Tomcat 4.0.1           ok
Bea Weblogic 6.0sp1    failure

"failure" means that when the page outputs the cyrillic characters (\u0423\u0440\u0430) they come out as question marks.

If you test this on other servlet engines, please mail me the results.

Why is this code servlet engine dependent? Let us look at the .java file generated by Tomcat 4.0.1:

  response.setContentType("text/html;charset=ISO-8859-1");
  pageContext = _jspxFactory.getPageContext(this, request, response,
     "", true, 16384, true);
  ...
  out = pageContext.getOut();
  out.write("\r\n");
  String charset = "windows-1251";
  response.setContentType( "text/html; charset=" + charset );
  out.print("\u0423\u0440\u0430!");
  out.write("\r\n");

As I have noted earlier, response.setContentType() should be called before response.getWriter().

The difference between Tomcat and Weblogic is that

Tomcat, if the buffer is not "none", really calls response.getWriter() only when flushing the buffer for the first time. So everyone is free to call response.setContentType() until the buffer is first flushed.

At the same time it looks like (I can't tell for sure, as the Weblogic sources are not available) the Weblogic servlet-and-jsp engine really calls response.getWriter() right when the out object is constructed, that is, before any user code in scriptlets, custom tags or beans has a chance to run. So on Weblogic 6.0sp1 dynamic charset switching in a jsp looks quite impossible.

Similar effects arise on Tomcat if we have buffer="none".

Please consider the following (working on Tomcat) example:

<% response.setContentType( "text/html; charset=windows-1251" ); %>
<%@ page buffer="none" %>
<%="\u0423\u0440\u0430!"%>

This works okay: normal windows-1251 one-byte codes, not question marks, are generated for the cyrillic letters.

Here's another (failing on Tomcat) example:

<%@ page buffer="none" %>
<% response.setContentType( "text/html; charset=windows-1251" ); %>
<%="\u0423\u0440\u0430!"%>

This example fails: before the code in the scriptlet runs, the

  out.write("\r\n");

line of code gets executed (it is the carriage return after the page directive, translated into the servlet code). And as buffering is off, emitting anything to the out object causes response.getWriter() to be called immediately, thus making the later call to response.setContentType() useless.

Scriptlets have been chosen for simplicity to demonstrate the character encoding effects; all the same applies to code that runs from custom tags in tag libraries.

Can we build a taglib that dynamically switches character encoding in jsp pages?

This discussion shows that it is possible to write a tag library that would dynamically switch the character encoding in a jsp page. The behaviour of this tag library would depend on the servlet engine, though.

It is known that it will work on Tomcats of the 3.x and 4.x families with those jsp pages that have buffering turned on.
It is known that it will not work on Bea Weblogic 6.0sp1, no matter whether buffering is on or off.
Please mail the author data on whether such a taglib works on other servlet engines; the information will be put on this page.

If we wanted to make this taglib work on Tomcat but with buffering turned off (for any reason, say for speed optimization), then we would meet a certain obstacle. For this taglib to work, its tag has to be invoked before anything is written to the out object. To achieve this we have two options:

option 1: put all the <%@ page ..%> and <%@ taglib ..%> directives without any spaces between them, followed immediately by the taglib tag that performs the necessary operation, all on one line. Any occasional space between these will ruin the operation of the jsp.

option 2: put all these directives on separate lines, but "glue" them with jsp comments (or empty scriptlets). Here's a sample:

<%@ page buffer="none" %><%--
--%><%@ taglib uri="www.smth.org/our/magic/taglib" prefix="magic" %><%
%><magic:doit some-param="some-value" />

If one wants a completely portable solution for choosing the character encoding of a jsp page, it seems that the only reasonable option is to have multiple copies of the jsp page, one for each character encoding - language pair. This approach has an additional benefit: all language dependent data is already put inside the page, not looked up at runtime (say, from resource bundles). It looks quite possible to develop, for example, an Ant task that would automate generation of such multiple jsp pages from a single source and a resource bundle.

A real solution would be to update the jsp spec, but this is a whole new story. What is necessary is one of the following:

  • Having an ability to run some user code before the out object is constructed. In fact this code could even dynamically determine other <%@ page %> directive parameters like the buffer size, the error page url and so on.
  • Requiring all compliant implementations to be modelled after Tomcat -- that is, to really call response.getWriter() only when the buffer is first flushed (if buffering is on) or when any data is first written to the out object (if buffering is off). An additional useful feature would be to ignore all whitespace in the jsp page that comes before any meaningful data (taglib tag invocations, non-whitespace template data and scriptlets).

Maybe discuss this part of this article in the appropriate JSR group?

Getting form parameters in national encodings (general)

When a web developer has solved the problem of delivering localized content to the user, he/she hits the next problem: it is necessary to deliver user input done in national characters back to the server, that is, to correctly process user-submitted form parameters.

Let us see how the browser sends national chars to the server. I will use an example where the form is submitted via method "GET"; submitting a form via method "POST" does not make any difference, except that the same character sequences come in the POST body rather than as part of the GET query string.

The browser generally does the following: it takes the user input in national characters, then

  • translates it to a byte sequence using the character encoding of the web page that contains the form
  • encodes the resulting byte sequence into the query string according to the usual rules of encoding query strings; that is, all bytes that correspond to legal ascii alphanumeric chars are encoded as those chars, and all the rest are converted to the %xy representation, where xy is the hexadecimal code of the corresponding byte (like %C1, for example)

Then the encoded query (possibly containing %xy codes) is sent to the server. ascii characters, according to the procedure described above, are sent to the server as they are, provided that they have the same codes both in the ascii character encoding and in the national character encoding being used.

This is often the case; here is a short list of encodings that follow this rule (please mail me additions and exceptions to this list to make this page more complete):

KOI8-R
windows-1251
UTF-8

To make this clearer let us study an example:

Let us assume you have the following form in your web page a.jsp:

<FORM METHOD="GET" ACTION="b.jsp">
  <INPUT TYPE="TEXT" NAME="n">
  <INPUT TYPE="SUBMIT">
</FORM>

Let us assume that you enter two characters in the text field: a latin letter 'a' and a cyrillic letter 'a'. (If you do not have cyrillics support, but have support for some other national language in your web browser, you can repeat this test with that language, using character encodings for the test web page that are applicable to that language.)

This is what you will get depending on the character encoding that page a.jsp has been marked with (for details on the character encoding for a page see the beginning of this document):

character encoding of a.jsp    result of submitting the query
KOI8-R                         b.jsp?n=a%C1
windows-1251                   b.jsp?n=a%E0
UTF-8                          b.jsp?n=a%D0%B0

Note that the latin letter 'a' stands for itself in the query. This is due to the fact that its code is the same in ascii, windows-1251, KOI8-R and UTF-8.

The encoding of the cyrillic letter 'a' in the query string depends on the character encoding of the page from which the form has been submitted, that is, of a.jsp. As you can see from this example,

Cyrillic letter 'a' is encoded as    in character encoding
0xC1                                 KOI8-R
0xE0                                 windows-1251
0xD0 0xB0                            UTF-8
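These byte sequences can be reproduced with java.net.URLEncoder, whose two-argument encode() overload (available since JDK 1.4) applies exactly the query-string encoding rules described above. A small sketch (the class name is mine):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String input = "a\u0430"; // latin 'a' followed by cyrillic 'a'
        // the latin 'a' passes through unchanged; the cyrillic 'a' becomes %xy codes
        System.out.println(URLEncoder.encode(input, "KOI8-R"));       // prints a%C1
        System.out.println(URLEncoder.encode(input, "windows-1251")); // prints a%E0
        System.out.println(URLEncoder.encode(input, "UTF-8"));        // prints a%D0%B0
    }
}
```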

The author's experience shows that all this holds true even for wap browsers in modern cellular phones: the pages for these devices should always be in the UTF-8 character encoding if they contain national characters. And the wap phones we tested that had cyrillics support properly returned data in the UTF-8 encoding, just like in the examples above.

Now that the way browsers send national characters back to the web server is clear, let us see what we can do to properly decode the parameters at the server side.

As you might have noted, there is a large similarity in the way data travels from and to a web server. (You may see some examples of this here.) But unfortunately there is one difference: while the web server tells the browser what character encoding the page it sends is in (via the Content-Type HTTP header), the client does not send such information.

According to the HTTP spec, the HTTP request that the browser sends to the server (containing the submitted form) may well contain the Content-Type header too. This would give the server the key to decode the form parameters. Regretfully, our present internet browsers do not send it.

So this is what generally happens if we do not set up our servlet engine in a special way and do not write any extra code in our servlets/jsps. Assume that we have b.jsp as follows:

<%@ page contentType="text/html; charset=the-same-as-for-a.jsp" %>
<HTML>
  <BODY>
    <%String n = request.getParameter( "n" );%>
    n=<%= n %><br />
    code=<%=Integer.toString( (int) n.charAt(1), 16) %>
    <%if (n.length()>2){
        out.print( Integer.toString( (int)request.getParameter("n").charAt(2), 16) );
    } %>
  </BODY>
</HTML>

What we would expect here is to get back the characters we typed in: the latin 'a' and the cyrillic 'a'.

What we actually see is that the servlet engine takes every %xy component of the query, interprets it as the code of a Latin-1 character and puts it into the string:

character encoding of a.jsp and b.jsp   query string generated   unicode characters put into the string by request.getParameter("n")   what the string looks like if output back to the browser
KOI8-R                                  ?n=a%C1                  0x61 0xC1                                                             "aÁ" or "a?"
windows-1251                            ?n=a%E0                  0x61 0xE0                                                             "aà" or "a?"
UTF-8                                   ?n=a%D0%B0               0x61 0xD0 0xB0                                                        "aа"

We see that the ascii char (latin 'a') is decoded okay, but the code(s) representing the cyrillic 'a' are misinterpreted. This happens because the servlet engine assumes that the query string contains Latin-1 coded parameter values. When this is not the case, we should use our knowledge of the actual encoding of the parameters to decode them correctly. There are several ways to do that.

Decoding request parameters by the method of "brute force"

There is always a Servlet Container independent way of decoding parameters that I call the "method of brute force":

<%@ page contentType="text/html; charset=windows-1251" %>
<HTML>
  <BODY>
    <%
      String n1 = request.getParameter( "n" );
      String n = new String(
         n1.getBytes( "ISO-8859-1" ),
         "windows-1251"
      );
    %>
    n=<%= n %>
  </BODY>
</HTML>

This may fail on JDK 1.1, see the remarks
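The round trip can be demonstrated outside a servlet container with plain JDK classes. In this sketch (the class and method names are mine) the string "a\u00E0" simulates what the container hands us after mis-decoding ?n=a%E0 as Latin-1:

```java
import java.io.UnsupportedEncodingException;

public class BruteForceDemo {

    /** Re-decodes a string the container decoded as Latin-1, using the real charset. */
    static String fix(String asLatin1, String realCharset)
            throws UnsupportedEncodingException {
        // recover the original bytes, then decode them with the correct encoding
        return new String(asLatin1.getBytes("ISO-8859-1"), realCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // what request.getParameter("n") returns for ?n=a%E0: bytes 0x61 0xE0 read as Latin-1
        String wrong = "a\u00E0";
        String right = fix(wrong, "windows-1251");
        System.out.println(right.equals("a\u0430")); // prints "true": latin 'a' + cyrillic 'a'
    }
}
```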

It very often happens that the character encoding of the form parameters, that is, the character encoding of the page from which the form was submitted, is the same as the character encoding of the current page. So a utility class can be developed that would do something like the following (sure enough, this function should be invoked after response.setContentType() has been called to set the character encoding of the response):

public class Util{
    public String getParameter(
        HttpServletRequest request,
        HttpServletResponse response,
        String parameterName){
        return new String(
            request.getParameter( parameterName ).getBytes("ISO8859_1"),
            response.getCharacterEncoding() /***/
        );
    }
}

Don't be confused by the ISO8859_1 character encoding name. This is what JDK 1.1 used to understand, see the remarks. The author suspects that, for the same reasons as given in the remark, the line of code marked with /***/ may cause trouble: response.getCharacterEncoding() returns "the charset used for the MIME body" (see the spec), that is, the IANA preferred charset name (see rfc 2045), while the JDK 1.1 constructor String(byte[] bytes, String enc) accepts a JDK charset name and may refuse to accept it.

The strong point of this approach (converting to a byte array and reassembling back into a string) is that it will work on every servlet container.

The weak point of this approach is that while decoding every parameter we create an extra Java byte array and an extra String, which is a waste of resources.

Decoding request parameters under Servlet 2.3 containers

Recognizing the need for such functionality, the servlet spec writers have introduced a new method in the Servlet 2.3 API:

public void setCharacterEncoding(java.lang.String env)
                  throws java.io.UnsupportedEncodingException;

As the spec says, "This method must be called prior to reading request parameters or reading input using getReader()." That is, prior to the first request.getParameter().

As it may often happen that the character encoding of the form parameters, that is, the character encoding of the page from which the form was submitted, is the same as the character encoding of the current page, it may be worthwhile to add the following line of code to the beginning of each jsp page (or do the same thing in the servlet):

<% request.setCharacterEncoding( response.getCharacterEncoding() ); %>

or have some tag in some taglib that would perform the same function. (This would be a cleaner approach.) Is it worth considering including this functionality in the jakarta taglibs i18n tag library?

Of course this line of code will only work if the character encoding of the response has already been set, either as the effect of having the contentType attribute on the <%@ page %> directive, or by calling response.setContentType() directly (see the beginning of this document for the details and subtleties of doing this).

This approach (calling request.setCharacterEncoding()) looks like the best way to solve the problem of decoding form parameters. Its only weakness is that a great many Servlet 2.2 servlet engines are currently in operation, and this approach is not for them.
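Under Servlet 2.3 this call can also be centralized in a filter, so that individual pages need not repeat it. The following is a minimal sketch of such a filter (the class name and the "encoding" init-parameter are my own conventions, not part of any spec; a similar example filter ships with Tomcat):

```java
import java.io.IOException;
import javax.servlet.*;

public class SetCharacterEncodingFilter implements Filter {

    private String encoding;

    public void init(FilterConfig config) throws ServletException {
        // the "encoding" init-param name is this sketch's convention,
        // configured in web.xml alongside the filter mapping
        encoding = config.getInitParameter("encoding");
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // must run before the first request.getParameter() call;
        // do not override an encoding the client actually declared
        if (encoding != null && request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }
        chain.doFilter(request, response);
    }

    public void destroy() {
    }
}
```

The filter is then mapped in web.xml to the url patterns whose forms are known to arrive in the given encoding.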

Decoding request parameters under Servlet 2.2 containers

There are a great many servlet containers from different vendors. The problem of decoding form parameters coded with national character encodings has been recognized by many vendors, and various proprietary solutions to it exist. These solutions are specific to the container they are implemented in. (Here we need to speak only about Servlet 2.2 containers, as for Servlet 2.3 containers the problem has been solved as described in the previous section.)

Tomcat 3.3. There is a special Interceptor in Tomcat 3.3 called DecodeInterceptor. (Interceptors, alias modules, are configurable pieces of software that constitute Tomcat 3.3. These pieces of software participate in request processing. The interceptors/modules are configured in Tomcat 3.3 via the conf/server.xml file and possibly other xml configuration files.)

The Decode interceptor is documented in the Tomcat 3.3 docs. With this interceptor it is possible to:

  • Specify the default encoding
    • on per-context (that is per-application) basis
    • for all applications running on this instance of Tomcat
  • if there is a user session, store the encoding of the last page emitted in the session and use it
  • extract the encoding name from the request:
    • from a special request parameter (charset=UTF-8)
    • from the tail of the request uri (the uri is of the form http://localhost:8080/myapp/index.jsp;charset=UTF-8; the ;charset=UTF-8 part is removed from the uri and not visible to normal request processing). Remark: interesting though it is, it would be worth testing whether this works together with url-rewriting based session handling.

Refer to the documentation for further details.

BEA Weblogic 6.0sp1. Special context parameters described in the web.xml web application descriptor switch the charset for decoding request parameters. For example, the following parameter description:

     <context-param>
         <param-name>weblogic.httpd.inputCharset./rus/*</param-name>
         <param-value>windows-1251</param-value>
     </context-param>

orders the weblogic server to decode the parameters of all requests submitted to urls matching the /rus/* pattern as being encoded with the windows-1251 encoding.

If someone needs to port an application that uses these features to Tomcat, it should be possible to write an Interceptor that would mimic this Weblogic behaviour.

It would be rather interesting to gather here information on how similar things can be done on other servlet 2.2 engines, so additions are welcome :-).

Issues arising with file upload libraries

Interesting issues arise when uploading files. (Some details on file upload are prepared here).

The files are uploaded with HTTP requests that have Content-Type: multipart/form-data. These requests are capable of conveying both files and textual form fields. As none of the current servlet/jsp specs provide any special api for processing such requests, various third-party libraries are used to do this.

The packages I know about that do this are:

Package name   License type
maybeupload    BSD-style, very relaxed

I would like to collect some more information on popular packages that do this job, so this is another place where I hope to get some feedback.

All these packages call request.getInputStream() and process the body of the POST request passed to the servlet/jsp. Thus the request.getParameter() mechanism is turned off, and textual field values, if they were conveyed in the request, should be retrieved in a way specific to the file uploading package.

The maybeupload package, for example, provides a Servlet 2.2 wrapper for the request object with its own versions of the getParameter() and getParameterValues() methods.

As we have a new point where the binary representation of form field data is converted to Java strings, we have a new point to worry about the proper character encoding being used.

To provide proper national character decoding, such third-party file upload libraries should have an api for passing them the character encoding.

TODO: add more file uploading libraries to this section, and find out whether they have an api for specifying the character encoding. As usual, help is highly welcome ;-).


Useful links

1 details on MIME messages: rfc2045
2 IANA character set names registry
3 Sun Microsystems specifications for servlets: servlet specs
4 Sun Microsystems specifications for jsp: jsp specs
5 Tomcat 3.3 documentation: tomcat 3.3 docs
6 Jakarta taglibs project home page, where you can get pre-release versions of the i18n tag library: Jakarta taglibs
7 maybeupload home page: maybeupload
Any kind of feedback, all corrections, additions and opinions are highly welcome and greatly appreciated.
sincerely yours
"Anton Tagunov" <tagunov@motor.ru>
http://tagunov.newmail.ru
;-)