Common problems with i18n and servlets/jsps

by Anton Tagunov

Hello, everybody! As a Java developer I have often hit problems outputting national characters from servlets/jsps and getting back form parameters that users enter in national characters. Let me share some of my experience in this field.

I'm sorry about this document being a mixture of a tutorial for newcomers to the jsp/servlet/i18n world (as it was originally designed) and a thorough investigation of the problems that I have found in this field, written for experts. Maybe later on these parts will be separated.

Terminology

Let us assume that you're going to implement a web page that has national characters in it. This can be Cyrillic, Japanese, Chinese or whatever. First you should decide what character encoding you're going to serve your page in. Sometimes a character encoding is also called a character set or just a charset. For example, the following character encodings could be used to output text in the named languages:

Language    character encoding (charset) IANA name
Chinese     Big5
Japanese    Shift_JIS
Russian     KOI8-R
Russian     windows-1251

Please mail me other encodings in common use for Chinese, Japanese and other languages; I am personally an expert only on Russian.

Setting character set for a web page (general)

The user's browser that receives your page should know the charset it is in. This is best done by issuing a proper Content-Type HTTP header. For example:

Content-Type: text/html; charset=Big5

(The alternative approach, specifying content type via

<meta http-equiv="Content-Type" content="text/html; charset=Big5">

is known to be less reliable and is not discussed in this article.)

The character encoding (charset) in the Content-Type header should be given as the IANA preferred name for the character encoding, as listed in the IANA registry.

For JDKs 1.3+ (and maybe earlier, please correct me) the IANA charset names may be used anywhere in the JDK where a function accepts a character encoding name as a parameter.

(They can actually be used with the String, java.io.InputStreamReader and java.io.OutputStreamWriter classes, which are capable of performing transformations between the internal 16-bit Unicode character representation of Java and a number of external encodings. See the brute force section for an example.)
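As an illustration, here is a small self-contained sketch (the class and method names are mine, not from any library) showing OutputStreamWriter and InputStreamReader performing such conversions with IANA charset names:

```java
import java.io.*;

public class EncodingDemo {

    /** Converts a Java string to bytes in the given encoding via OutputStreamWriter. */
    static byte[] toBytes(String s, String charsetName) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(buf, charsetName);
        w.write(s);
        w.close();
        return buf.toByteArray();
    }

    /** Converts bytes back to a Java string via InputStreamReader. */
    static String fromBytes(byte[] bytes, String charsetName) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName);
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "a\u0430"; // latin 'a' followed by cyrillic 'a'
        byte[] utf8 = toBytes(s, "UTF-8"); // gives 0x61 0xD0 0xB0
        // encoding names are not case sensitive:
        System.out.println(fromBytes(utf8, "utf-8").equals(s)); // prints "true"
    }
}
```

Note that decoding with the lowercase name "utf-8" round-trips the string just as well, illustrating the case-insensitivity mentioned below.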

(It looks like JDKs as early as JDK 1.1 did not fully support the IANA standardized preferred character encoding names, see my notes on this here.)

The character encoding names are not case sensitive.

Since character encoding information is conveyed in HTTP headers, it is often useful to view them. Here's a short note on how to do that.

Setting character set when writing a servlet

What you have to do if you're developing a servlet is

response.setContentType("text/html; charset=Big5"); /* this is the IANA name of the character set */
Writer out = response.getWriter(); /* we get a writer properly set up to convert from the internal Unicode string representation of Java to Big5 */

As you know, Java internally keeps characters in Unicode (each char is a 16-bit UTF-16 code unit), which gives different code ranges to Cyrillic, Japanese, Chinese characters and so on. In our example, if we out.write(s) a string s that contains Chinese characters, these characters will be properly converted to the bytes representing them in the Big5 character encoding. If in contrast we try to out.write() some characters not present in the Big5 encoding (like Cyrillic chars), these characters will be output as question marks (???).
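This question-mark substitution can be observed directly with String.getBytes(): a sketch, assuming the Big5 charset is available in your JRE (it is shipped with the standard extended charsets):

```java
public class ReplacementDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // the cyrillic capital letter U (\u0423) has no representation in Big5,
        // so the encoder substitutes its replacement byte, a question mark
        byte[] bytes = "\u0423".getBytes("Big5");
        System.out.println((char) (bytes[0] & 0xFF)); // prints "?"
    }
}
```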

There's a very important statement in the servlet specification about calling response.setContentType() that we'll need in our discussion later on: "The setContentType or setLocale method must be called before getWriter for the charset to affect the construction of the writer."

By the way, the spec says that there is one more way to set the character encoding of the web page generated by a servlet: calling response.setLocale(). I find that it does not work on Tomcat or Weblogic, do not find it useful anyway, and advise everybody against using it. (See the details here.)

Setting charset when writing a jsp page

jsp technology implies that every jsp page is translated into a servlet. The jsp page may have a contentType attribute in the page directive:

<%@ page contentType="text/html; charset=Big5" %>

This is a fragment of the servlet that Tomcat 4.0.1 generates from a jsp page with such directive:

JspWriter out = null;
  ...
  response.setContentType("text/html; charset=Big5");
    ...
    out = pageContext.getOut();

If the page directive does not have a contentType attribute, then the default character encoding is used:

response.setContentType("text/html;charset=ISO-8859-1");

So, by the contentType attribute of the page directive you control the Content-Type HTTP header and the setup of the writer, just as you would by calling response.setContentType() yourself in a servlet.

The contentType attribute of the page directive also tells the servlet engine what character set the source .jsp file is written in. So, if you have a

<%@ page contentType="text/html; charset=Big5" %>

directive this implies that you have to save your source .jsp file in the Big5 character encoding. And if you have a

<%@ page contentType="text/html; charset=utf-8" %>

this implies that you have to author your pages in utf-8.

Setting character set dynamically

Selecting a character encoding (alias character set) dynamically at run time is very straightforward if you're writing a servlet, and very subtle if you're writing a jsp.

There is no problem when writing a servlet: you just compute the charset value on the fly and then call response.setContentType():

import javax.servlet.http.*;
import javax.servlet.*;
import java.io.*;

public class S1 extends HttpServlet{

   protected void doGet(HttpServletRequest request,
     HttpServletResponse response) throws ServletException, IOException{
     String charset = "windows-1251";
     response.setContentType("text/html; charset=" + charset );
     Writer out = response.getWriter();
     out.write("\u0423\u0440\u0430!");
     out.close();
   }
}

When writing a jsp the situation gets much worse. You can still code a jsp page that does the same thing as the servlet above, like this:

<%@ page buffer="16kb" %>
<%
      String charset = "windows-1251";
      response.setContentType( "text/html; charset=" + charset ); %>
<%="\u0423\u0440\u0430!"%>

But this code is extremely servlet engine dependent:

Tomcat 3.3             ok
Tomcat 4.0.1           ok
Bea Weblogic 6.0sp1    failure

"failure" means that when the page outputs the cyrillic characters (\u0423\u0440\u0430) they come out as question marks.

If you test this on other servlet engines, please mail me the results.

Why is this code servlet engine dependent? Let us look at the .java file generated by Tomcat 4.0.1:

  response.setContentType("text/html;charset=ISO-8859-1");
  pageContext = _jspxFactory.getPageContext(this, request, response,
     "", true, 16384, true);
  ...
  out = pageContext.getOut();
  out.write("\r\n");
  String charset = "windows-1251";
  response.setContentType( "text/html; charset=" + charset );
  out.print("\u0423\u0440\u0430!");
  out.write("\r\n");

As I have noted earlier, response.setContentType() should be called before response.getWriter().

The difference between Tomcat and Weblogic is that

Tomcat, if the buffer is not "none", really calls response.getWriter() only when flushing the buffer for the first time. So everyone is free to call response.setContentType() until the buffer is first flushed.

At the same time it looks like (I can't tell for sure, as the Weblogic sources are not available) the Weblogic servlet-and-jsp engine really calls response.getWriter() right when the out object is constructed, that is, before any user code in scriptlets, custom tags or beans has a chance to run. So on Weblogic 6.0sp1 dynamic charset switching in a jsp looks quite impossible.

Similar effects arise on Tomcat if we have buffer="none".

Please consider the following (working on Tomcat) example:

<% response.setContentType( "text/html; charset=windows-1251" ); %>
<%@ page buffer="none" %>
<%="\u0423\u0440\u0430!"%>

This works okay: normal windows-1251 one-byte codes, not question marks, are generated for the cyrillic letters.

Here's another (failing on Tomcat) example:

<%@ page buffer="none" %>
<% response.setContentType( "text/html; charset=windows-1251" ); %>
<%="\u0423\u0440\u0430!"%>

This example fails: before the code in the scriptlet runs, the

  out.write("\r\n");

line of code gets executed (it is the carriage return after the page directive, translated into the servlet code). And as buffering is off, emitting anything to the out object causes response.getWriter() to be called immediately, thus making the later call to response.setContentType() useless.

Scriptlets have been chosen for simplicity to demonstrate the character encoding effects; all the same applies to code that runs from custom tags in tag libraries.

Can we build a taglib that dynamically switches character encoding in jsp pages?

This discussion shows that it is possible to write a tag library that would dynamically switch the character encoding in a jsp page. The behaviour of this tag library would depend on the servlet engine, though.

It is known that it will work on Tomcats of the 3.x and 4.x families with those jsp pages that have buffering turned on.
It is known that it will not work on Bea Weblogic 6.0sp1, no matter whether buffering is on or off.
Please mail the author data on whether such a taglib works on other servlet engines; the information will be put on this page.

If we wanted to make this taglib work on Tomcat but with buffering turned off (for any reason, say for speed optimization), then we would meet a certain obstacle. For this taglib to work, its tag has to be invoked before anything is written to the out object. To achieve this we have two options:

option 1: put all the <%@ page ..%> and <%@ taglib ..%> directives without any spaces between them, followed immediately by the taglib tag that performs the necessary operation, all on one line. Any occasional space between these will ruin the operation of the jsp.

option 2: put all these directives on separate lines, but "glue" them with jsp comments (or empty scriptlets). Here's a sample:

<%@ page buffer="none" %><%--
--%><%@ taglib uri="www.smth.org/our/magic/taglib" prefix="magic" %><%
%><magic:doit some-param="some-value" />

If one wants a completely portable solution for choosing the character encoding of a jsp page, it seems that the only reasonable option is to have multiple copies of the jsp page, one for each character encoding - language pair. This approach has an additional benefit: all language dependent data is already put inside the page, not looked up at runtime (say, from resource bundles). It looks quite possible to develop, for example, an Ant task that would automate generation of such multiple jsp pages from a single source and a resource bundle.

A real solution would be to update the jsp spec, but this is a whole new story. What is necessary is one of the following:

  • Having an ability to run some user code before the out object is constructed. In fact this code could even dynamically determine other <%@ page %> directive parameters like the buffer size, the error page url and so on.
  • Requiring all compliant implementations to be modelled after Tomcat -- that is, to really call response.getWriter() only when the buffer is first flushed (if buffering is on) or when any data is first written to the out object (if buffering is off). An additional useful feature would be to ignore all whitespace in the jsp page that comes before any meaningful data (taglib tag invocations, non-whitespace template data and scriptlets).

Maybe discuss this part of this article in the appropriate JSR group?

Getting form parameters in national encodings (general)

When a web developer has solved the problem of delivering localized content to the user, he/she hits the next problem: it is necessary to deliver user input done in national characters back to the server, that is, to correctly process user-submitted form parameters.

Let us see how the browser sends national chars to the server. I will use an example where the form is submitted via method "GET"; submitting a form via method "POST" does not make any difference, except that the same character sequences come in the POST body rather than as part of the GET query string.

The browser generally does the following: it takes the user input in national characters, then

  • translates it to a byte sequence using the character encoding of the web page that contains the form
  • encodes the resulting byte sequence into the query string according to the usual rules of encoding query strings; that is, all bytes that correspond to legal ascii alphanumeric chars are encoded as those chars, and all the rest are converted to the %xy representation, where xy is the hexadecimal code of the corresponding byte (like %C1, for example)

Then the encoded query (possibly containing %xy codes) is sent to the server. ascii characters, according to the procedure described above, are sent to the server as they are, provided that they have the same codes both in the ascii character encoding and in the national character encoding being used.

This is often the case; here is a short list of encodings that follow this rule (please mail me additions and exceptions to this list to make this page more complete):

KOI8-R
windows-1251
UTF-8

To make this clearer let us study an example:

Let us assume you have the following form in your web page a.jsp:

<FORM METHOD="GET" ACTION="b.jsp">
  <INPUT TYPE="TEXT" NAME="n">
  <INPUT TYPE="SUBMIT">
</FORM>

Let us assume that you enter two characters in the text field: a latin letter 'a' and a cyrillic letter 'a'. (If you do not have cyrillics support, but have support for some other national language in your web browser, you can repeat this test with that language, using character encodings for the test web page that are applicable to that language.)

This is what you will get depending on the character encoding that page a.jsp has been marked with (for details on the character encoding for a page see the beginning of this document):

character encoding of a.jsp    result of submitting the query
KOI8-R                         b.jsp?n=a%C1
windows-1251                   b.jsp?n=a%E0
UTF-8                          b.jsp?n=a%D0%B0

Note that the latin letter 'a' stands for itself in the query. This is due to the fact that its code is the same in ascii, windows-1251, KOI8-R and UTF-8.

The encoding of the cyrillic letter 'a' in the query string depends on the character encoding of the page from which the form has been submitted, that is, of a.jsp. As you can see from this example,

Cyrillic letter 'a' is encoded as    in character encoding
0xC1                                 KOI8-R
0xE0                                 windows-1251
0xD0 0xB0                            UTF-8
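These byte sequences can be reproduced with java.net.URLEncoder, whose two-argument encode() overload (available since JDK 1.4) applies exactly the query-string encoding rules described above. A small sketch (the class name is mine):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String input = "a\u0430"; // latin 'a' followed by cyrillic 'a'
        // the latin 'a' passes through unchanged; the cyrillic 'a' becomes %xy codes
        System.out.println(URLEncoder.encode(input, "KOI8-R"));       // prints a%C1
        System.out.println(URLEncoder.encode(input, "windows-1251")); // prints a%E0
        System.out.println(URLEncoder.encode(input, "UTF-8"));        // prints a%D0%B0
    }
}
```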

The author's experience shows that all this holds true even for wap browsers in modern cellular phones: the pages for these devices should always be in the UTF-8 character encoding if they contain national characters. And the wap phones we tested that had cyrillics support properly returned data in the UTF-8 encoding, just like in the examples above.

Now that the way browsers send national characters back to the web server is clear, let us see what we can do to properly decode the parameters at the server side.

As you might have noted, there is a large similarity in the way data travels from and to a web server. (You may see some examples of this here.) But unfortunately there is one difference: while the web server tells the browser what character encoding the page it sends is in (via the Content-Type HTTP header), the client does not send such information.

According to the HTTP spec, the HTTP request that the browser sends to the server (containing the submitted form) may well contain the Content-Type header too. This would give the server the key to decode the form parameters. Regretfully, our present internet browsers do not send it.

So this is what generally happens if we do not set up our servlet engine in a special way and do not write any extra code in our servlets/jsps. Assume that we have b.jsp as follows:

<%@ page contentType="text/html; charset=the-same-as-for-a.jsp" %>
<HTML>
  <BODY>
    <%String n = request.getParameter( "n" );%>
    n=<%= n %><br />
    code=<%=Integer.toString( (int) n.charAt(1), 16) %>
    <%if (n.length()>2){
        out.print( Integer.toString( (int)request.getParameter("n").charAt(2), 16) );
    } %>
  </BODY>
</HTML>

What we would expect here is to get back the characters we typed in: the latin 'a' and the cyrillic 'a'.

What we actually see is that the servlet engine takes every %xy component of the query, interprets it as the code of a Latin-1 character and puts it into the string:

character encoding of a.jsp and b.jsp   query string generated   unicode characters put into the string by request.getParameter("n")   what the string looks like if output back to the browser
KOI8-R                                  ?n=a%C1                  0x61 0xC1                                                             "aÁ" or "a?"
windows-1251                            ?n=a%E0                  0x61 0xE0                                                             "aà" or "a?"
UTF-8                                   ?n=a%D0%B0               0x61 0xD0 0xB0                                                        "aа"

We see that the ascii char (latin 'a') is decoded okay, but the code(s) representing the cyrillic 'a' are misinterpreted. This happens because the servlet engine assumes that the query string contains Latin-1 coded parameter values. When this is not the case, we should use our knowledge of the actual encoding of the parameters to decode them correctly. There are several ways to do that.

Decoding request parameters by the method of "brute force"

There is always a Servlet Container independent way of decoding parameters that I call the "method of brute force":

<%@ page contentType="text/html; charset=windows-1251" %>
<HTML>
  <BODY>
    <%
      String n1 = request.getParameter( "n" );
      String n = new String(
         n1.getBytes( "ISO-8859-1" ),
         "windows-1251"
      );
    %>
    n=<%= n %>
  </BODY>
</HTML>

This may fail on JDK 1.1, see the remarks
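The round trip can be demonstrated outside a servlet container with plain JDK classes. In this sketch (the class and method names are mine) the string "a\u00E0" simulates what the container hands us after mis-decoding ?n=a%E0 as Latin-1:

```java
import java.io.UnsupportedEncodingException;

public class BruteForceDemo {

    /** Re-decodes a string the container decoded as Latin-1, using the real charset. */
    static String fix(String asLatin1, String realCharset)
            throws UnsupportedEncodingException {
        // recover the original bytes, then decode them with the correct encoding
        return new String(asLatin1.getBytes("ISO-8859-1"), realCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // what request.getParameter("n") returns for ?n=a%E0: bytes 0x61 0xE0 read as Latin-1
        String wrong = "a\u00E0";
        String right = fix(wrong, "windows-1251");
        System.out.println(right.equals("a\u0430")); // prints "true": latin 'a' + cyrillic 'a'
    }
}
```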

It very often happens that the character encoding of the form parameters, that is, the character encoding of the page from which the form was submitted, is the same as the character encoding of the current page. So a utility class can be developed that would do something like the following (sure enough, this function should be invoked after response.setContentType() has been called to set the character encoding of the response):

public class Util{
    public String getParameter(
        HttpServletRequest request,
        HttpServletResponse response,
        String parameterName){
        return new String(
            request.getParameter( parameterName ).getBytes("ISO8859_1"),
            response.getCharacterEncoding() /***/
        );
    }
}

Don't be confused by the ISO8859_1 character encoding name. This is what JDK 1.1 used to understand, see the remarks. The author suspects that, for the same reasons as given in the remark, the line of code marked with /***/ may cause trouble: response.getCharacterEncoding() returns "the charset used for the MIME body" (see the spec), that is, the IANA preferred charset name (see rfc 2045), while the JDK 1.1 constructor String(byte[] bytes, String enc) accepts a JDK charset name and may refuse to accept it.

The strong point of this approach (converting to a byte array and reassembling back into a string) is that it will work on every servlet container.

The weak point of this approach is that while decoding every parameter we create an extra Java byte array and an extra String, which is a waste of resources.

Decoding request parameters under Servlet 2.3 containers

Recognizing the need for such functionality, the servlet spec writers have introduced a new method in the Servlet 2.3 API:

public void setCharacterEncoding(java.lang.String env)
                  throws java.io.UnsupportedEncodingException;

As the spec says, "This method must be called prior to reading request parameters or reading input using getReader()." That is, prior to the first request.getParameter().

As it may often happen that the character encoding of the form parameters, that is, the character encoding of the page from which the form was submitted, is the same as the character encoding of the current page, it may be worthwhile to add the following line of code to the beginning of each jsp page (or do the same thing in the servlet):

<% request.setCharacterEncoding( response.getCharacterEncoding() ); %>

or have some tag in some taglib that would perform the same function. (This would be a cleaner approach.) Is it worth considering including this functionality in the jakarta taglibs i18n tag library?

Of course this line of code will only work if the character encoding of the response has already been set, either as the effect of having the contentType attribute on the <%@ page %> directive, or by calling response.setContentType() directly (see the beginning of this document for the details and subtleties of doing this).

This approach (calling request.setCharacterEncoding()) looks like the best way to solve the problem of decoding form parameters. Its only weakness is that a great many Servlet 2.2 servlet engines are currently in operation, and this approach is not for them.
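Under Servlet 2.3 this call can also be centralized in a filter, so that individual pages need not repeat it. The following is a minimal sketch of such a filter (the class name and the "encoding" init-parameter are my own conventions, not part of any spec; a similar example filter ships with Tomcat):

```java
import java.io.IOException;
import javax.servlet.*;

public class SetCharacterEncodingFilter implements Filter {

    private String encoding;

    public void init(FilterConfig config) throws ServletException {
        // the "encoding" init-param name is this sketch's convention,
        // configured in web.xml alongside the filter mapping
        encoding = config.getInitParameter("encoding");
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // must run before the first request.getParameter() call;
        // do not override an encoding the client actually declared
        if (encoding != null && request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }
        chain.doFilter(request, response);
    }

    public void destroy() {
    }
}
```

The filter is then mapped in web.xml to the url patterns whose forms are known to arrive in the given encoding.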

Decoding request parameters under Servlet 2.2 containers

There are a great many servlet containers from different vendors. The problem of decoding form parameters coded with national character encodings has been recognized by many vendors, and various proprietary solutions to it exist. These solutions are specific to the container they are implemented in. (Here we need to speak only about Servlet 2.2 containers, as for Servlet 2.3 containers the problem has been solved as described in the previous section.)

Tomcat 3.3. There is a special Interceptor in Tomcat 3.3 called DecodeInterceptor. (Interceptors, alias modules, are configurable pieces of software that constitute Tomcat 3.3. These pieces of software participate in request processing. The interceptors/modules are configured in Tomcat 3.3 via the conf/server.xml file and possibly other xml configuration files.)

The Decode interceptor is documented in the Tomcat 3.3 docs. With this interceptor it is possible to:

  • Specify the default encoding
    • on per-context (that is per-application) basis
    • for all applications running on this instance of Tomcat
  • if there is a user session, store the encoding of the last page emitted in the session and use it
  • extract the encoding name from the request:
    • from a special request parameter (charset=UTF-8)
    • from the tail of the request uri (the uri is of the form http://localhost:8080/myapp/index.jsp;charset=UTF-8; the ;charset=UTF-8 part is removed from the uri and not visible to normal request processing). Remark: interesting though it is, it would be worth testing whether this works together with url-rewriting based session handling.

Refer to the documentation for further details.

BEA Weblogic 6.0sp1. Special context parameters described in the web.xml web application descriptor switch the charset for decoding request parameters. For example, the following parameter description:

     <context-param>
         <param-name>weblogic.httpd.inputCharset./rus/*</param-name>
         <param-value>windows-1251</param-value>
     </context-param>

orders the weblogic server to decode the parameters of all requests submitted to urls matching the /rus/* pattern as being encoded with the windows-1251 encoding.

If someone needs to port an application that uses these features to Tomcat, it should be possible to write an Interceptor that would mimic this Weblogic behaviour.

It would be rather interesting to gather here information on how similar things can be done on other servlet 2.2 engines, so additions are welcome :-).

Issues arising with file upload libraries

Interesting issues arise when uploading files. (Some details on file upload are prepared here).

The files are uploaded with HTTP requests that have Content-Type: multipart/form-data. These requests are capable of conveying both files and textual form fields. As none of the current servlet/jsp specs provide any special api for processing such requests, various third-party libraries are used to do this.

The packages I know about that do this are:

Package name   License type
maybeupload    BSD-style, very relaxed

I would like to collect some more information on popular packages that do this job, so this is another place where I hope to get some feedback.

All these packages call request.getInputStream() and process the body of the POST request passed to the servlet/jsp. Thus the request.getParameter() mechanism is turned off, and textual field values, if they were conveyed in the request, should be retrieved in a way specific to the file uploading package.

The maybeupload package, for example, provides a Servlet 2.2 wrapper for the request object with its own versions of the getParameter() and getParameterValues() methods.

As we have a new point where the binary representation of form field data is converted to Java strings, we have a new point to worry about the proper character encoding being used.

To provide proper national character decoding, such third-party file upload libraries should have an api for passing them the character encoding.

TODO: add more file uploading libraries to this section, and find out whether they have an api for specifying the character encoding. As usual, help is highly welcome ;-).


Useful links

1 details on MIME messages: rfc2045
2 IANA character set names registry
3 Sun Microsystems specifications for servlets: servlet specs
4 Sun Microsystems specifications for jsp: jsp specs
5 Tomcat 3.3 documentation: tomcat 3.3 docs
6 Jakarta taglibs project home page, where you can get pre-release versions of the i18n tag library: Jakarta taglibs
7 maybeupload home page: maybeupload
Any kind of feedback, all corrections, additions and opinions are highly welcome and greatly appreciated.
sincerely yours
"Anton Tagunov" <tagunov@motor.ru>
http://tagunov.newmail.ru
;-)