Another syntax oddity (not mentioned here) that breaks most highlighters: In Jav...

mistercow · on Nov 2, 2024

I also argue that failing to syntax highlight this correctly is a security issue. You can terminate block comments with Unicode escapes, so if you wanted to hide some malicious code in a Java source file, you just need an excuse for there to be a block of Unicode escapes in a comment. A dev who doesn’t know about this quirk is likely to just skip over it, assuming it’s commented out.
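
A minimal sketch of the trick, for illustration only. The class name `HiddenComment` and the comment wording are made up; the only real ingredients are the escapes `\u002a` (`*`) and `\u002f` (`/`), which the compiler translates before it looks for the end of the comment:

```java
final class HiddenComment {
    static int run() {
        int counter = 0;
        // The two escapes on the next line decode to '*' and '/', closing the
        // block comment early, so the increment that follows is live code.
        /* lookup table: \u002a\u002f counter++; /* end of table */
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 1
    }
}
```

A highlighter that does not translate the escapes first would paint the whole line as a comment, even though `counter++` runs.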

styglian · on Nov 3, 2024

I once wrote a puzzle using this, which (fortunately) doesn't work any more, but would do interesting things on older JDK versions: https://pastebin.com/raw/Bh81PwXY

ivanjermakov · on Nov 2, 2024

I have never seen this in Java! Are there any use cases where it could be useful?

susam · on Nov 2, 2024

I don't know about usefulness but it does let us write identifiers using Unicode characters. For example:

  public class Foo {
      public static void main(String[] args) {
          double \u03c0 = 3.14159265;
          System.out.println("\u03c0 = " + \u03c0);
      }
  }

Output:

  $ javac Foo.java && java Foo
  π = 3.14159265

Of course, nowadays we can simply write this with any decent editor:

  public class Foo {
      public static void main(String[] args) {
          double π = 3.14159265;
          System.out.println("π = " + π);
      }
  }

Support for Unicode escape sequences is a result of how the Java Language Specification (JLS) defines InputCharacter. Quoting from Section 3.4 of JLS <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:

  InputCharacter:
    UnicodeInputCharacter but not CR or LF

UnicodeInputCharacter is defined as the following in section 3.3:

  UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

  UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

  UnicodeMarker:
    u {u}

  HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

  RawInputCharacter:
    any Unicode character

As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:

  public class Bar {
      public static void \u006d\u0061\u0069\u006e(String[] args) {
          System.out.println("hello, world");
      }
  }

Here is the output:

  $ javac Bar.java && java Bar
  hello, world

However, this is an incorrect Java program:

  public class Baz {
      // This comment contains \u6d.
      public static void main(String[] args) {
          System.out.println("hello, world");
      }
  }

Here is the error:

  $ javac Baz.java
  Baz.java:2: error: illegal unicode escape
      // This comment contains \u6d.
                                   ^
  1 error

Yes, this is an error even if the illegal Unicode escape sequence occurs in a comment!
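
To make the grammar above concrete, here is a rough sketch of the translation step as a standalone pre-pass over the source text. This is not how javac is implemented; the class `UnicodeEscapeSketch` and its methods are hypothetical names, and the error message merely imitates the compiler's. It also follows the JLS 3.3 rule that a backslash can start an escape only when it is preceded by an even number of contiguous backslashes:

```java
final class UnicodeEscapeSketch {
    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Replaces every eligible backslash-u escape with the UTF-16 code unit it names.
    static String translate(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int backslashRun = 0; // raw backslashes immediately before position i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean eligible = c == '\\' && backslashRun % 2 == 0;
            if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                // UnicodeMarker: one or more 'u' characters.
                while (j < source.length() && source.charAt(j) == 'u') j++;
                if (j + 4 > source.length()) {
                    throw new IllegalArgumentException("illegal unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(source.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("illegal unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                // The produced character is not rescanned for further escapes.
                out.append((char) value);
                i = j + 4;
                backslashRun = 0;
                continue;
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("void \\u006d\\u0061\\u0069\\u006e()")); // prints: void main()
        try {
            translate("// This comment contains \\u6d.");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the pass runs over the raw text with no notion of tokens, it rejects the malformed escape in the comment just as readily as it decodes the method name.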

ivanjermakov · on Nov 2, 2024

I wonder if the full Unicode range was accepted because some companies write code in languages other than English.

layer8 · on Nov 2, 2024

Javac uses the platform encoding [0] by default to interpret Java source files. This means that Java source files are inherently non-portable. When Java was first developed (and for a long time after), this was the default situation for any kind of plain text file. The escape sequence syntax makes it possible to transform [1] Java source code into a portable (that is, ASCII-only) representation that is completely equivalent to the original, and also to convert it back to any platform encoding.

Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.
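
A rough sketch of the forward direction of that transformation, assuming the source has already been decoded into a Java `String`. This is not the code of the tool linked in [1]; `AsciiEscapeSketch` and `toAscii` are hypothetical names, and the sketch ignores edge cases such as a non-ASCII character that directly follows a backslash:

```java
import java.util.Locale;

final class AsciiEscapeSketch {
    // Rewrites every non-ASCII UTF-16 code unit as a backslash-u escape.
    // Supplementary characters come out as two escapes (a surrogate pair).
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format(Locale.ROOT, "\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pi = String.valueOf((char) 0x03c0); // Greek small letter pi
        System.out.println(toAscii("double " + pi + " = 3.14159265;"));
    }
}
```

Running `main` should print the declaration with the identifier spelled as `\u03c0`, which is the same ASCII-only form used in the `Foo` example earlier in the thread.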

[0] Since JDK 18 (JEP 400), the default charset is UTF-8. This also applies to javac when no -encoding option is given.

[1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...