Another syntax oddity (not mentioned here) that breaks most highlighters: in Java, Unicode escapes can appear anywhere, not just in strings. For example, the following is a valid class:
class Foo\u007b}
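and this assert will not trigger, because the \u000A in the comment is translated to a line feed that ends the comment, leaving the ! to negate the expression on the next line: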
assert
// String literals can have unicode escapes like \u000A!
"Hello World".equals("\u00E4");
I also argue that failing to syntax highlight this correctly is a security issue. You can terminate block comments with Unicode escapes, so if you wanted to hide some malicious code in a Java source file, you just need an excuse for there to be a block of Unicode escapes in a comment. A dev who doesn’t know about this quirk is likely to just skip over it, assuming it’s commented out.
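To make the comment trick concrete, here is a minimal sketch. The class and method names (CommentEscape, run) are made up for illustration. The escapes \u002a\u002f and \u002f\u002a are translated to */ and /* before the lexer ever looks for comments, so the assignment between them is live code:

```java
// Hypothetical demo class; the names are illustrative only.
class CommentEscape {

    static int run() {
        int value = 1;
        /* Looks commented out: \u002a\u002f value = 42; \u002f\u002a still looks like a comment */
        return value;
    }

    public static void main(String[] args) {
        // The assignment hidden in the "comment" above has executed.
        System.out.println(run()); // prints 42
    }
}
```

A highlighter that colours the whole line as a comment is showing you something the compiler does not agree with.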
I once wrote a puzzle using this, which (fortunately) doesn't work any more, but would do interesting things on older JDK versions: https://pastebin.com/raw/Bh81PwXY
InputCharacter:
    UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter is defined as the following in section 3.3:
UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter
UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
    u {u}
HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
RawInputCharacter:
    any Unicode character
As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:
public class Bar {
    public static void \u006d\u0061\u0069\u006e(String[] args) {
        System.out.println("hello, world");
    }
}
Here is the output:
$ javac Bar.java && java Bar
hello, world
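However, this is not a valid Java program, because \u6d is a malformed Unicode escape (exactly four hex digits are required), and it is rejected even though it sits inside a comment: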
public class Baz {
    // This comment contains \u6d.
    public static void main(String[] args) {
        System.out.println("hello, world");
    }
}
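The grammar and the two programs above can be mimicked by a small translation pass. This is only a sketch of the idea, not how javac is actually implemented. The class name UnicodeEscapeTranslator and its methods are hypothetical, and the backslash rule used here (a backslash starts an escape only when an even number of raw backslashes immediately precede it) is my simplified reading of section 3.3:

```java
// Hypothetical sketch of the Unicode-escape translation step; not javac's code.
final class UnicodeEscapeTranslator {

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Replaces every eligible escape (backslash, one or more 'u', four hex digits)
    // with the character it denotes; throws if an escape is malformed.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int precedingBackslashes = 0; // raw backslashes directly before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\'
                    && precedingBackslashes % 2 == 0
                    && i + 1 < raw.length()
                    && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // the grammar allows any number of 'u' markers
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("malformed Unicode escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            out.append((char) value);
            precedingBackslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("class Foo\\u007b}"));                   // prints class Foo{}
        System.out.println(translate("void \\u006d\\u0061\\u0069\\u006e()")); // prints void main()
    }
}
```

Feeding it the comment line from Baz throws, which mirrors why the compiler rejects that file: the translation happens before anything knows it is looking at a comment.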
Javac uses the platform encoding [0] by default to interpret Java source files. This means that Java source files are inherently non-portable. When Java was first developed (and for a long time after), this was the default situation for any kind of plain-text file. The escape sequence syntax makes it possible to transform [1] Java source code into a portable (that is, ASCII-only) representation that is completely equivalent to the original, and also to convert it back to any platform encoding.
Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.
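As a rough illustration of the ASCII-only direction of that transformation, here is a sketch. The class AsciiEscaper and its toAscii method are hypothetical names, and it ignores edge cases such as source text that already contains backslash sequences which would need extra care:

```java
// Hypothetical sketch: rewrite source text so that it contains only ASCII.
final class AsciiEscaper {

    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape.
    // Supplementary characters become two escapes, one per surrogate.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below holds the seven characters s = "ä" once compiled.
        String line = "s = \"\u00e4\"";
        System.out.println(toAscii(line));
    }
}
```

The output is plain ASCII that compiles to the same program under any platform encoding, which is the portability property described above.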
[0] Since JDK 18, the default encoding is UTF-8. This probably also extends to javac, though I haven’t verified it.