
In Zig, strings are always arrays of bytes [1]. It's the functions that work on the bytes that need to know about encoding... so for base64 they just use the bytes directly: https://github.com/oven-sh/bun/blob/main/src/base64/base64.z...

[1] https://ziglang.org/documentation/master/#String-Literals-an...



Curious, does that affect the complexity of string concatenation? As far as I remember, V8 "uses" ropes, so string concatenation is constant time, not O(n) like in Java, which saves a lot of headaches.


So Zig took the C/C++ approach to strings?


Not quite, Zig strings are not zero-terminated like C strings but are array slices (language-builtin ptr/length pairs). C++ now has std::string_view, which is similar, but as a stdlib feature, which makes it more awkward to use.
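For readers less familiar with the ptr/length idea, here is a rough C++17 sketch using std::string_view as the closest analogue of a slice. The helper name first_bytes is made up for illustration; it only shows that taking a prefix neither copies the bytes nor writes a terminator.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the first `n` bytes of `s` as a view.
// Nothing is copied and no '\0' is written; substr clamps `n` to s.size().
constexpr std::string_view first_bytes(std::string_view s, std::size_t n) {
    return s.substr(0, n);
}

// A view is just (pointer, length), so the prefix shares storage with `whole`.
constexpr std::string_view whole = "hello world";
constexpr std::string_view prefix = first_bytes(whole, 5);

static_assert(prefix == "hello");
static_assert(prefix.size() == 5);
// The byte right after the prefix is the space from the original, not a terminator.
static_assert(whole[prefix.size()] == ' ');
```

This is also why handing prefix.data() to a C API that expects a zero-terminated string would read past the intended end.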

Zig also has a concept of "sentinel terminated arrays/slices" which allows easy interop with C APIs for string data, but the details and implications go a bit too far for a comment :)


C++'s std::string might as well be syntactic sugar on top of a std::vector<char> (due to significant similarities in implementation and API), with null-termination by default for C interop. In general, in C++ you shouldn't be passing around null-terminated strings unless you're working with some legacy C API, at least without also passing the size of the string (unless you really like calling strlen, like that GTA5 code [0]), at which point you might as well pass const std::string& or std::string_view (which can be constructed transparently from a std::string).

Why do you say that std::string_view is more awkward to use as a result of being part of the stdlib?

[0] https://hackertimes.com/item?id=26296339


> Why do you say that std::string_view is more awkward to use as a result of being part of the stdlib?

As far as I'm aware, the type of string literals in C++ is still a "raw" char pointer and not a string view. Also few libraries actually make use of std::string_view, while in Zig everything is built around strings as slices, from the language to the stdlib to 3rd party libs (easy to do of course in a new language ecosystem).


The first part is true for ‘normal’ string literals, but there have been suffixed string literals since, I think, C++17: "foo"sv is a std::string_view if you have operator""sv from std::string_view_literals visible.
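A small sketch of the suffix in use, assuming a C++17 compiler. The helper byte_count is a made-up name; it is only there to show that the sv literal carries its length, while a plain literal passed as a pointer is measured up to the first NUL.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

using namespace std::string_view_literals;  // makes operator""sv visible

// Hypothetical helper: reports how many bytes the view covers.
constexpr std::size_t byte_count(std::string_view s) { return s.size(); }

// The suffixed literal has type std::string_view.
static_assert(std::is_same_v<decltype("foo"sv), std::string_view>);
static_assert(byte_count("foo"sv) == 3);

// The length comes from the literal itself, so an embedded NUL is kept.
static_assert(byte_count("a\0b"sv) == 3);
```

Calling byte_count("a\0b") without the suffix goes through the const char* constructor, which stops at the first NUL and reports 1.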


> As far as I'm aware, the type of string literals in C++ is still a "raw" char pointer and not a string view.

You can define a constexpr std::string_view with a literal if you so choose. And if you pass a string literal to a function that accepts a string_view, then the compiler has enough information to (and typically will) construct the string_view in constant time. (A sibling commenter also points out that you can use the sv suffix: https://en.cppreference.com/w/cpp/string/basic_string_view/o...)

(Meanwhile, if you pass a string literal to a function that accepts const std::string&, then that will be a linear-time operation, as it copies that data since std::string owns its data. But with any amount of indirection, you'll end up with an implicit strlen call. So this is absolutely a pitfall.)
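To make the copy visible, here is a minimal sketch with two made-up helpers, aliases_view and aliases_string, that report whether the parameter still points at the caller's bytes. It goes through a const char* on purpose, so both calls pay for a length scan; only the std::string version also copies.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: true if the view points at the caller's original bytes.
bool aliases_view(std::string_view s, const char* original) {
    return s.data() == original;
}

// Hypothetical helper: same check for a std::string parameter. When called
// with a const char*, a temporary std::string is built, which owns a copy.
bool aliases_string(const std::string& s, const char* original) {
    return s.data() == original;
}
```

With const char* lit = "hello", aliases_view(lit, lit) is true, while aliases_string(lit, lit) is false because the temporary string has its own storage.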

> few libraries actually make use of std::string_view

This is a fair point, as it's a relatively new addition to the language and by no means mandatory.


Technically, C++ string literals are not pointers but arrays. Arrays can decay to pointers though.
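The array-versus-pointer distinction can be checked at compile time. A short sketch, assuming C++17; literal_length is a made-up helper that works only because the array type, including its size, survives until decay.

```cpp
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue of type "array of N const char",
// where N counts the terminating '\0'.
static_assert(std::is_same_v<decltype("foo"), const char (&)[4]>);
static_assert(sizeof("foo") == 4);

// Decay turns the array into a pointer and drops the size information.
static_assert(std::is_same_v<std::decay_t<decltype("foo")>, const char*>);

// Hypothetical helper: binds to the array by reference, so the length is
// known at compile time without any strlen call.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) {
    return N - 1;  // exclude the terminator
}

static_assert(literal_length("foo") == 3);
```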


It's the most sensible approach, especially for a low-level language. By the point you start caring about the meaning of the string contents you almost always need to deal with grapheme clusters and a whole lot more Unicode bs. Meanwhile many use cases only care about passing the string along or concatenating or replacing substrings, all of which can be done at the byte level with a sensible encoding like UTF-8.

Codepoint-level string abstractions in particular are complete nonsense that only serve to give you the illusion of making things easier before you learn the hard way that Unicode is more complicated than that. This also goes for UTF-16, which is only a cope extension of UCS-2 for those who already made this mistake before additionally realizing that 2 bytes are not enough to encode all human languages.

Now you might think that declaring that all your strings are UTF-8 wouldn't have any of these problems and is the way to go... until you find out that there are strings you can't represent as (valid) UTF-8, including things that are almost UTF-8 like filenames and other OS-provided data under most POSIX operating systems. This also applies to UTF-16 under Windows btw.


Sure, it's a low-level language, but my concern is that it doesn't even enforce the encoding? Yes, UTF-8 source code prevents string literals from being invalid, but any input from users/sockets/files could be invalid UTF-8. Basically, there's no distinction between a "string" with a known length and a byte array with a known length, which is the same mistake C and C++ make. Conceptually, strings and byte arrays are different things, even if they can be represented the same way, and a type system could enforce that.

And yes, UTF-16, used by Java and C#/.NET is a pain-point as it forces conversion from sockets/files from UTF-8 into UTF-16 so they can be used, then another one when writing back. But that's beside the point when talking about Zig.


Treating UTF-8 as a different and optional view on a type-agnostic bag of bytes is a feature, not a bug ;) (just look at the mess that Python3 made of that topic) Most of the time, data is just passed around without looking at or caring about its content, and for that it doesn't matter if the data is binary bytes, ascii, code-page encoded, shift-jis, utf-8 or any other format, it only matters at the endpoint when "opening the box". UTF-8 encoding/decoding/validation is handled by the stdlib in Zig, not by the language.


Of course it doesn't enforce encoding, it needs to be able to handle invalid data, bad encoding etc. if it wants to be a system language. You can trivially enforce it's valid utf-8 if that's what your application requires by using the `@import("std").unicode` package.
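To show what "validate at the boundary" amounts to, here is a rough C++ sketch of a well-formedness check over a plain byte view. It is not the Zig stdlib code; the function name is_valid_utf8 is made up, and it assumes the usual rules: no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical validator: true if `bytes` is well-formed UTF-8.
bool is_valid_utf8(std::string_view bytes) {
    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr unsigned int min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) { i += 1; continue; }  // ASCII fast path
        std::size_t len = 0;
        unsigned int cp = 0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i < len) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

The bytes stay an untyped view the whole time; the check is something the application opts into when it actually needs text.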


On an unrelated note, "string" etymologically just means byte array. Whenever I develop a language, or sketch a relevant data type, I use "Text"


Postgres gets this right by making SQL TEXT work efficiently.


That's a good thing, actually. If your language restricts strings to valid Unicode, you lose the ability to do things like open files whose paths contain invalid Unicode characters.


No, you just lose the ability to treat filenames as strings. They need to be their own type, but you can still open them.


> replacing substrings, all of which can be done at the byte level

You actually cannot implement substring replacement at the byte level with Unicode, just think about what happens if there is a modifier right after the substring in the original text. You cannot just avoid the fact that Unicode (and human writing in general) is a mess.
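A concrete case, sketched in C++ with a made-up helper replace_bytes: the input spells "café" in decomposed form, i.e. 'e' followed by U+0301 (combining acute accent, bytes CC 81 in UTF-8). A byte-level replace of "e" matches the base letter and leaves the accent behind, which then attaches to the replacement.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: naive replace-all on raw bytes. It knows nothing
// about code points, combining marks, or grapheme clusters.
std::string replace_bytes(std::string text, std::string_view from,
                          std::string_view to) {
    if (from.empty()) return text;  // avoid looping forever on an empty needle
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted bytes
    }
    return text;
}
```

Here replace_bytes("cafe\xCC\x81", "e", "a") yields "cafa\xCC\x81": still valid UTF-8, but the reader now sees an accented "a" that was never in the replacement string.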



