ChatGPT解决这个技术问题 Extra ChatGPT

If strings are immutable in .NET, then why does Substring take O(n) time?

Given that strings are immutable in .NET, I'm wondering why they have been designed such that string.Substring() takes O(substring.Length) time, instead of O(1)?

i.e. what were the tradeoffs, if any?

@Mehrdad: I like this question. Could you please tell me how we can determine O() of a given function in .Net? Is it clear or we should calculate it? Thank you
@odiseh: Sometimes (like in this case) it's clear that the string is being copied. If it isn't, then you could either look in the documentation, perform benchmarks, or try to look in the .NET Framework source code to figure out what it is.

C
Callum Watkins

UPDATE: I liked this question so much, I just blogged it. See Strings, immutability and persistence

The short answer is: O(n) is O(1) if n does not grow large. Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant.

The long answer is:

An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. Strings in .NET are immutable; your question is essentially "why are they not persistent"?

Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. The expense and difficulty of building a complex persistent data structure doesn't pay for itself.

People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.

The substrings extracted are typically small in size and short in lifetime; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. So using a persistent strategy that encourages reuse of most of the memory is also not a win; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers.

If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day; odds are good that you do not. The few who do build their own persistent data structures that closely match their usage scenarios.

For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.


It would be interesting to contrast how Java does (or at least did at some point in the past) it: Substring returns a new string, but pointing at the same char[] as the larger string - that means that the larger char[] can no longer be garbage collected until the substring goes out of scope. I prefer .net's implementation by far.
I've seen this sort of code quite a bit: string contents = File.ReadAllText(filename); foreach (string line in content.Split("\n")) ... or other versions of it. I mean read an entire file, then process the various parts. That sort of code would be considerably faster and require less memory if a string was persistent; you would always have exactly one copy of the file in memory instead of copying each line, then the parts of each line as your process it. However, like Eric said - that is not the typical use case.
@configurator: Also, in .NET 4 the File.ReadLines method breaks up a text file into lines for you, without having to read it all into memory first.
@Michael: Java's String is implemented as a persistent data structure (that's not specified in the standards, but all implementations I know of do this).
Short answer: A copy of the data is made to allow garbage collection of the original string.
a
abelenky

Precisely because Strings are immutable, .Substring must make a copy of at least a portion of the original string. Making a copy of n bytes should take O(n) time.

How do you think you would copy a bunch of bytes in constant time?

EDIT: Mehrdad suggests not copying the string at all, but keeping a reference to a piece of it.

Consider in .Net, a multi-megabyte string, on which someone calls .SubString(n, n+3) (for any n in the middle of the string).

Now, the ENTIRE string cannot be Garbage Collected just because one reference is holding on to 4 characters? That seems like a ridiculous waste of space.

Further, tracking references to substrings (which may even be inside substrings), and trying to copy at optimal times to avoid defeating the GC (as described above), makes the concept a nightmare. It is far simpler, and more reliable, to copy on .SubString, and maintain the straightforward immutable model.

EDIT: Here's a good little read about the danger of keeping references to substrings within larger strings.


+1: Exactly my thoughts. Internally it probably uses memcpy which is still O(n).
@abelenky: I guess maybe by not copying it at all? It's already there, why should you have to copy it?
@Mehrdad: IF you are after performance. Just go unsafe in this case. Then you can get a char* substring.
@Mehrdad - you might be expecting too much there, it's called StringBuilder, and it's good a building strings. It's not called StringMultiPurposeManipulator
@SamuelNeff, @Mehrdad: Strings in .NET are not NULL terminated. As explained in Lippert's post, the first 4 bytes contain the length of the string. That's why, as Skeet points out, they can contain \0 characters.
P
Paŭlo Ebermann

Java (as opposed to .NET) provides two ways of doing Substring(), you can consider whether you want to keep just a reference or copy a whole substring to a new memory location.

The simple .substring(...) shares the internally used char array with the original String object, which you then with new String(...) can copy to a new array, if needed (to avoid hindering garbage collection of the original one).

I think this kind of flexibility is a best option for a developer.


You call it "flexibility" I call it "A way to accidentally insert a hard to diagnose bug (or a performance problem) into the software because I didn't realize I have to stop and think about all the places this code can possibly be called from (including those that would only be invented in the next version) just to get 4 characters from the middle of a string"
downvote retracted... After a bit more careful browsing of code it does look like a substring in java references a shared array, at least in the openjdk version. And if you want to ensure a new string there's a way to do that.
@Nir: I call it "status quo bias". To you the Java way of doing it seems fraught with risks and the .Net way the only sensbile choice. To Java programmers, the opposite is the case.
I strongly prefer .NET, but this sounds like one thing Java got right. It is useful that a developer be allowed to have access to a truly O(1) Substring method (without rolling your own string type, which would hinder interoperability with every other library, and wouldn't be as efficient as a built-in solution). Java's solution is probably inefficient though (requiring at least two heap objects, one for the original string and another for the substring); languages that support slices effectively replace the second object with a pair of pointers on the stack.
Since JDK 7u6 it's not true anymore - now Java always copies String contents for each .substring(...).
C
Community

Java used to reference larger strings, but:

Java changed its behavior to copying as well, to avoid leaking memory.

I feel like it can be improved though: why not just do the copying conditionally?

If the substring is at least half the size of the parent, one can reference the parent. Otherwise one can just make a copy. This avoids leaking a lot of memory while still providing a significant benefit.


Always copying allows you to remove the internal array. Halves the number of heap allocations, saving memory in the common case of short strings. It also means you don't need to jump through an additional indirection for each character access.
I think the important thing to take from this is that Java actually changed from using the same base char[] (with different pointers to the start and end) to creating a new String. This clearly shows that the cost-benefit analysis must show a preference for the creation of a new String.
b
bartonjs

None of the answers here addressed "the bracketing problem", which is to say that strings in .NET are represented as a combination of a BStr (the length stored in memory "before" the pointer) and a CStr (the string ends in a '\0').

The string "Hello there" is thus represented as

0B 00 00 00 48 00 65 00 6C 00 6F 00 20 00 74 00 68 00 65 00 72 00 65 00 00 00

(if assigned to a char* in a fixed-statement the pointer would point to the 0x48.)

This structure allows for fast lookup of the length of a string (useful in many contexts) and allows for the pointer to be passed in a P/Invoke to Win32 (or other) APIs which expect a null-terminated string.

When you do Substring(0, 5) the "oh, but I promised there'd be a null-character after the last character" rule says you need to make a copy. Even if you got the substring at the end then there'd be no place to put the length without corrupting the other variables.

Sometimes, though, you really do want to talk about "the middle of the string", and you don't necessarily care about the P/Invoke behavior. The recently added ReadOnlySpan<T> structure can be used to get a no-copy substring:

string s = "Hello there";
ReadOnlySpan<char> hello = s.AsSpan(0, 5);
ReadOnlySpan<char> ell = hello.Slice(1, 3);

The ReadOnlySpan<char> "substring" stores the length independently, and it does not guarantee that there's a '\0' after the end of the value. It can be used in many ways "like a string", but it is not "a string" since it doesn't have either BStr or CStr characteristics (much less both of them). If you never (directly) P/Invoke then there's not much of a difference (unless the API you want to call doesn't have a ReadOnlySpan<char> overload).

ReadOnlySpan<char> cannot be used as the field of a reference type, so there's also ReadOnlyMemory<char> (s.AsMemory(0, 5)), which is an indirect way of having a ReadOnlySpan<char>, so the same differences-from-string exist.

Some of the answers/comments on previous answers talked about it being wasteful to have the garbage collector have to keep a million-character string around while you continue to talk about 5 characters. That is precisely the behavior you can get with the ReadOnlySpan<char> approach. If you're just doing short computations, the ReadOnlySpan approach is probably better. If you need to persist it for a while and you're going to keep only a small percentage of the original string, doing a proper substring (to trim off the excess data) is probably better. There's a transition point somewhere in the middle, but it depends on your specific usage.


The 48 00 65 00 6C 00 6F 00 20 00 74 00 68 00 65 00 72 00 65 00 part only has one 6C 00 so it's actually "Helo there" instead of "Hello there" 😉