
Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as such:

U+1F469 WOMAN,
U+200D ZWJ,
U+1F469 WOMAN,
U+200D ZWJ,
U+1F467 GIRL,
U+200D ZWJ,
U+1F466 BOY

So it's very interestingly-encoded; the perfect target for a unit test. However, Swift doesn't seem to know how to treat it. Here's what I mean:

"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦") // true
"👩‍👩‍👧‍👦".contains("👩") // false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧") // false
"👩‍👩‍👧‍👦".contains("👦") // true

So, Swift says it contains itself (good) and a boy (good!). But it then says it does not contain a woman, girl, or zero-width joiner. What's happening here? Why does Swift know it contains a boy but not a woman or girl? I could understand if it treated it as a single character and only recognized it containing itself, but the fact that it got one subcomponent and no others baffles me.

This does not change if I use something like "👩".characters.first!.

Even more confounding is this:

let manual = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
Array(manual.characters) // ["👩‍", "👩‍", "👧‍", "👦"]

Even though I placed the ZWJs in there, they aren't reflected in the character array. What followed was a little telling:

manual.contains("👩") // false
manual.contains("👧") // false
manual.contains("👦") // true

So I get the same behavior with the character array... which is supremely annoying, since I know what the array looks like.

This also does not change if I use something like "👩".characters.first!.

Fixed in Swift 4. "👩‍👩‍👧‍👦".contains("\u{200D}") still returns false, not sure if that's a bug or feature.
Yikes. Unicode has ruined text. It's turned plain text into a markup language.
@Boann yes and no... a lot of these changes were put in to make en/decoding things like Hangul Jamo (255 codepoints) not an absolute nightmare like it was for Kanji (13,108 codepoints) and Chinese Ideographs (199,528 codepoints). Of course, it's more complicated and interesting than the length of an SO comment could allow, so I encourage you to check it out yourself :D

xoudini

This has to do with how the String type works in Swift, and how the contains(_:) method works.

The '👩‍👩‍👧‍👦' is what's known as an emoji sequence, which is rendered as one visible character in a string. The sequence is made up of Character objects, and at the same time it is made up of UnicodeScalar objects.

If you check the character count of the string, you'll see that it is made up of four characters, while if you check the unicode scalar count, it will show you a different result:

print("👩‍👩‍👧‍👦".characters.count)     // 4
print("👩‍👩‍👧‍👦".unicodeScalars.count) // 7

Now, if you iterate over the characters and print them, you'll see what look like normal characters, but in fact the first three characters each contain both an emoji and a zero-width joiner in their UnicodeScalarView:

for char in "👩‍👩‍👧‍👦".characters {
    print(char)

    let scalars = String(char).unicodeScalars.map({ String($0.value, radix: 16) })
    print(scalars)
}

// 👩‍
// ["1f469", "200d"]
// 👩‍
// ["1f469", "200d"]
// 👧‍
// ["1f467", "200d"]
// 👦
// ["1f466"]

As you can see, only the last character does not contain a zero-width joiner, so when using the contains(_:) method, it works as you'd expect. Since you aren't comparing against emoji containing zero-width joiners, the method won't find a match for any but the last character.

To expand on this, if you create a String which is composed of an emoji character ending with a zero-width joiner, and pass it to the contains(_:) method, it will also evaluate to false. This has to do with contains(_:) being the exact same as range(of:) != nil, which tries to find an exact match to the given argument. Since characters ending with a zero-width joiner form an incomplete sequence, the method tries to find a match for the argument while combining characters ending with a zero-width joiner into a complete sequence. This means that the method won't ever find a match if:

the argument ends with a zero-width joiner, and the string to parse doesn't contain an incomplete sequence (i.e. ending with a zero-width joiner and not followed by a compatible character).

To demonstrate:

let s = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}" // 👩‍👩‍👧‍👦

s.range(of: "\u{1f469}\u{200d}") != nil                            // false
s.range(of: "\u{1f469}\u{200d}\u{1f469}") != nil                   // false

However, since the comparison only looks ahead, you can find several other complete sequences within the string by working backwards:

s.range(of: "\u{1f466}") != nil                                    // true
s.range(of: "\u{1f467}\u{200d}\u{1f466}") != nil                   // true
s.range(of: "\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}") != nil  // true

// Same as the above:
s.contains("\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}")          // true

The easiest solution would be to provide a specific compare option to the range(of:options:range:locale:) method. The option String.CompareOptions.literal performs the comparison on an exact character-by-character equivalence. As a side note, what's meant by character here is not the Swift Character, but the UTF-16 representation of both the instance and comparison string. However, since String doesn't allow malformed UTF-16, this is essentially equivalent to comparing the Unicode scalar representation.

Here I've overloaded the Foundation method, so if you need the original one, rename this one or something:

extension String {
    func contains(_ string: String) -> Bool {
        return self.range(of: string, options: String.CompareOptions.literal) != nil
    }
}

Now the method works as it "should" with each character, even with incomplete sequences:

s.contains("👩")          // true
s.contains("👩\u{200d}")  // true
s.contains("\u{200d}")    // true

@MartinR According to the current UTR29 (Unicode 9.0), it is an extended grapheme cluster (rules GB10 and GB11), but Swift clearly uses an older version. Apparently fixing that is a goal for version 4 of the language, so this behaviour will change in future.
@MichaelHomer: Apparently that has been fixed, "👩‍👩‍👧‍👦".count evaluates to 1 with the current Xcode 9 beta and Swift 4.
Wow. This is excellent. But now I'm getting nostalgic for the old days when the worst problem I had with strings is whether they use C or Pascal style encodings.
I understand why the Unicode standard may need to support this, but man, this is an overengineered mess, if anything :/
Correct isn't overengineered.
Rob Napier

The first problem is you're bridging to Foundation with contains (Swift's String is not a Collection), so this is NSString behavior, which I don't believe handles composed Emoji as powerfully as Swift. That said, Swift I believe is implementing Unicode 8 right now, which also needed revision around this situation in Unicode 10 (so this may all change when they implement Unicode 10; I haven't dug into whether it will or not).

To simplify things, let's get rid of Foundation, and use Swift, which provides views that are more explicit. We'll start with characters:

"👩‍👩‍👧‍👦".characters.forEach { print($0) }
👩‍
👩‍
👧‍
👦

OK. That's what we expected. But it's a lie. Let's see what those characters really are.

"👩‍👩‍👧‍👦".characters.forEach { print(String($0).unicodeScalars.map{$0}) }
["\u{0001F469}", "\u{200D}"]
["\u{0001F469}", "\u{200D}"]
["\u{0001F467}", "\u{200D}"]
["\u{0001F466}"]

Ah… So it's ["👩ZWJ", "👩ZWJ", "👧ZWJ", "👦"]. That makes everything a bit more clear. 👩 is not a member of this list (it's "👩ZWJ"), but 👦 is a member.

The problem is that Character is a "grapheme cluster," which composes things together (like attaching the ZWJ). What you're really searching for is a unicode scalar. And that works exactly as you're expecting:

"👩‍👩‍👧‍👦".unicodeScalars.contains("👩") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("\u{200D}") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👧") // true
"👩‍👩‍👧‍👦".unicodeScalars.contains("👦") // true

And of course we can also look for the actual character that is in there:

"👩‍👩‍👧‍👦".characters.contains("👩\u{200D}") // true
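The scalar view's contains only checks for one scalar at a time. If you want to look for a run of several scalars (say, GIRL ZWJ BOY), something like the following might do. This is only a minimal sketch assuming Swift 4 or later; containsScalarSequence is a made-up helper name, not a standard-library API, and the naive scan is quadratic in the worst case.

```swift
/// Hypothetical helper: true if the Unicode scalars of `needle` appear as a
/// contiguous run inside the Unicode scalars of `haystack`.
func containsScalarSequence(_ haystack: String, _ needle: String) -> Bool {
    let h = Array(haystack.unicodeScalars)
    let n = Array(needle.unicodeScalars)
    if n.isEmpty { return true }            // the empty run is found everywhere
    if n.count > h.count { return false }   // needle cannot fit
    // Slide a window of needle's length across the haystack scalars.
    for start in 0...(h.count - n.count) {
        if h[start..<(start + n.count)].elementsEqual(n) {
            return true
        }
    }
    return false
}

// WOMAN ZWJ WOMAN ZWJ GIRL ZWJ BOY, written with escapes to avoid encoding issues.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(containsScalarSequence(family, "\u{1F469}"))                   // true
print(containsScalarSequence(family, "\u{200D}"))                    // true
print(containsScalarSequence(family, "\u{1F467}\u{200D}\u{1F466}"))  // true
print(containsScalarSequence(family, "\u{1F468}"))                   // false (MAN is not in there)
```

Because this never goes through Character or Foundation, the result does not depend on how grapheme clusters are formed.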

(This heavily duplicates Ben Leggiero's points. I posted this before noticing he'd answered. Leaving in case it is clearer to anyone.)


Wth does ZWJ stand for?
Zero Width Joiner
@RobNapier in Swift 4, String was allegedly changed back to a collection type. Does that affect your answer at all?
No. That just changed things like subscripting. It didn't change how Characters work.
Ky.

It seems that Swift considers a ZWJ to be an extended grapheme cluster with the character immediately preceding it. We can see this when mapping the array of characters to their unicodeScalars:

Array(manual.characters).map { $0.description.unicodeScalars }

This prints the following from LLDB:

▿ 4 elements
  ▿ 0 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 1 : StringUnicodeScalarView("👩‍")
    - 0 : "\u{0001F469}"
    - 1 : "\u{200D}"
  ▿ 2 : StringUnicodeScalarView("👧‍")
    - 0 : "\u{0001F467}"
    - 1 : "\u{200D}"
  ▿ 3 : StringUnicodeScalarView("👦")
    - 0 : "\u{0001F466}"

Additionally, .contains groups extended grapheme clusters into a single character. For instance, taking the hangul characters ᄒ, ᅡ, and ᆫ (which combine to make the Korean word for "one": 한):

"\u{1112}\u{1161}\u{11AB}".contains("\u{1112}") // false

This could not find ᄒ because the three codepoints are grouped into one cluster which acts as one character. Similarly, \u{1F469}\u{200D} (WOMAN ZWJ) is one cluster, which acts as one character.
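To make the Hangul case concrete without involving Foundation, here is a small sketch (assuming Swift 4 or later, where String is a collection of Character). It compares the Character view with the scalar view; the variable names are just for illustration.

```swift
// Three conjoining jamo: CHOSEONG HIEUH, JUNGSEONG A, JONGSEONG NIEUN.
let jamo = "\u{1112}\u{1161}\u{11AB}"
// The precomposed syllable U+D55C, which is canonically equivalent to the jamo run.
let syllable = "\u{D55C}"

print(jamo.count)                                  // 1  (one grapheme cluster)
print(jamo.unicodeScalars.count)                   // 3  (three code points)

// Character-level search: the lone leading jamo is not an element of the string.
print(jamo.contains(Character("\u{1112}")))        // false

// Scalar-level search: the code point is definitely in there.
print(jamo.unicodeScalars.contains("\u{1112}"))    // true

// String equality uses canonical equivalence, so both spellings compare equal.
print(jamo == syllable)                            // true
```

So the "missing" jamo is only missing at the Character level; the scalar view still sees all three code points.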


Brad Gilbert

The other answers discuss what Swift does, but don't go into much detail about why.

Do you expect “Å” to equal “Å”? I expect you would.

One of these is a letter with a combiner, the other is a single composed character. You can add many different combiners to a base character, and a human would still consider it to be a single character. To deal with this sort of discrepancy the concept of a grapheme was created to represent what a human would consider a character regardless of the codepoints used.
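For readers who do know Swift, the same point can be sketched in a few lines (my own illustration, assuming Swift 4 or later; it is not taken from the other answers). The two spellings differ at the code point level but compare equal as strings:

```swift
let decomposed = "A\u{030A}"   // LATIN CAPITAL LETTER A + COMBINING RING ABOVE
let precomposed = "\u{00C5}"   // LATIN CAPITAL LETTER A WITH RING ABOVE

// At the grapheme level they are the same single character.
print(decomposed == precomposed)                    // true
print(decomposed.count, precomposed.count)          // 1 1

// At the code point level they are different sequences.
print(decomposed.unicodeScalars.count)              // 2
print(precomposed.unicodeScalars.count)             // 1
print(decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars)) // false
```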

Now text messaging services have been combining characters into graphical emoji for years :) → 🙂. So various emoji were added to Unicode.
These services also started combining emoji together into composite emoji.
There of course is no reasonable way to encode all possible combinations into individual codepoints, so The Unicode Consortium decided to expand on the concept of graphemes to encompass these composite characters.

What this boils down to is "👩‍👩‍👧‍👦" should be considered as a single "grapheme cluster" if you are trying to work with it at the grapheme level, as Swift does by default.

If you want to check if it contains "👦" as a part of that, then you should go down to a lower level.

I don't know Swift syntax so here is some Perl 6 which has a similar level of support for Unicode. (Perl 6 supports Unicode version 9 so there may be discrepancies)

say "\c[family: woman woman girl boy]" eq "👩‍👩‍👧‍👦"; # True

# .contains is a Str method only, in Perl 6
say "👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦");    # True
say "👩‍👩‍👧‍👦".contains("👦");        # False
say "👩‍👩‍👧‍👦".contains("\x[200D]");  # False

# comb with no arguments splits a Str into graphemes
my @graphemes = "👩‍👩‍👧‍👦".comb;
say @graphemes.elems;                # 1

Let's go down a level

# look at it as a list of NFC codepoints
my @components := "👩‍👩‍👧‍👦".NFC;
say @components.elems;                     # 7

say @components.grep("👦".ord).Bool;       # True
say @components.grep("\x[200D]".ord).Bool; # True
say @components.grep(0x200D).Bool;         # True

Going down to this level can make some things harder though.

my @match = "👩‍👩‍👧‍👦".ords;
my $l = @match.elems;
say @components.rotor( $l => 1-$l ).grep(@match).Bool; # True

I assume that .contains in Swift makes that easier, but that doesn't mean there aren't other things which become more difficult.

Working at this level makes it much easier to accidentally split a string in the middle of a composite character for example.
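A rough Swift illustration of that risk (my own sketch, assuming Swift 4 or later; the variable names are arbitrary): cutting by code points can slice the family apart, while cutting by Character cannot.

```swift
// WOMAN ZWJ WOMAN ZWJ GIRL ZWJ BOY
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Code-point-level cut: keep only the first two scalars (WOMAN + ZWJ).
var cut = ""
cut.unicodeScalars.append(contentsOf: family.unicodeScalars.prefix(2))

// Grapheme-level cut: keep the first Character, which is the whole family.
let safe = String(family.prefix(1))

print(cut.unicodeScalars.map { String($0.value, radix: 16) })  // ["1f469", "200d"]
print(cut == family)    // false: the composite emoji was split apart
print(safe == family)   // true on Swift 4 or later, where the family is one Character
```

The truncated string ends in a dangling joiner, which is exactly the kind of incomplete sequence that the higher-level view protects you from.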

What you are inadvertently asking is why does this higher level representation not work like a lower level representation would. The answer is of course, it's not supposed to.

If you are asking yourself “why does this have to be so complicated”, the answer is of course “humans”.


You lost me on your last example line; what do rotor and grep do here? And what is 1-$l?
The term "grapheme" is at least 50 years old. Unicode introduced it to the standard because they'd already used the term "character" to mean something quite different from what one ordinarily thinks of as a character. I can read what you wrote as being consistent with that but suspect others might get the wrong impression, hence this (hopefully clarifying) comment.
@BenLeggiero First, rotor. The code say (1,2,3,4,5,6).rotor(3) yields ((1 2 3) (4 5 6)). That's a list of lists, each length 3. say (1,2,3,4,5,6).rotor(3=>-2) yields the same except the second sublist starts with 2 rather than 4, the third with 3, and so on, yielding ((1 2 3) (2 3 4) (3 4 5) (4 5 6)). If @match contains "👩‍👩‍👧‍👦".ords then @Brad's code creates just one sublist, so the =>1-$l bit is irrelevant (unused). It's only relevant if @match is shorter than @components.
grep tries to match each element in its invocant (in this case, a list of sublists of @components). It tries to match each element against its matcher argument (in this case, @match). The .Bool then returns True iff the grep produces at least one match.
Nilanshu Jaiswal

Swift 4.0 update

String received lots of revisions in the Swift 4 update, as documented in SE-0163. Two emoji are used for this demo, representing two different structures. Both are composed of a sequence of emoji.

👍🏽 is the combination of two emoji, 👍 and 🏽

👩‍👩‍👧‍👦 is the combination of four emoji, connected with zero-width joiners. The format is 👩joiner👩joiner👧joiner👦

1. Counts

In Swift 4.0 an emoji is counted as a grapheme cluster, so every single emoji is counted as 1. The count property is also available directly on String, so you can call it like this:

"👍🏽".count  // 1. Not available on swift 3
"👩‍👩‍👧‍👦".count  // 1. Not available on swift 3

The character array of a string is also counted in grapheme clusters in Swift 4.0, so both of the following lines print 1. These two emoji are examples of emoji sequences, where several emoji are combined with or without a zero-width joiner \u{200d} between them. In Swift 3.0, the character array of such a string splits it into several elements, one per component emoji, with each joiner attached to the emoji before it. In Swift 4.0, the character array sees the whole sequence as one piece, so the count for any single emoji is always 1.

"👍🏽".characters.count  // 1. In swift 3, this prints 2
"👩‍👩‍👧‍👦".characters.count  // 1. In swift 3, this prints 4

unicodeScalars remains unchanged in Swift 4. It exposes the individual Unicode scalars (code points) that make up the given string.

"👍🏽".unicodeScalars.count  // 2. Combination of two emoji
"👩‍👩‍👧‍👦".unicodeScalars.count  // 7. Combination of four emoji with joiner between them
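For completeness, the same two emoji can be measured in every view String offers. This is just a sketch, assuming Swift 4 or later; viewCounts is a hypothetical helper name used only for this demo.

```swift
/// Hypothetical helper: the length of a string in each of its views.
func viewCounts(_ s: String) -> (characters: Int, scalars: Int, utf16: Int, utf8: Int) {
    return (s.count, s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

let thumbsUp = "\u{1F44D}\u{1F3FD}"   // THUMBS UP SIGN + skin tone modifier
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(viewCounts(thumbsUp))  // (characters: 1, scalars: 2, utf16: 4, utf8: 8)
print(viewCounts(family))    // (characters: 1, scalars: 7, utf16: 11, utf8: 25)
```

Only the first number depends on the grapheme-breaking rules of the Swift version in use; the other three follow directly from the encoding.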

2. Contains

In Swift 4.0, the contains method appears to ignore the zero-width joiner in emoji. It returns true for any of the four component emoji of "👩‍👩‍👧‍👦", and returns false if you check for the joiner. In Swift 3.0, however, the joiner is not ignored and is combined with the emoji in front of it, so when you check whether "👩‍👩‍👧‍👦" contains any of the first three component emoji, the result is false.

"👍🏽".contains("👍")       // true
"👍🏽".contains("🏽")        // true
"👩‍👩‍👧‍👦".contains("👩‍👩‍👧‍👦")       // true
"👩‍👩‍👧‍👦".contains("👩")       // true. In swift 3, this prints false
"👩‍👩‍👧‍👦".contains("\u{200D}") // false
"👩‍👩‍👧‍👦".contains("👧")       // true. In swift 3, this prints false
"👩‍👩‍👧‍👦".contains("👦")       // true

Joe

Emojis, much like the Unicode standard, are deceptively complicated. Skin tones, genders, jobs, groups of people, zero-width joiner sequences, flags (two Unicode code points) and other complications can make emoji parsing messy. A Christmas Tree, a Slice of Pizza, or a Pile of Poop can all be represented with a single Unicode code point. On top of that, when new emojis are introduced there is a delay between the emoji release and iOS support, and different versions of iOS support different versions of the Unicode standard.

TL;DR. I have worked on these features and open sourced a library, JKEmoji (I am the author), to help parse strings with emojis. It makes parsing as easy as:

print("I love these emojis 👩‍👩‍👧‍👦💪🏾🧥👧🏿🌈".emojiCount)

5

It does that by routinely refreshing a local database of all recognized emojis as of the latest unicode version (12.0 as of recently) and cross-referencing them with what is recognized as a valid emoji in the running OS version by looking at the bitmap representation of an unrecognized emoji character.
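For comparison only, a much cruder count is possible with nothing but the standard library. To be clear, this is not how JKEmoji works and it does not use its API: it is a heuristic sketch that assumes Swift 5 or later (for Unicode.Scalar.Properties), and looksLikeEmoji and roughEmojiCount are names I made up. It knows nothing about what the running OS can actually render, and it can misfire on unusual sequences such as "#" followed by a combining mark.

```swift
extension Character {
    /// Heuristic (an assumption, not an official definition): treat a Character
    /// as an emoji if its first scalar defaults to emoji presentation, or if it
    /// is an emoji-capable scalar that has been combined with further scalars
    /// (for example a keycap sequence).
    var looksLikeEmoji: Bool {
        guard let first = unicodeScalars.first else { return false }
        return first.properties.isEmojiPresentation
            || (first.properties.isEmoji && unicodeScalars.count > 1)
    }
}

/// Hypothetical helper: counts the Characters that look like emoji.
func roughEmojiCount(_ text: String) -> Int {
    return text.filter { $0.looksLikeEmoji }.count
}

// Same sample sentence as above, written with escapes: family, flexed biceps
// with skin tone, coat, girl with skin tone, rainbow.
let sample = "I love these emojis \u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}\u{1F4AA}\u{1F3FE}\u{1F9E5}\u{1F467}\u{1F3FF}\u{1F308}"
print(roughEmojiCount(sample))
```

Plain letters, spaces and lone digits are skipped because they are neither emoji-presentation scalars nor multi-scalar clusters; each ZWJ or skin-tone sequence counts once because it is a single Character.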

NOTE

A previous answer got deleted for advertising my library without clearly stating that I am the author. I am acknowledging this again.


While I am impressed by your library, and I see how it is generally related to the topic at hand, I don't see how this directly relates to the question