koculu
(Ahmed Yasin Koculu)
March 30, 2021, 7:40am
1
I am trying to set an emoji (😀) on a node predicate.
This character is converted to an encoded string in JSON: “\uD83D\uDE00”
{
  set {
    <0x1f> <post-text> "\uD83D\uDE00" .
  }
}
After executing the above mutation, I get the following. I also checked the returned text in the browser, and Dgraph clearly corrupts the encoded string values:
"post-text": "\uFFFD\uFFFD",
This is the value I get, and it is why the smiley is shown as question marks in the browser.
Is this a bug or am I doing something wrong?
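(For context: \uFFFD is U+FFFD, the Unicode replacement character, which browsers typically render as a question mark. A quick check in plain Go, nothing Dgraph-specific, shows what the returned value actually contains:)
```
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Decode the JSON value that came back and inspect its code points.
	var got string
	if err := json.Unmarshal([]byte(`"\uFFFD\uFFFD"`), &got); err != nil {
		panic(err)
	}
	for _, r := range got {
		fmt.Printf("%U ", r) // prints: U+FFFD U+FFFD
	}
	fmt.Println()
}
```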
koculu
(Ahmed Yasin Koculu)
March 30, 2021, 12:37pm
3
That is not a valid Unicode encoding. \u is a special indicator that the following characters are the hexadecimal representation of the underlying Unicode character.
You can check this in the browser console.
Apparently, Dgraph's serializer does not respect Unicode encoding, or there is a bug.
MichelDiz
(Michel Diz)
March 30, 2021, 1:22pm
4
Dgraph doesn’t support special characters. You have to follow the JSON escaping rules in order to store those values. You have to (on your end) escape and unescape those before sending them to / reading them from Dgraph.
@docs maybe we should document this.
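In case it helps, here is one way the client-side handling might look, as a rough Go sketch (the helper name is made up, and it assumes the raw UTF-8 emoji itself stores fine and only the \u-escaped form gets mangled):
```
package main

import (
	"encoding/json"
	"fmt"
)

// unescapeJSONString decodes a JSON-escaped value such as \uD83D\uDE00
// back into raw UTF-8 ("😀"). Illustrative only: embedded quotes would
// need extra handling.
func unescapeJSONString(escaped string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(`"`+escaped+`"`), &s)
	return s, err
}

func main() {
	val, err := unescapeJSONString(`\uD83D\uDE00`)
	if err != nil {
		panic(err)
	}
	// Build the N-Quad with the raw emoji instead of the escape sequence.
	fmt.Printf("<0x1f> <post-text> %q .\n", val)
}
```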
koculu
(Ahmed Yasin Koculu)
March 30, 2021, 1:56pm
5
The JSON specification says: ‘To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as “\uD834\uDD1E”.’
This is not a special character; this is how JSON serialization and deserialization are defined.
I send values to Dgraph using the .NET JsonSerializer, and it encodes all emojis this way.
If the DQL language is designed to work on JSON data, it would be wise to support this.
JavaScript Object Notation (JSON) is a lightweight, text-based, language-independent data interchange format. It was derived from the ECMAScript Programming Language Standard. JSON defines a small set of formatting rules for the portable...
Related discussion: Consider adding a JavaScriptEncoder implementation that doesn't encode the block list or surrogate pairs. · Issue #42847 · dotnet/runtime · GitHub
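For reference, the surrogate-pair rule quoted above is easy to verify: the pair combines as 0x10000 + (hi - 0xD800)*0x400 + (lo - 0xDC00), and Go's standard utf16 package does exactly this:
```
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// \uD83D\uDE00 -> 0x10000 + (0xD83D-0xD800)*0x400 + (0xDE00-0xDC00) = 0x1F600
	fmt.Printf("%U %q\n", utf16.DecodeRune(0xD83D, 0xDE00), utf16.DecodeRune(0xD83D, 0xDE00))
	// The G clef example from the spec: \uD834\uDD1E -> U+1D11E
	fmt.Printf("%U %q\n", utf16.DecodeRune(0xD834, 0xDD1E), utf16.DecodeRune(0xD834, 0xDD1E))
}
```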
MichelDiz
(Michel Diz)
March 30, 2021, 1:59pm
6
Dgraph's string type doesn't support JSON, but it follows the JSON rules for escaping special characters. A backslash is a special character.
koculu
(Ahmed Yasin Koculu)
March 30, 2021, 2:02pm
7
We might create a bug / feature request to not convert the given input
“\uD83D\uDE00”
into this:
“\uFFFD\uFFFD”
If this syntax is not supported, Dgraph should throw an error instead of converting the bytes into some other value.
MichelDiz
(Michel Diz)
March 30, 2021, 2:09pm
8
If we throw an exception every time we see a special character, users will be really mad because it will happen all the time. We could have an escaping directive, but for now it is perfectly fine to handle it on your end.
koculu
(Ahmed Yasin Koculu)
March 30, 2021, 2:20pm
9
Yes, I handled the situation on my end. I insist on this to improve Dgraph.
These kinds of issues can be sent to the backlog to be implemented in the future.
koculu
(Ahmed Yasin Koculu)
March 30, 2021, 2:37pm
11
If I am not wrong, the DQL parser unquotes a string using the primitive strconv.Unquote function.
I am a newbie in Go, but from what I found, this might end up being a one-line code change, something like the following:
json.Unmarshal([]byte(str), &str)
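For illustration, here is a minimal Go sketch of the difference (not Dgraph's actual code): encoding/json combines the two escapes into a single code point via the surrogate-pair rule, whereas decoding each escape in isolation yields lone surrogates, which Go's UTF-8 encoder replaces with U+FFFD, exactly the "\uFFFD\uFFFD" seen above.
```
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	raw := `"\uD83D\uDE00"`

	// Standards-compliant JSON decoding combines the surrogate pair.
	var s string
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", s) // "😀"

	// Decoding each escape on its own gives lone surrogates, which are not
	// valid code points; converting them to a string yields U+FFFD each.
	broken := string(rune(0xD83D)) + string(rune(0xDE00))
	fmt.Printf("%+q\n", broken) // "\ufffd\ufffd"
}
```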
chewxy
(chewxy)
March 30, 2021, 8:57pm
12
I'll look into it. On a related note, there is a PR of mine for GraphQL (not DQL) that addresses a similar issue (it has yet to be merged):
vektah:master ← chewxy:master (opened 12:40 AM, 23 Dec 2020 UTC)
# The Problem #
Given a schema like this:
```
type Post {
  id: ID!
  …
  raw: String!
}
```
Here's a way to break a mutation:
```
mutation MyMutation {
  addPost(input: { raw: "\xaa Raw content" }) {
    numUids
  }
}
```
giving rise to the following error:
```
input:3: Unexpected <Invalid>
```
The issue is that `\xaa` is not parsed correctly.
# The Solution #
The solution is quite simple: add a case to the `readString()` method of the lexer to handle `x` as an escape character.
# Side Note: How Parsing of Strings Should Work
Please note that the handling of bad inputs for `\xHH` (where `H` is a meta-variable standing in for a hexadecimal number) is not quite the same as elsewhere in the package. I've got a good reason for this, and I am willing to make changes to the other escape sequences as well.
With this PR, the bad inputs will lead to the literals being returned - so that `"\xZZ"` will return `"\xZZ"`, while good inputs such as `"\xaa"` will be parsed correctly, returning `"ª"`.
I believe this is more user friendly.
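A rough sketch of that behavior (illustrative only, not the actual `readString()` code): valid hex after `\x` decodes to the character, anything else falls back to the literal text.
```
package main

import (
	"fmt"
	"strconv"
)

// decodeHexEscape handles a single \xHH escape: valid hex decodes to the
// corresponding character, anything else is returned as the literal text.
func decodeHexEscape(esc string) string {
	if len(esc) != 4 || esc[0] != '\\' || esc[1] != 'x' {
		return esc
	}
	v, err := strconv.ParseUint(esc[2:4], 16, 8)
	if err != nil {
		return esc // bad input: keep the literal bytes
	}
	return string(rune(v))
}

func main() {
	fmt.Println(decodeHexEscape(`\xaa`)) // ª
	fmt.Println(decodeHexEscape(`\xZZ`)) // \xZZ
}
```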
In the example I listed, the scenario is one where the server is receiving a mutation. The input string could be end-user input, and end users do often type the wrong things.
For example, consider a dyslexic person trying to write the sentence "they will give it to me/u". Said person would often type something along the lines of "they will give it to me\u". In this case, the extra parsing for UTF-8 characters in the string will cause this input to fail. What the user meant to type, in a string representation, is `"they will give it to me\\u"`.
An argument could be made that it is the onus of the user of this library to escape the string `"they will give it to me\u"` to `"they will give it to me\\u"`. My counter argument is that the role of a lexer is to simply find tokens. A string token that contains the literal bytes in `"they will give it to me\u"` would qualify as a string token. That gqlparser goes above and beyond in order to parse out the UTF8 characters in the contents of a string is most commendable.
But it should not return an error. In the example I have given so far, it would be very unfriendly to the end user, as well as the user of the library.
There can be a further argument to be made - that having the user of this library parse the string and escape any invalid sequences would be extra computation wasted. Now as a total, the program has to go through the string twice - once to escape bad sequences (done by the user of this library), and the second, to parse the correct escape sequences into UTF8 chars (done by this library). If we could save computation by doing all at once, we would make the world a much nicer place.
As I mentioned, I am willing to put the changes in for handling the rest of the bad escape sequences. I am also OK if you want to just keep returning errors for string parsing, and will modify this PR. Your call @vektah .
# End Notes
This was originally filed as an issue in Dgraph's community forum: https://discuss.hypermode.com/t/graphql-string-semantics-unclear-and-inconsistent-in-the-presence-of-escape-sequences/12019