The allofterms fuzzy search produces incorrect results

Dgraph: v20.11.2

The allofterms fuzzy search produces incorrect results.

The scenario is as follows.

Schema:

		apple_movietrailer_id: string .
		fandango_id: string .
		initial_release_date: datetime @index(year) .
		metacritic_id: string .
		name: string @index(hash, term, trigram, fulltext) @lang .
		netflix_id: string .
		prequel: [uid] .
		rottentomatoes_id: string .
		actor.dubbing_performances: [uid] .
		rating: [uid] @reverse .
		country: [uid] @reverse .
		rated: [uid] @reverse .

		type Film {
			apple_movietrailer_id: string
			fandango_id: string
			initial_release_date: dateTime
			metacritic_id: string
			name: string
			netflix_id: string
			prequel: [Film]
			rottentomatoes_id: string
		}
		
		type Actor {
			name: string
			actor.dubbing_performances: [Film]
		}	

When I load the following data with this mutation:

	{
		set{
			_:a <name> "Jackie Chan"@en .
			_:b <name> "Jet Li"@en .
			_:c <name> "Bruce Lee"@en .
			_:a <name> "成龙"@cn .
			_:b <name> "李连杰"@cn .
			_:c <name> "李小龙"@cn .

			_:a <dgraph.type> "Actor" .
			_:b <dgraph.type> "Actor" .
			_:c <dgraph.type> "Actor" .
	
		}
	}

The results were correct:

// query
	{
		var(func: allofterms(<name>@., "成龙")) @filter((allofterms(<name>@., "成龙")) and type(<Actor>)) {
			uid0 as uid
		}
		statistics(func: uid(uid0)) { count(uid) }
		q(func: uid(uid0), first: 40, offset: 0) {
			dgraphType: dgraph.type
			expand(_all_)
		}
	}

// results
{"statistics":[{"count":1}],"q":[{"dgraphType":["Actor"],"name@en":"Jackie Chan","name@cn":"成龙"}]}

But when I load the following data instead, where the only change is the language tag (zh instead of cn):

	{
		set{
			_:a <name> "Jackie Chan"@en .
			_:b <name> "Jet Li"@en .
			_:c <name> "Bruce Lee"@en .
			_:a <name> "成龙"@zh .
			_:b <name> "李连杰"@zh .
			_:c <name> "李小龙"@zh .

			_:a <dgraph.type> "Actor" .
			_:b <dgraph.type> "Actor" .
			_:c <dgraph.type> "Actor" .
	
		}
	}

The results of the same query were incorrect (nothing is returned):

// query
	{
		var(func: allofterms(<name>@., "成龙")) @filter((allofterms(<name>@., "成龙")) and type(<Actor>)) {
			uid0 as uid
		}
		statistics(func: uid(uid0)) { count(uid) }
		q(func: uid(uid0), first: 40, offset: 0) {
			dgraphType: dgraph.type
			expand(_all_)
		}
	}

// results
{"statistics":[{"count":0}],"q":[]}

Is that a bug? All I did was replace cn with zh.

This is a known issue, and work has been done on it. The solution, it turned out, was to use a custom tokenizer for CJK languages. It hasn’t been merged into the mainline yet; I’ll get it done this week.
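To make the tokenizer idea concrete, here is a minimal Python sketch of one possible CJK-aware scheme. It is not the tokenizer mentioned above, whose code is not shown in this thread. It assumes a per-character split for CJK ideographs (only the basic U+4E00 to U+9FFF block) and lowercased alphanumeric words otherwise; `is_cjk`, `cjk_tokens`, and `all_of_terms` are hypothetical names chosen for illustration.

```python
def is_cjk(ch):
    """Rough check covering only the basic CJK Unified Ideographs block (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_tokens(text):
    """Emit each CJK character as its own token and other alphanumeric runs as lowercased words."""
    tokens = []
    word = []

    def flush():
        # Close the current non-CJK word, if any.
        if word:
            tokens.append("".join(word).lower())
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            flush()
    flush()
    return tokens


def all_of_terms(value, query):
    """True if every token of the query appears among the tokens of the value."""
    value_tokens = set(cjk_tokens(value))
    return all(tok in value_tokens for tok in cjk_tokens(query))


if __name__ == "__main__":
    for name in ["成龙", "李连杰", "李小龙", "Jackie Chan"]:
        print(name, cjk_tokens(name), all_of_terms(name, "成龙"))
```

The point of the sketch is that the stored value and the query string must go through the same tokenizer. If the index and the query are tokenized differently, the token sets never overlap and a match like the one in this report can silently return nothing.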

Also, yes, you should use zh instead of cn.
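For context, `zh` is the ISO 639-1 language code for Chinese, while `cn` is the country code for China and not a language code. A small pre-load check along these lines can catch such tags before data is written. This is only a sketch: `KNOWN_LANGS` is a deliberately tiny, hypothetical allow-list (not Dgraph's own list), and the regular expression handles simple quoted literals only.

```python
import re

# Hypothetical allow-list of primary language subtags; extend it for your data.
KNOWN_LANGS = {"en", "zh", "ja", "ko", "de", "fr"}

# Matches a quoted literal followed by a language tag, e.g. "成龙"@zh
LANG_TAG = re.compile(r'"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')


def unknown_lang_tags(nquads):
    """Return the sorted language tags whose primary subtag is not in KNOWN_LANGS."""
    bad = set()
    for match in LANG_TAG.finditer(nquads):
        primary = match.group(1).split("-")[0].lower()
        if primary not in KNOWN_LANGS:
            bad.add(match.group(1))
    return sorted(bad)


if __name__ == "__main__":
    data = '_:a <name> "Jackie Chan"@en .\n_:a <name> "成龙"@cn .'
    print(unknown_lang_tags(data))
```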

Thank you very much.

@chewxy How is the work going, and how can I follow the latest updates? I’m affected by this issue too.

Hello, does this problem still exist in Dgraph v20.11.3?

Yes, it still exists. I built a new tokenizer, but I have not yet incorporated it into the main repo.

Excuse me, have you made any progress on this?