Feature request: full text search with tf-idf Scoring

diggy · June 26, 2019, 3:44pm

MichelDiz commented :

This feature does not currently exist. There is not much to do. But I have an idea that can help you. But you will have to use Facets.

The idea is to map the text by your application and then create a word structure. So you can track its usage. The only problem is that if you need the words to be unique in the system, you will need to manually add the number of repetitions. Inside the facets by the Map_text_body edge. Otherwise you would create several nodes of equal words. You can do it, but I think it would be kind of messy.

Using this structure you can create a system that analyzes words. You can identify texts that use a specific word thus identifying similarities. Let’s say several texts contain the word “Dgraph”. You can search for the word “Dgraph” and use @ reverse to identify which texts use it and also to collect values in facets.

An auto-complete system would be easily built. For you would just collect the mapped words in “Map_text_body”. And you can also rank using values on facets.

Here is a map of the structure itself. (Remember, this is just an idea for now)

You have the node with the text body and you have an edge that will contain the words mapped to nodes. Each word is a node.

Below I explain a little better.

The Schema

Map_text_body: [uid] @reverse . #pay attention, this [uid] syntax is just for master or Dgraph 1.1
# Map_text_body: uid @reverse . #Use this for oldest versions.
name: string @index(exact) .
Text_body: string .

Mutation

{
  set {
    _:Blank <name> "My Text Node" .
    _:Blank <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ..." .

   #The mapping
    _:Blank <Map_text_body> _:Sir (times=1) .
    _:Blank <Map_text_body> _:heart (times=1) .
    _:Blank <Map_text_body> _:in_ (times=2) .
    _:Blank <Map_text_body> _:there (times=1) .
    _:Blank <Map_text_body> _:was (times=1).
    _:Blank <Map_text_body> _:my (times=1).
    
   #Creating the mapped words
    _:Sir 		<name> "Sir" .
    _:heart		<name> "heart" .
    _:in_ 		<name> "in" .
    _:there 	        <name> "there" .
    _:was 		<name> "was" .
    _:my 		<name> "my" .

    # Please add more words based on the "Text_body"
  }
}

Query

{
  q(func: eq(name,"My Text Node")) {
    uid
    Text_body
    Map_text_body @facets(orderdesc: times) {
      uid
      name
    }
  }
}

Response

{
  "data": {
    "q": [
      {
        "uid": "0x2",
        "Text_body": "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ...",
        "Map_text_body": [
          {
            "uid": "0x5",
            "name": "in",
            "Map_text_body|times": 2
          },
          {
            "uid": "0x3",
            "name": "Sir",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x4",
            "name": "heart",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x6",
            "name": "there",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x7",
            "name": "was",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x8",
            "name": "my",
            "Map_text_body|times": 1
          }
        ]
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 15859,
      "processing_ns": 1857051,
      "encoding_ns": 598937
    },
    "txn": {
      "start_ts": 25
    }
  }
}

Using Facets you can do various functions in Dgraph. Like assign Facet values to a variable, Sorting, Aggregation, Filtering and so on.

BTW

You will soon be able to use Upsert Block, this will make this process faster and more automatic.

But of course, you will have to map and create the block (write the query) manually or by some process that you have created in your application.

e.g:

 upsert {
      query {
		#Check if the text already exist
        var(func: eq(name,"My Text Node")) {
          v_Text as uid
        }

		#Check if any words already exist
        var(func: has(type_mapped_word)) @filter(eq(name,"Sir")) {
          v_Sir as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"heart")) {
          v_heart as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"in")) {
          v_in as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"there")) {
          v_there as uid
        }
    # You're free to define the form that will categorize the type.
    # Here I am using "type_mapped_word".

      }
      mutation {
          set {
           uid(v_Text) <name> "My Text Node" .
           uid(v_Text) <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ..." .
           uid(v_Text) <Map_text_body>  uid(v_Sir) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_heart) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_in) (times=2) .
           uid(v_Text) <Map_text_body> uid(v_there) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_was) (times=1).
           uid(v_Text) <Map_text_body> uid(v_my) (times=1).
    
           uid(v_Sir) 		<name> "Sir" .
           uid(v_Sir)) 		<type_mapped_word> "" .

           uid(v_heart)		<name> "heart" .
           uid(v_heart)		<type_mapped_word> "" .

           uid(v_in) 		<name> "in" .
           uid(v_in) 		<type_mapped_word> "" .

           uid(v_there) 	<name> "there" .
           uid(v_there) 	<type_mapped_word> "" .

           uid(v_was) 		<name> "was" .
           uid(v_was) 		<type_mapped_word> "" .

           uid(v_my) 		<name> "my" .
           uid(v_my) 		<type_mapped_word> "" .
          }
      }
    }

Topic		Replies	Views
[RFC] TF-IDF scoring for fulltext search in Dgraph Dev rfc , dgraph , area:querylang	6	979	June 3, 2021
Scoring FTS results Dgraph	1	571	June 24, 2019
Fuzzy Full Text Search Dgraph kind:question , dql	0	1140	December 4, 2021
Get Started with Dgraph - Fuzzy Search on Social Graphs - Dgraph documentation Documentation	0	437	August 28, 2020
Get positions of matching terms in query Dgraph kind:question	10	474	June 10, 2023