The CSVs I have are of the form:
movie_id, title, genres, budget, revenue, director_name, director_id, cast_ids, cast_names, cast_characters
1, "Sunshine", ['Drama','Romance'], 100000, 8000, "David", 154, [2,5,8], ["Hannah","Laura","Carlos"], ["A","B","C"]
(genres, cast_ids, cast_names and cast_characters are lists)
So I would like to have two kinds of nodes, namely Person and Movie:
Person (person_id, person_name) -[:ACTED_IN or :DIRECTED_BY]-> Movie (movie_id, title, budget, revenue, genres (in list format))
- The ACTED_IN relationship has two edge attributes, 'character' and 'count'
- The DIRECTED_BY relationship has only the 'count' edge attribute
For example, for the row above, the nodes and relationships would look like this:
Person (person_id:154, person_name:'David') -[e:DIRECTED_BY (e.count)]-> Movie (movie_id:1, title:'Sunshine', budget:100000, revenue:8000, genres:['Drama','Romance'])
Person (person_id:2, person_name:'Hannah') -[r:ACTED_IN (r.character:'A', r.count)]-> Movie (movie_id:1, title:'Sunshine', budget:100000, revenue:8000, genres:['Drama','Romance'])
Person (person_id:5, person_name:'Laura') -[r:ACTED_IN (r.character:'B', r.count)]-> Movie (movie_id:1, title:'Sunshine', budget:100000, revenue:8000, genres:['Drama','Romance'])
Person (person_id:8, person_name:'Carlos') -[r:ACTED_IN (r.character:'C', r.count)]-> Movie (movie_id:1, title:'Sunshine', budget:100000, revenue:8000, genres:['Drama','Romance'])
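To make the mapping from one row to several relationships concrete, here is a minimal Python sketch (not Cypher). The function name expand_row and the dict layout are hypothetical, and it assumes the list columns have already been parsed into Python lists that are aligned by position.

```python
def expand_row(row):
    """Expand one parsed CSV row into a list of relationship records."""
    movie_id = row["movie_id"]
    # One DIRECTED_BY record for the director (no 'character' attribute).
    rels = [{
        "person_id": row["director_id"],
        "person_name": row["director_name"],
        "type": "DIRECTED_BY",
        "movie_id": movie_id,
    }]
    # One ACTED_IN record per cast member; ids, names and characters
    # are paired up by their position in the three lists.
    cast = zip(row["cast_ids"], row["cast_names"], row["cast_characters"])
    for person_id, person_name, character in cast:
        rels.append({
            "person_id": person_id,
            "person_name": person_name,
            "type": "ACTED_IN",
            "character": character,
            "movie_id": movie_id,
        })
    return rels


# The example row from above, already parsed (assumed layout).
row = {
    "movie_id": 1, "title": "Sunshine", "genres": ["Drama", "Romance"],
    "budget": 100000, "revenue": 8000,
    "director_name": "David", "director_id": 154,
    "cast_ids": [2, 5, 8],
    "cast_names": ["Hannah", "Laura", "Carlos"],
    "cast_characters": ["A", "B", "C"],
}

for rel in expand_row(row):
    print(rel)
```

This yields one DIRECTED_BY record and three ACTED_IN records, matching the four lines listed above.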
The count attribute should increment every time the same data is loaded. (I only added it so I can check that loading the same data again updates the existing relationship instead of creating a duplicate.)
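The behaviour I expect from count can be sketched in plain Python. This is only an in-memory illustration under assumed names (load, graph), not how Neo4j stores anything: a relationship is identified by (person_id, type, movie_id), created with count 1 the first time it is seen, and incremented on every later load.

```python
def load(graph, rels):
    """Merge relationships into graph, counting how often each one is loaded."""
    for person_id, rel_type, movie_id in rels:
        key = (person_id, rel_type, movie_id)
        # Create with count 1 on first sight, otherwise increment.
        graph[key] = graph.get(key, 0) + 1
    return graph


# Relationships from the example row (hypothetical tuple layout).
rels = [
    (154, "DIRECTED_BY", 1),
    (2, "ACTED_IN", 1),
    (5, "ACTED_IN", 1),
    (8, "ACTED_IN", 1),
]

graph = {}
load(graph, rels)  # first load: every count is 1
load(graph, rels)  # same data again: every count becomes 2
print(graph[(154, "DIRECTED_BY", 1)])  # prints 2
```

After two loads there are still only four relationships, each with count 2.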