Assignee: Christidis Nikolaos (chriniko - [email protected])
This assessment is based on the popular website IMDb which offers movies and TV shows information. They have kindly made their dataset publicly available at IMDb Datasets.
Your mission, should you choose to accept it, is to write a web application in Java or Scala that can fulfil the following requirements:
-
Requirement #1 (easy): IMDb copycat: Present the user with endpoint for allowing them to search by movie’s primary title or original title. The outcome should be related information to that title, including cast and crew.
-
Requirement #2 (easy): Top rated movies: Given a query by the user, you must provide what are the top rated movies for a genre (If the user searches horror, then it should show a list of top rated horror movies).
-
Requirement #3 (easy): Typecasting: Given a query by the user, where he/she provides an actor/actress name, the system should determine if that person has become typecasted (at least half of their work is one genre).
-
Requirement #4 (easy): Find the coincidence: Given a query by the user, where the input is two actors/actresses names, the application replies with a list of movies or TV shows that both people have shared.
-
Requirement #5 (difficult): Six degrees of Kevin Bacon: Given a query by the user, you must provide what’s the degree of separation between the person (e.g. actor or actress) the user has entered and Kevin Bacon.
-
Stretch goal:
-
Add a UI to requirement #1 to present the result of such endpoint (you can also use OMDb,to include more details than the one present in the database, such as poster image, title description and so on).
-
Generalise for n amount of names on requirement #4.
-
-
Requirement 1 [DONE]
-
Requirement 2 [DONE]
-
Requirement 3 [DONE]
-
Requirement 4 [DONE]
-
Requirement 5 [DONE]
-
Stretch Goal 1 [NOT DONE]
-
Stretch Goal 2 [DONE]
-
You need to download Imdb csv dataset from here: https://www.imdb.com/interfaces/, https://datasets.imdbws.com/
-
You should download the following files:
name.basics.tsv.gz
title.akas.tsv.gz
title.basics.tsv.gz
title.crew.tsv.gz
title.episode.tsv.gz
title.principals.tsv.gz
title.ratings.tsv.gz
-
And put them in
src/main/resources/data
folder, so theDataLoader.java#initZippedFilesBindings()
method could work correctly
-
-
Have a running MySQL Server, so execute:
docker-compose up
in order to start MySQL -
Time to load the data from CSV Imdb files to MySQL:
-
One way is to import the dump I ship together with the assignment, folder which contains dump:
TODO
-
The other way is to:
-
Set the value of
output.directory
inapplication.properties
to a correct one, so application is able to extract the compressed files of IMDb. -
Change property of
spring.jpa.hibernate.ddl-auto
in application.properties tocreate
-
Start the application
-
Then hit via HTTP, POST method on:
http://localhost:1711/data/load
, which starts an ETL job (from CSV -> to MySQL), it takes around: [58 minutes - 1 hour](Note: 2018-11-04 21:13:28.189 INFO 7086 --- [pool-1-thread-1] c.c.l.movies.service.data.DataLoader : Total time to load data in ms: 3524434)
-
IMPORTANT, After successful etl operation, do not forget to change
spring.jpa.hibernate.ddl-auto
inapplication.properties
tovalidate
ornone
-
IMPORTANT, Do not forget to go to
MySQL Indexes Created
section in this file and create all the indexes, it is vital for response.
-
-
-
Run Neo4j via docker:
docker run \ --publish=7474:7474 --publish=7687:7687 \ --volume=$HOME/neo4j/data:/data \ --volume=$HOME/neo4j/logs:/logs \ --volume=$HOME/plugins:/plugins \ --volume=$HOME/import:/var/lib/neo4j/import \ neo4j:3.0
-
Common credentials:
neo4j/neo4j
||neo4j/1234
-
Time to load the data from MySQL to Neo4j:
-
One way is to cd into the $HOME/neo4j/data (see docker command) and replace the contents with those I have inside folder: Dump_for_Neo4j, do the same for $HOME/neo4j/logs
-
The other way is to:
- Hit via HTTP,
POST
method on:http://localhost:1711/acquaintance-links/prepare
, which starts anETL job (from MySQL -> to Neo4j)
, it takes a lot of time in order to complete so I advise you to wait for 5-8 completed steps (see logs) and then play with the degrees requirement. Keep in mind that the integration test for Req.5 (Requirement5_ITSpec.groovy) might not pass due to different loaded dataset.
- Hit via HTTP,
-
-
IMPORTANT, first you should have a running and populated(from etl operation, csv->sql) MySQL db.
-
Test case
Requirement5_ITSpec.groovy
might not pass due to different loaded dataset inside Neo4j, so it is ignored, but it is recommended to play with it. -
Execute
mvn clean verify
-
Load data (csv -> mysql) via http endpoint:
- localhost:1711/data/load
-
Requirement 1, use (GET):
localhost:1711/search/titles/?query=title:Ocean's Eleven&full-fetch=true
localhost:1711/search/titles/?query=title:alpha&full-fetch=true
-
Requirement 2, use (GET):
localhost:1711/search/titles/?query=genre:horror&full-fetch=false
localhost:1711/search/titles/?query=genre:Horror&full-fetch=false
-
Requirement 3, use (GET):
localhost:1711/search/names/?name=Will Smith&full-fetch=false
localhost:1711/search/names/?name=Charles Bronson&full-fetch=false
localhost:1711/search/names/?name=Steve McQueen&full-fetch=false
localhost:1711/search/names/?name=Danny DeVito&full-fetch=false
-
Requirement 4, use:
-
POST with body:
{ "names":["Angelina Jolie", "Brad Pitt"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["Orlando Bloom", "Johnny Depp"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["Brad Pitt", "Robert Redford"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["George Clooney", "Brad Pitt"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["George Clooney", "Matt Damon"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["George Clooney", "Matt Damon", "Brad Pitt"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["John Travolta", "Uma Thurman", "Samuel L. Jackson", "Bruce Willis"] }
To URL:
localhost:1711/search/names/coincidence
-
POST with body:
{ "names":["Al Pacino", "Robert De Niro", "Val Kilmer", "Jon Voight"] }
To URL:
localhost:1711/search/names/coincidence
-
-
Requirement 5, POST
localhost:1711/acquaintance-links
with body:{ "sourceFullName":"Matt Damon", "targetFullName":"Kevin Bacon" }
-
aka -> title_id
ALTER TABLE `movies_search_service_db`.`aka` ADD INDEX `title_id_idx` (`title_id` ASC);
-
basic -> primary_title
ALTER TABLE `movies_search_service_db`.`basic` ADD INDEX `primary_title_idx` (`primary_title` ASC);
-
basic -> original_title
ALTER TABLE `movies_search_service_db`.`basic` ADD INDEX `original_title_idx` (`original_title` ASC);
-
episode -> parent_tconst
ALTER TABLE `movies_search_service_db`.`episode` ADD INDEX `parent_tconst_idx` (`parent_tconst` ASC);
-
basic -> genres
ALTER TABLE `movies_search_service_db`.`basic` ADD INDEX `genres_idx` (`genres` ASC);
-
principal -> tconst
ALTER TABLE `movies_search_service_db`.`principal` ADD INDEX `principal_tconst_idx` (`tconst` ASC);
-
name -> primary_name
ALTER TABLE `movies_search_service_db`.`name` ADD INDEX `primary_name_idx` (`primary_name` ASC);
-
principal -> nconst
ALTER TABLE `movies_search_service_db`.`principal` ADD INDEX `principal_nconst_idx` (`nconst` ASC);
- "Film-Noir"
- "Action"
- "War"
- "History"
- "Western"
- "Documentary"
- "Sport"
- "Thriller"
- "News"
- "Biography"
- "Adult"
- "Comedy"
- "Mystery"
- "Musical"
- "Short"
- "Talk-Show"
- "Adventure"
- "Horror"
- "Romance"
- "Sci-Fi"
- "Drama"
- "Music"
- "Game-Show"
- "Crime"
- "Fantasy"
- "Animation"
- "Family"
- "Reality-TV"
-
select distinct category from principal limit 30590275;
-
Results:
- self
- director
- cinematographer
- composer
- producer
- editor
- actor
- actress
- writer
- production_designer
- archive_footage
- archive_sound