Adjust the batchimport to the new features #2

Open
peterneubauer opened this issue Aug 17, 2013 · 2 comments

@peterneubauer
Contributor

Hi there,
I imported the MusicBrainz database into Neo4j using the following approach, with help from @jexp:

Define two indexes in batch.properties: an exact index named mbid for the MBIDs, and a fulltext index named mb for everything else:

batch_import.keep_db=false
batch_import.mapdb_cache.disable=true
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
batch_import.csv.quotes=false
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=300M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=0M
neostore.propertystore.db.index.keys.mapped_memory=15M
neostore.propertystore.db.index.mapped_memory=15M
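
As a quick sanity check, the indexes declared in batch.properties can be listed with a few lines of Python. This is only a sketch: the helper name `node_indexes` is made up for illustration, and it assumes the simple `key=value` layout shown above (no line continuations or escapes).

```python
def node_indexes(properties_text):
    """Return {index_name: index_type} for batch_import.node_index.* entries."""
    prefix = "batch_import.node_index."
    indexes = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(prefix):
            indexes[key[len(prefix):]] = value.strip()
    return indexes


sample = """\
batch_import.keep_db=false
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
cache_type=none
"""
print(node_indexes(sample))
```

The resulting names (mb and mbid here) are the ones the CSV header columns should refer to.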

Then put the indexing instructions directly in the header row of the nodes.csv and rels.csv files, so we don't need the ...index.csv files anymore; see https://github.com/jexp/batch-import -> automatic indexing

kind:string:mb  comment status  position    name:string:mb  area    gender  format  barcode number  ended   length  end_date_year   begin_date_year mbid:string:mbid    type:string:mb  pk
artist              Talkshow Boy                        f               e8d94cf5-fafa-48fc-a6fa-aa50cf54d7f3        288762
artist              Vibulator                       f               735bfaad-6eb1-4f9c-b21d-cbaef7c79a92        97944
artist              Eat Me                      f               c38a93e8-2ecf-4848-b1d2-364202d9dc0c    Group   499198
artist              Uffe Andersen                       f               a7f3c871-3ba3-40b1-ba58-d08b40312789    Person  514886
artist              Headust                     f               eda60727-7036-437b-b53d-ae472818ee3a        212148
artist              Sons Of The Subway                      f               232d5716-c2b2-47e1-aa0c-264ec69e6a18        100774
artist              The Poe Boy Family                      f               672d599e-6a6c-456e-98ba-dac5a45e3ed8        43132
artist              Ralph Gusovius  Germany Male                f           1950    6ecfcea1-677d-427b-a38b-9c76ce92e313    Person  295052
artist              Elastik Band                        f               46e0639c-1ccf-45f5-b886-4cbf5549a2a1        61467
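
A file like the one above could be produced with a small script. The sketch below is an illustration only: it assumes the header convention visible in the sample (`name:type:index`, where the third part names the index the column goes into), uses a reduced column set, and the helper names `index_of` and `write_nodes` are hypothetical. Fields are joined with tabs and never quoted, in line with batch_import.csv.quotes=false.

```python
import io

# Reduced column set for illustration; the third part of a column
# (e.g. "mbid" in "mbid:string:mbid") names the index it is added to.
COLUMNS = ["kind:string:mb", "name:string:mb", "mbid:string:mbid", "type:string:mb", "pk"]


def index_of(column):
    """Return the index named by a header column, or None if it is not indexed."""
    parts = column.split(":")
    return parts[2] if len(parts) == 3 else None


def write_nodes(rows, out):
    """Write a tab-separated node file: one header line, then one line per row."""
    out.write("\t".join(COLUMNS) + "\n")
    keys = [column.split(":")[0] for column in COLUMNS]
    for row in rows:
        values = [str(row.get(key, "")) for key in keys]
        # No quoting is applied, so tabs or newlines would corrupt the file.
        if any("\t" in value or "\n" in value for value in values):
            raise ValueError("field contains a tab or newline: %r" % (values,))
        out.write("\t".join(values) + "\n")


buf = io.StringIO()
write_nodes(
    [{"kind": "artist", "name": "Eat Me",
      "mbid": "c38a93e8-2ecf-4848-b1d2-364202d9dc0c", "type": "Group", "pk": 499198}],
    buf,
)
print(buf.getvalue())
```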

Then import the two files with something like:

java -Xmx10G -server -Dfile.encoding=UTF-8 -jar ~/neo/batch-import/target/batch-import-jar-with-dependencies.jar ./graph.db nodes.csv rels.csv 
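
If the import is driven from a script, the same command line can be assembled as an argument list. This is a minimal sketch; the function name `import_command` and its parameters are made up for illustration, and the jar path depends on where batch-import was built.

```python
import shlex


def import_command(jar, store_dir, nodes_csv, rels_csv, heap="10G"):
    """Build the java invocation shown above as a list of arguments."""
    return [
        "java", "-Xmx" + heap, "-server", "-Dfile.encoding=UTF-8",
        "-jar", jar, store_dir, nodes_csv, rels_csv,
    ]


cmd = import_command("batch-import-jar-with-dependencies.jar",
                     "./graph.db", "nodes.csv", "rels.csv")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would then start the import without going through a shell.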

WDYT? It would make the output a lot easier, and the import took about 10 minutes on my machine for 160M properties and 75M relationships ...

@redapple
Owner

Thank you Peter,

It happens that I already started a branch, "multi_nodescsv", in this very direction:
https://github.com/redapple/sql2graph/tree/multi_nodescsv
This branch also uses several node CSV files (another recent batchimport feature),
and these CSV files can be generated directly by the database engine (at least PostgreSQL, in the case of MusicBrainz).
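
To illustrate what letting the database engine write the file might look like, here is a rough sketch that builds a PostgreSQL `COPY ... TO STDOUT` statement whose column aliases are the batch-import header names. It is not taken from the branch: the helper `copy_statement` is hypothetical, and the table and column names (`artist`, `name`, `gid`, `id`) are assumptions about the MusicBrainz schema. Values are interpolated directly into the SQL, so it is only suitable for hard-coded inputs.

```python
def copy_statement(table, kind, columns):
    """Build a COPY statement exporting `table` as a tab-separated node file.

    columns: list of (sql_expression, csv_header) pairs.
    """
    select_list = ", ".join(
        '{} AS "{}"'.format(expr, header) for expr, header in columns
    )
    query = "SELECT '{}' AS \"kind:string:mb\", {} FROM {}".format(
        kind, select_list, table
    )
    # E'\t' is PostgreSQL's escape-string syntax for a tab delimiter.
    return "COPY ({}) TO STDOUT WITH (FORMAT csv, HEADER true, DELIMITER E'\\t')".format(query)


stmt = copy_statement(
    "artist", "artist",
    [("name", "name:string:mb"), ("gid", "mbid:string:mbid"), ("id", "pk")],
)
print(stmt)
```

The statement could then be run with psql or a database driver, redirecting the output to one node file per entity.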

The branch is not very clean yet. It already uses automatic indexing for MusicBrainz, but in a different (and more naïve) way. By comparison, your single exact "mbid" index for all MBIDs is smart; in the branch I am using one index per entity (artists, labels...) and indexing "mbid" in each, which is definitely less elegant.

I should be back online in the coming days, so I can work on updating the "multi_nodescsv" branch with your ideas and merge them into master if we converge.

As for the processing times you have, I'm afraid I don't have that much RAM on my laptop or server :) (I run out of memory when importing too many entities).
Great to hear you've been able to import all of MusicBrainz!

@peterneubauer
Contributor Author

Yes,
let's connect, I am peter.neubauer on Skype!
