Iterate over input data, don't load into memory #170
The current implementation loads the entire input file into memory, which leads to unbounded memory growth and eventual exhaustion on large data sets. This PR is a proof of concept for handling out-of-core data sets by iterating over the input instead.
Notes:
- I am not a Java developer; this works, but it can likely be improved. The PR is mainly for visibility into the issue, which I spent a week or more on, on and off.
- This does not solve the issue of the LDA model training trying to load all the data into memory; if possible, we should find a way to make that step iterable as well.
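As a minimal sketch of the iteration idea (not the PR's actual code, and the class and file names here are hypothetical), `Files.lines` yields a lazy `Stream<String>` that reads one line at a time, so peak memory stays roughly constant regardless of file size, in contrast to `Files.readAllLines`, which materializes the whole file as a `List<String>`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class StreamingInputDemo {
    // Process the input lazily: each line is read, handled, and discarded,
    // so the whole file is never resident in memory at once.
    static long countNonEmptyLines(Path input) throws IOException {
        try (Stream<String> lines = Files.lines(input)) {
            return lines.filter(line -> !line.isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input, written to a temp file for the demo.
        Path tmp = Files.createTempFile("streaming-demo", ".txt");
        Files.write(tmp, List.of("doc one", "", "doc two", "doc three"));
        System.out.println(countNonEmptyLines(tmp)); // prints 3
        Files.delete(tmp);
    }
}
```

The try-with-resources block matters: `Files.lines` holds an open file handle, so the stream must be closed after the terminal operation.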
Further improvements: