Aggregate key-value lines in a file by keys in Java
I have a huge file of about 800M rows (60 GB). Rows can be duplicated, and each row consists of an id and a value. For example:
id1 valuea
id1 valueb
id2 valuea
id3 valuec
id3 valuea
id3 valuec
Note: in the real file the ids are not in order (and not grouped) as they are in the example.
I want to aggregate the rows by key, in this way:
id1 valuea,valueb
id2 valuea
id3 valuec,valuea
There are 5000 possible values.
The file doesn't fit in memory, so I can't use simple Java collections. Also, the greatest part of the lines are single (like id2 in the example), and they should be written directly to the output file.
For this reason, my first solution was to iterate over the file twice:
- In the first iteration, I store two structures that hold only ids, no values:
  - single-value ids (S1)
  - multiple-value ids (S2)
- In the second iteration, after discarding the single-value ids (S1) from memory, I write the single-value id-value pairs directly to the output file, checking that the id is not among the multiple-value ids (S2)
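The first iteration described above might look roughly like the following sketch. It is only an illustration under assumptions: the names TwoPassSketch and findMultiValueIds are made up, the id and the value are assumed to be separated by whitespace, and plain HashSets stand in for whatever structures were actually used. Since only ids are stored, a repeated identical row also moves its id into S2.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the first pass: classify ids as seen once (S1) or more than once (S2). */
final class TwoPassSketch {

    /** Returns S2, the ids that occur on more than one line; S1 is dropped on return. */
    static Set<String> findMultiValueIds(Reader input) {
        Set<String> single = new HashSet<>();   // S1: ids seen exactly once so far
        Set<String> multiple = new HashSet<>(); // S2: ids seen at least twice
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                // Assumption: the id is the first whitespace-separated token.
                String id = trimmed.split("\\s+", 2)[0];
                if (multiple.contains(id)) {
                    continue; // already known to have several rows
                }
                // add() returns false when the id was already in S1.
                if (!single.add(id)) {
                    single.remove(id);
                    multiple.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return multiple;
    }
}
```

In this sketch S1 ends up holding almost every id when most lines are single, so the memory it needs grows with the number of distinct ids in the file.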
The problem is that I cannot finish the first iteration because of memory limits.
I know that this problem could be tackled in several ways (key-value store, map reduce, external sort).
My question is: which method would be best suited to this and fast to implement? It is a one-off process, and I would prefer to use Java methods (not an external sort).
When dealing with such a large amount of data, we need to think out of the box and buffer the entire thing.
First: how is this already done elsewhere?
Let's say I have a 4 GB video and I'm trying to load it into my video player. The player needs to perform two main operations:
Buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)
Streaming - displaying the result (the video) in the software (the player)
Why? Because it would be impossible to load everything into memory at once (and we don't even need to: at any specific moment the user observes only a portion of the video from the buffer, which is itself a portion of the entire file).
Second: how can this help us?
We can do the same thing for large files:
- Split the main file into smaller files (each file contains X rows, where X is the 'buffer')
- Load each one into Java and group it
- Save the result to a new file
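A minimal sketch of the three steps above, under some assumptions: the class name ChunkGrouper, the method names, and the grouped-N.txt naming scheme are hypothetical, the id and the value are separated by whitespace, and rowsPerChunk is positive and small enough for one chunk to fit in the heap. Instead of writing the raw small files first, it groups each buffer of rows in memory and writes only the grouped result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: reads the big file one buffer of rows at a time and groups each buffer. */
final class ChunkGrouper {

    /** Groups "id value" lines into "id v1,v2" lines, dropping duplicate values per id. */
    static List<String> groupLines(List<String> lines) {
        // TreeMap keeps ids sorted; LinkedHashSet drops duplicates and keeps first-seen order.
        Map<String, Set<String>> byId = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            byId.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
        }
        List<String> grouped = new ArrayList<>(byId.size());
        for (Map.Entry<String, Set<String>> e : byId.entrySet()) {
            grouped.add(e.getKey() + " " + String.join(",", e.getValue()));
        }
        return grouped;
    }

    /** Reads the input rowsPerChunk lines at a time and writes one grouped file per chunk. */
    static List<Path> splitAndGroup(Path input, Path outDir, int rowsPerChunk) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(rowsPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == rowsPerChunk) {
                    chunkFiles.add(writeChunk(buffer, outDir, chunkFiles.size()));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeChunk(buffer, outDir, chunkFiles.size())); // last partial chunk
            }
        }
        return chunkFiles;
    }

    private static Path writeChunk(List<String> buffer, Path outDir, int index) throws IOException {
        Path chunk = outDir.resolve("grouped-" + index + ".txt"); // hypothetical naming scheme
        return Files.write(chunk, groupLines(buffer), StandardCharsets.UTF_8);
    }
}
```

A call such as ChunkGrouper.splitAndGroup(Paths.get("input.txt"), Paths.get("chunks"), 1_000_000) would be one way to use it; the paths and the chunk size here are placeholders, and the output directory is assumed to exist already.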
After this process we have many small files that contain information like this:
id1 valuea,valueb
id2 valuea
id3 valuec,valuea
So each grouped file contains fewer rows than the original small file it was derived from.
- We can now merge the grouped files, try to load the result into Java, and re-group everything
- If that process fails (the data is still too big), we can merge the small grouped files into several larger grouped files (and repeat the process)