memory - Aggregate key-value lines in a file by keys in Java


I have a huge file of ~800M rows (60 GB). Rows can be duplicated, and each one is composed of an id and a value. For example:

id1   valuea
id1   valueb
id2   valuea
id3   valuec
id3   valuea
id3   valuec

Note: in the real file the ids are not in order (or grouped) as they are in the example.

I want to aggregate the rows by key, in this way:

id1   valuea,valueb
id2   valuea
id3   valuec,valuea

There are 5000 possible values.

The file doesn't fit in memory, so I can't use simple Java collections. Also, most of the lines are single (like id2 in the example), and they should be written directly to the output file.

For this reason, my first solution was to iterate over the file twice:

  • In the first iteration, I store 2 structures, holding only ids and no values:
    • single-value ids (S1)
    • multiple-value ids (S2)
  • In the second iteration, after discarding the single-value ids (S1) from memory, I write the single id-value pairs directly to the output file, checking that their ids are not among the multiple-value ids (S2)
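To make the two iterations concrete, here is a minimal sketch of the idea. The class and method names are made up for illustration, lines are assumed to be well formed (`id`, whitespace, `value`), and the output is collected in a `List` with a tab separator only to keep the example small; in the real case it would be a writer on the output file.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical names; a sketch of the two-pass idea, not a tuned implementation.
final class TwoPassAggregator {

    // Pass 1: collect the ids that occur on more than one row (S2).
    // The set of ids seen so far (S1) is dropped when this method returns.
    static Set<String> findMultiRowIds(Iterable<String> lines) {
        Set<String> seen = new HashSet<>();      // S1
        Set<String> multiple = new HashSet<>();  // S2
        for (String line : lines) {
            String id = line.trim().split("\\s+", 2)[0];
            if (!seen.add(id)) {                 // add() returns false if already present
                multiple.add(id);
            }
        }
        return multiple;
    }

    // Pass 2: rows whose id is not in S2 go straight to the output;
    // the other rows are grouped in memory and emitted at the end.
    static List<String> aggregate(Iterable<String> lines, Set<String> multiple) {
        List<String> out = new ArrayList<>();    // stands in for the output file
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (multiple.contains(parts[0])) {
                grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
            } else {
                out.add(parts[0] + "\t" + parts[1]);
            }
        }
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return out;
    }
}
```

Both sets in the first pass hold plain `String` ids, which is exactly the part that grows with the number of distinct ids.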

The problem is that I cannot finish the first iteration because of memory limits.

I know the problem could be tackled in several ways (key-value store, map reduce, external sort).

My question is: which method would be the most suitable to use and fast to implement? It is a one-off process, and I prefer to use Java methods (not an external sort).

When dealing with such a large amount of data, you need to think outside the box and buffer the entire thing.

First: how is this already done elsewhere?

Let's say I have a 4 GB video and I'm trying to load it into my video player. The player basically needs to perform 2 main operations:

Buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)

Streaming - displaying the result (the video) in my software (the player)

Why? Because it would be impossible to load everything into memory at once (and we don't even need to: at any specific moment the user only observes the portion of the video that is in the buffer, which is a portion of the entire file).

Second: how can this help us?

We can do the same thing for large files:

  1. Split the main file into smaller files (each file contains X rows, where X is the 'buffer')
  2. Load each small file into Java and group it
  3. Save the result to a new file
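The three steps above could be sketched roughly like this. It is only an illustration under a few assumptions: the names (`ChunkGrouper`, `chunk-N.txt`) are made up, each line is `id`, whitespace, `value`, and the grouped lines are written with a tab between the id and the comma-separated values. Instead of physically splitting the file first, the sketch reads X rows at a time, which amounts to the same thing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical names; a sketch of "split, group, save".
final class ChunkGrouper {

    // Step 2: group one chunk of "id value" rows into "id<TAB>v1,v2" lines,
    // dropping duplicate values and keeping first-seen order.
    static List<String> groupRows(List<String> rows) {
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String row : rows) {
            String[] parts = row.trim().split("\\s+", 2);
            grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return out;
    }

    // Steps 1 and 3: read rowsPerChunk rows at a time and write one grouped file per chunk.
    static List<Path> splitAndGroup(Path input, Path outDir, int rowsPerChunk) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        List<String> buffer = new ArrayList<>(rowsPerChunk);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                buffer.add(line);
                if (buffer.size() == rowsPerChunk) {
                    chunkFiles.add(flush(buffer, outDir, chunkFiles.size()));
                }
            }
        }
        if (!buffer.isEmpty()) { // last, partially filled chunk
            chunkFiles.add(flush(buffer, outDir, chunkFiles.size()));
        }
        return chunkFiles;
    }

    // Writes the grouped buffer to chunk-<index>.txt and empties the buffer.
    private static Path flush(List<String> buffer, Path outDir, int index) throws IOException {
        Path file = outDir.resolve("chunk-" + index + ".txt");
        Files.write(file, groupRows(buffer), StandardCharsets.UTF_8);
        buffer.clear();
        return file;
    }
}
```

The value of `rowsPerChunk` is whatever comfortably fits in the heap; only one chunk is ever held in memory at a time.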

After this process we have many small files that contain information like this:

id1   valuea,valueb
id2   valuea
id3   valuec,valuea

So each grouped file contains fewer rows than the original small file it was derived from.

  1. We can now merge them, try to load the result into Java, and re-group everything
  2. If the process fails (still too big), we can merge the small grouped files into several larger grouped files (and repeat the process)
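A rough sketch of that merge-and-repeat loop is below. To keep it short, each grouped file is represented as an in-memory `List<String>` of `id<TAB>v1,v2` lines; the class name, the tab separator, and the `fanIn` parameter (how many grouped files are merged at once) are assumptions made for the example, and values are assumed not to contain commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical names; lists stand in for the grouped files on disk.
final class GroupedMerger {

    // Step 1: merge several grouped chunks and re-group them by id.
    static List<String> mergeGrouped(List<List<String>> groupedChunks) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (List<String> chunk : groupedChunks) {
            for (String line : chunk) {
                String[] parts = line.split("\t", 2);
                merged.computeIfAbsent(parts[0], k -> new LinkedHashSet<>())
                      .addAll(Arrays.asList(parts[1].split(",")));
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : merged.entrySet()) {
            out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return out;
    }

    // Step 2: merge fanIn chunks at a time and repeat until one grouped chunk is left.
    static List<String> mergeInRounds(List<List<String>> chunks, int fanIn) {
        if (fanIn < 2) {
            throw new IllegalArgumentException("fanIn must be at least 2");
        }
        List<List<String>> current = chunks;
        while (current.size() > 1) {
            List<List<String>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += fanIn) {
                int end = Math.min(i + fanIn, current.size());
                next.add(mergeGrouped(current.subList(i, end)));
            }
            current = next;
        }
        return current.isEmpty() ? new ArrayList<>() : current.get(0);
    }
}
```

Each round only shrinks the data as much as ids actually repeat across the merged files, so how well this works depends on how many ids are duplicated.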
