parsing - Lexer/Parser design for data file


I am writing a small program that needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files, and I have run into a problem.

I am working in a language that doesn't have libraries for this sort of thing, and I wouldn't mind the exercise, so I am planning on implementing the lexer and parser by hand. I would like to implement a lexer based on a simple design.

The input file I need to interpret has a section that contains chemical reactions. The different chemical species on each side of a reaction are separated by '+' signs, but the names of the species can also have '+' characters in them (symbolizing electric charge). For example:

n2+o2=>no+no
n2++o2-=>no+no
n2+ + o2- => no + no

are all valid, and the tokens output by the lexer should be

'n2' '+' 'o2' '=>' 'no' '+' 'no'
'n2+' '+' 'o2-' '=>' 'no' '+' 'no'
'n2+' '+' 'o2-' '=>' 'no' '+' 'no'

(note that the last two are identical). I would like to avoid look-ahead in the lexer for simplicity. The problem is that the lexer would start reading any of the above inputs, but when it got to the 3rd character (the first '+'), it wouldn't have any way to know whether it was part of the species name or the separator between reactants.

To fix this, I thought I would just split off the '+' so that the second and third examples above would output:

'n2' '+' '+' 'o2-' '=>' 'no' '+' 'no' 

The parser would then use context, realize that two '+' tokens in a row mean the first is part of the previous species name, and correctly handle all three of the above cases. The problem with this is that now imagine I try to lex/parse

n2 + + o2- => no + no 

(note the space between 'n2' and the first '+'). This is invalid syntax. However, the lexer I just described would output exactly the same tokens as for the second and third examples, and the parser wouldn't be able to catch the error.
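To make the ambiguity concrete, here is a minimal Python sketch of the "split off every '+'" lexer described above. The function name `naive_tokens` is made up for illustration, and treating '-' as an ordinary name character is an assumption. It only demonstrates that the valid and the invalid input collapse to the same token list once whitespace is discarded.

```python
def naive_tokens(line):
    """Split every '+' into its own token and discard whitespace (no look-ahead)."""
    tokens, name, i = [], "", 0
    while i < len(line):
        ch = line[i]
        if ch.isalnum() or ch == "-":
            name += ch          # assumption: '-' only ever appears inside a name
            i += 1
            continue
        if name:                # any other character ends the current name
            tokens.append(name)
            name = ""
        if ch == "+":
            tokens.append("+")
            i += 1
        elif line.startswith("=>", i):
            tokens.append("=>")
            i += 2
        elif ch.isspace():
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if name:
        tokens.append(name)
    return tokens


valid = naive_tokens("n2+ + o2- => no + no")
invalid = naive_tokens("n2 + + o2- => no + no")
print(valid == invalid)  # the parser can no longer tell the two inputs apart
```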

So the possible solutions, as I see it:

  • implement a lexer with at least one character of look-ahead
  • include tokens for whitespace
  • include the leading white space in the '+' token
  • create a "combined" token that includes both the species name and any trailing '+' without white space between them, then let the parser sort out whether the '+' is actually part of the name or not.
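As an illustration of the last option, the following Python sketch keeps any '+' or '-' that is glued to a name inside a single NAME token and lets a parser-side pass decide what the final '+' means. The names `lex_combined` and `resolve`, the token kinds, and the rule "a trailing '+' is a separator only when another name follows directly" are all assumptions made for this sketch, not something taken from the input format's specification.

```python
def lex_combined(line):
    """Lexer for the 'combined token' idea: a name keeps any '+'/'-' glued to it."""
    tokens, i, n = [], 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum():
            start = i
            while i < n and line[i].isalnum():
                i += 1
            while i < n and line[i] in "+-":   # trailing signs, no space allowed
                i += 1
            tokens.append(("NAME", line[start:i]))
        elif ch == "+":
            tokens.append(("PLUS", "+"))
            i += 1
        elif line.startswith("=>", i):
            tokens.append(("ARROW", "=>"))
            i += 2
        else:
            raise ValueError(f"unexpected character {ch!r} at column {i}")
    return tokens


def resolve(tokens):
    """Parser-side pass: decide what each trailing '+' means, then check the order."""
    out = []
    for k, (kind, text) in enumerate(tokens):
        next_kind = tokens[k + 1][0] if k + 1 < len(tokens) else None
        if kind == "NAME" and text.endswith("+") and next_kind == "NAME":
            out += [text[:-1], "+"]    # the last '+' was really the separator
        else:
            out.append(text)
    expect_species = True              # species and separators must alternate
    for tok in out:
        is_species = tok not in ("+", "=>")
        if is_species != expect_species:
            raise ValueError(f"unexpected token {tok!r}")
        expect_species = not expect_species
    if expect_species:
        raise ValueError("reaction ends with a separator")
    return out
```

With this split, `n2 + + o2- => no + no` produces two PLUS tokens in a row, so the alternation check rejects it, while the three valid spellings resolve to the expected token lists.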

Since I am very new to this kind of programming, I was hoping someone could comment on my proposed solutions (or suggest another). My main reservation about the first solution is that I simply do not know how much more complicated implementing a lexer with look-ahead is.

You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.

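string gettoken()
{
  string token = getalphanumeric(); // assumed to ignore (eat) white-space
  var ch = getchar();               // assumed to ignore (eat) white-space
  if (ch == '+')
  {
    var ch2 = getchar();
    if (ch2 == '+')
      token += '+';   // a second '+' follows, so the first one belongs to the name
    else
      putchar(ch2);   // not a '+': push it back
  }
  putchar(ch);        // push-back is assumed to be last-in, first-out, so ch is read again before ch2
  return token;
}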
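A runnable Python sketch of the same push-back idea follows. The `Reader` class, `get_token`, and `tokenize` are hypothetical helpers written for this example. It deviates from the pseudo-code in two assumed ways: white-space between a name and the character right after it is not skipped, so that `n2 + +` is not silently read as `n2+ +`, and '-' is treated as part of a name. Like the pseudo-code, it treats a glued '+' as a charge only when another '+' follows.

```python
class Reader:
    """Character source with push-back (hypothetical helper for this sketch)."""

    def __init__(self, text):
        self.chars = list(reversed(text))   # pop() yields the next character

    def getchar(self, skip_space=False):
        while self.chars:
            ch = self.chars.pop()
            if not (skip_space and ch.isspace()):
                return ch
        return ""                           # empty string marks end of input

    def putchar(self, ch):
        if ch:
            self.chars.append(ch)


def get_token(reader):
    ch = reader.getchar(skip_space=True)
    if ch == "":
        return None
    if ch == "+":
        return "+"
    if ch == "=":
        if reader.getchar() != ">":
            raise ValueError("expected '=>'")
        return "=>"
    if not ch.isalnum():
        raise ValueError(f"unexpected character {ch!r}")
    token = ch
    ch = reader.getchar()
    while ch.isalnum() or ch == "-":        # assumption: '-' only occurs as a charge
        token += ch
        ch = reader.getchar()
    if ch == "+":                           # '+' glued to the name: charge or separator?
        ch2 = reader.getchar(skip_space=True)
        reader.putchar(ch2)
        if ch2 == "+":
            token += "+"                    # a separator follows, so this one is a charge
        else:
            reader.putchar(ch)              # it was the separator: give it back
    else:
        reader.putchar(ch)
    return token


def tokenize(line):
    reader, tokens = Reader(line), []
    while (tok := get_token(reader)) is not None:
        tokens.append(tok)
    return tokens
```

For the invalid input `n2 + + o2- => no + no` this lexer emits two consecutive '+' tokens, which leaves the error visible for the parser to reject.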
