Python - xml.sax parser and line numbers etc.
The task is to parse a simple XML document and analyze its contents by line number.
The right Python package seems to be xml.sax. But how do I use it?
After some digging in the documentation, I found:
- The xmlreader.Locator interface has the information: getLineNumber().
- The handler.ContentHandler interface has setDocumentLocator().
My first thought was to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.
But xmlreader.Locator is only a skeleton interface, and its line and column methods can only return -1. So what is a poor user to do, short of writing a whole Parser and Locator of their own?
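The skeleton behaviour is easy to check. A minimal sketch, assuming Python 3 and the standard-library xml.sax (the variable name `base` is only for illustration):

```python
from xml.sax.xmlreader import Locator

# The base Locator is only an interface with placeholder return values.
base = Locator()

print(base.getLineNumber())    # placeholder value, not a real line
print(base.getColumnNumber())  # placeholder value, not a real column
print(base.getPublicId())      # no document is associated with it
print(base.getSystemId())
```

So an instance of the base class created by hand carries no position information at all; a useful locator has to come from the parser.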
I'll answer my own question presently.
(Well, I would have, except for the arbitrary, annoying rule that says I can't.)
I was unable to figure this out using the existing documentation (or web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).
The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py. And... it has its own Locator, an ExpatLocator. But there is no obvious access to this thing. Much head-scratching came between this and a solution.
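One way to confirm this without reading the source is to ask the parser object what it is. A small sketch, assuming Python 3 on a standard CPython install where expat is the default driver (`is_expat` is an illustrative name):

```python
import xml.sax
import xml.sax.expatreader

# make_parser() picks the first available driver; normally this is expat.
parser = xml.sax.make_parser()

# Show which module and class the parser actually comes from.
print(type(parser).__module__, type(parser).__name__)

is_expat = isinstance(parser, xml.sax.expatreader.ExpatParser)
print(is_expat)
```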
- Write your own ContentHandler, which knows about a Locator and uses it to determine line numbers
- Create an ExpatParser with xml.sax.make_parser()
- Create an ExpatLocator, passing it the ExpatParser instance
- Make the ContentHandler, giving it this ExpatLocator
- Pass the ContentHandler to the parser's setContentHandler()
- Call parse() on the Parser
For example:

    import sys
    import xml.sax
    import xml.sax.handler      # explicit imports, so the submodules
    import xml.sax.expatreader  # are loaded before they are referenced

    class EltHandler( xml.sax.handler.ContentHandler ):
        def __init__( self, locator ):
            xml.sax.handler.ContentHandler.__init__( self )
            self.loc = locator
            self.setDocumentLocator( self.loc )

        def startElement( self, name, attrs ): pass

        def endElement( self, name ): pass

        def characters( self, data ):
            # Ask the locator where the parser currently is.
            lineNo = self.loc.getLineNumber()
            print >> sys.stdout, "LINE", lineNo, data

    def spit_lines( filepath ):
        try:
            parser = xml.sax.make_parser()
            locator = xml.sax.expatreader.ExpatLocator( parser )
            handler = EltHandler( locator )
            parser.setContentHandler( handler )
            parser.parse( filepath )
        except IOError as e:
            print >> sys.stderr, e

    if len( sys.argv ) > 1:
        filepath = sys.argv[1]
        spit_lines( filepath )
    else:
        print >> sys.stderr, "Try providing a path to an XML file."

Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is properly called, then it turns out that a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator.
Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it's not documented anywhere, and using an undocumented private variable is sloppy.
Thanks, Martijn!
The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler, then you can count on those 4 methods being present on the locator.
The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide it.
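Because the parser hands the locator over through setDocumentLocator(), a handler can also override that method and keep the locator in an attribute of its own, which avoids touching anything private. A minimal sketch, assuming Python 3 and the expat-backed default parser; the class name `LineTrackingHandler`, the attribute names, and the sample document are made up for illustration:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LineTrackingHandler(ContentHandler):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.locator = None       # filled in by the parser, if it supplies one
        self.element_lines = []   # (element name, line number) pairs

    def setDocumentLocator(self, locator):
        # The parser calls this before any other event when it has a locator.
        self.locator = locator

    def startElement(self, name, attrs):
        line = self.locator.getLineNumber() if self.locator is not None else None
        self.element_lines.append((name, line))

SAMPLE = b"<root>\n  <item>a</item>\n  <item>b</item>\n</root>\n"

handler = LineTrackingHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.element_lines)
```

The `None` fallback covers parsers that choose not to supply a locator, since they are not required to.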
If you subclass xml.sax.handler.ContentHandler, it will provide a standard setDocumentLocator() method for you, and by the time .startDocument() is called on the handler, your content handler instance will have self._locator set:
    from xml.sax.handler import ContentHandler

    class MyContentHandler(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            # initialize your handler

        def startElement(self, name, attrs):
            # self._locator is set by the inherited setDocumentLocator()
            loc = self._locator
            if loc is not None:
                line, col = loc.getLineNumber(), loc.getColumnNumber()
            else:
                line, col = 'unknown', 'unknown'
            print 'start of {} element at line {}, column {}'.format(name, line, col)
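The handler above uses the Python 2 print statement. A rough Python 3 port, fed from an in-memory document, might look like the following; the `starts` list and the sample XML are additions for illustration only, and the handler still relies on the undocumented `_locator` attribute:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()  # the base initializer sets self._locator to None
        self.starts = []    # illustrative: remember what was reported

    def startElement(self, name, attrs):
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = "unknown", "unknown"
        self.starts.append((name, line, col))
        print("start of {} element at line {}, column {}".format(name, line, col))

handler = MyContentHandler()
xml.sax.parseString(b"<doc>\n<para>text</para>\n</doc>", handler)
```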