Python: the xml.sax parser and line numbers
The task is to parse a simple XML document and analyze its contents by line number.
The right Python package seems to be xml.sax. But how do I use it?
After some digging in the documentation, I found:
- The xmlreader.Locator interface has the information: getLineNumber().
- The handler.ContentHandler interface has setDocumentLocator().
My first thought was to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.
But xmlreader.Locator is only a skeleton interface, and its line and column methods can only return -1. So what is a poor user to do, short of writing a whole Parser and Locator of their own?
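To make the problem concrete, here is a minimal sketch showing what the base Locator class gives you when instantiated directly. The variable names are only for illustration.

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface; it is not attached to any parser.
base = Locator()

# Position queries fall back to the "unknown" value of -1.
position = (base.getLineNumber(), base.getColumnNumber())

# The identifier queries fall back to None.
ids = (base.getPublicId(), base.getSystemId())

print(position, ids)
```

So a hand-made Locator() is useless for line numbers. Something tied to the running parser is needed.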
I'll answer my own question presently.
(Well, I would have, except for the arbitrary, annoying rule that says I can't.)
I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).
The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py. And...it has its own Locator, an ExpatLocator. But there is no access to this thing. Much head-scratching came between this and a solution.
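A quick way to check this without reading the source is to ask the parser what it is. This is a small sketch that assumes a stock CPython install, where expat is the default driver. Other installations may return a different parser class.

```python
import xml.sax
import xml.sax.expatreader

parser = xml.sax.make_parser()

# Report the concrete class behind the generic make_parser() call.
print(type(parser).__module__, type(parser).__name__)

# Assumption: the default driver is expat, as on a standard CPython build.
is_expat = isinstance(parser, xml.sax.expatreader.ExpatParser)
print("ExpatParser:", is_expat)
```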
- Write your own ContentHandler, which knows about a Locator and uses it to determine line numbers.
- Create an ExpatParser with xml.sax.make_parser().
- Create an ExpatLocator, passing it the ExpatParser instance.
- Make the ContentHandler, giving it this ExpatLocator.
- Pass the ContentHandler to the parser's setContentHandler().
- Call parse() on the Parser.
For example:

    import sys
    import xml.sax
    import xml.sax.expatreader

    class EltHandler(xml.sax.handler.ContentHandler):
        def __init__(self, locator):
            xml.sax.handler.ContentHandler.__init__(self)
            self.loc = locator
            self.setDocumentLocator(self.loc)

        def startElement(self, name, attrs):
            pass

        def endElement(self, name):
            pass

        def characters(self, data):
            # The locator reports the parser's current position.
            lineno = self.loc.getLineNumber()
            print("line %s %s" % (lineno, data))

    def spit_lines(filepath):
        try:
            parser = xml.sax.make_parser()
            # ExpatLocator reads its position from the parser it is given.
            locator = xml.sax.expatreader.ExpatLocator(parser)
            handler = EltHandler(locator)
            parser.setContentHandler(handler)
            parser.parse(filepath)
        except IOError as e:
            sys.stderr.write("%s\n" % e)

    if len(sys.argv) > 1:
        filepath = sys.argv[1]
        spit_lines(filepath)
    else:
        sys.stderr.write("Try providing a path to an XML file.\n")
Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is called, then it turns out a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator once parsing has started.
Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it's nowhere documented, and using an undocumented private variable is sloppy.
Thanks, Martijn!
The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler then you can count on those 4 methods being present on the locator.
The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide it.
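Since the parser hands the locator to the handler through setDocumentLocator(), one option is to override that method and keep the locator in an attribute of your own. What follows is only a sketch of that idea: the class name LineAwareHandler, the attribute names, and the SAMPLE document are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LineAwareHandler(ContentHandler):
    """Hypothetical handler that keeps the parser-supplied locator."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.locator = None        # filled in by the parser, if it provides one
        self.element_lines = []    # (element name, line number) pairs

    def setDocumentLocator(self, locator):
        # Called by the parser before any other document event.
        ContentHandler.setDocumentLocator(self, locator)
        self.locator = locator

    def startElement(self, name, attrs):
        # A parser is not required to supply a locator, so guard against None.
        line = self.locator.getLineNumber() if self.locator is not None else None
        self.element_lines.append((name, line))

# Illustrative two-element document, parsed from memory.
SAMPLE = b"<root>\n  <child/>\n</root>"

handler = LineAwareHandler()
xml.sax.parseString(SAMPLE, handler)

for name, line in handler.element_lines:
    print(name, line)
```

This uses only the public callback, and the handler never has to build a locator itself.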
If you subclass xml.sax.handler.ContentHandler then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called your content handler instance will have self._locator set:
    from xml.sax.handler import ContentHandler

    class MyContentHandler(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            # initialize your handler

        def startElement(self, name, attrs):
            loc = self._locator
            if loc is not None:
                line, col = loc.getLineNumber(), loc.getColumnNumber()
            else:
                line, col = 'unknown', 'unknown'
            print('start of {} element at line {}, column {}'.format(name, line, col))
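The timing claim is easy to check with a small probe. In this sketch, which assumes the default expat parser, the LocatorProbe class and its seen_at_start attribute are hypothetical names used only for the experiment.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LocatorProbe(ContentHandler):
    """Hypothetical probe recording whether _locator is set at startDocument()."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.seen_at_start = None

    def startDocument(self):
        # Relies on the undocumented _locator member discussed above.
        self.seen_at_start = self._locator is not None

probe = LocatorProbe()

# Right after construction, no parser has supplied a locator yet.
before = probe._locator is not None

xml.sax.parseString(b"<root/>", probe)

print("locator set before parsing:", before)
print("locator set in startDocument():", probe.seen_at_start)
```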