python - xml.sax parser and line numbers etc.


The task is to parse a simple XML document and analyze the contents by line number.

The right Python package for this seems to be xml.sax. But how do I use it?

After some digging in the documentation, I found:

  • The xmlreader.Locator interface has the information: getLineNumber().
  • The handler.ContentHandler interface has setDocumentLocator().

My first thought was to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.

But xmlreader.Locator is only a skeleton interface, and its position methods can only return -1. So what is a poor user to do, short of writing a whole Parser and Locator of their own??

I'll answer my own question presently.
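To see what the skeleton interface reports, here is a minimal sketch (written for Python 3, unlike the Python 2 listings in this post). The variable names are only illustrative.

```python
from xml.sax.xmlreader import Locator

# The base class is a placeholder: it is not attached to any parser,
# so its position methods fall back to a default value.
base = Locator()
line = base.getLineNumber()
column = base.getColumnNumber()
print(line, column)
```

A locator that knows the real position therefore has to come from the parser itself.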


(Well, I would have, except for the arbitrary, annoying rule that says I can't.)


I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).

The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py. And...it has its own Locator, an ExpatLocator. But there is no access to this thing. Much head-scratching came between this and a solution.
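A quick way to check which parser class make_parser() returns on a given installation is sketched below (Python 3; the result assumes a standard CPython build with no alternative SAX driver configured, and `is_expat` is just an illustrative name).

```python
import xml.sax
import xml.sax.expatreader

parser = xml.sax.make_parser()

# Show where the parser class is defined and what it is called.
print(type(parser).__module__, type(parser).__name__)

# Assumption: the default driver is the expat-based reader.
is_expat = isinstance(parser, xml.sax.expatreader.ExpatParser)
print(is_expat)
```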

  1. Write your own ContentHandler, which knows about a Locator and uses it to determine line numbers.
  2. Create an ExpatParser with xml.sax.make_parser().
  3. Create an ExpatLocator, passing it the ExpatParser instance.
  4. Make the ContentHandler, giving it this ExpatLocator.
  5. Pass the ContentHandler to the parser's setContentHandler().
  6. Call parse() on the parser.

For example:

    import sys
    import xml.sax
    import xml.sax.expatreader  # explicit import: ExpatLocator is defined here

    class EltHandler( xml.sax.handler.ContentHandler ):
        def __init__( self, locator ):
            xml.sax.handler.ContentHandler.__init__( self )
            self.loc = locator
            self.setDocumentLocator( self.loc )

        def startElement( self, name, attrs ): pass

        def endElement( self, name ): pass

        def characters( self, data ):
            # Ask the locator where the parser currently is.
            lineno = self.loc.getLineNumber()
            print >> sys.stdout, "line", lineno, data

    def spit_lines( filepath ):
        try:
            parser = xml.sax.make_parser()
            # Bind a locator to the parser instance by hand.
            locator = xml.sax.expatreader.ExpatLocator( parser )
            handler = EltHandler( locator )
            parser.setContentHandler( handler )
            parser.parse( filepath )
        except IOError as e:
            print >> sys.stderr, e

    if len( sys.argv ) > 1:
        filepath = sys.argv[1]
        spit_lines( filepath )
    else:
        print >> sys.stderr, "Try providing a path to an XML file."

Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is properly called, then it turns out that a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator.

Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it's not documented anywhere, and using an undocumented private variable is sloppy.

Thanks Martijn!

The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler, then you can count on those 4 methods being present on the locator.

The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide it.

If you subclass xml.sax.handler.ContentHandler() then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called your content handler instance will have self._locator set:
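One way to receive the parser's locator without creating one yourself is to override the documented setDocumentLocator() hook. The following is only a sketch, written for Python 3; the class name, the `lines` list and the sample document are all hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LocatorCapturingHandler(ContentHandler):
    """Hypothetical handler that keeps whatever locator the parser offers."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.locator = None  # stays None if the parser never supplies one
        self.lines = []

    def setDocumentLocator(self, locator):
        # Called by the parser (if it supports locators) before parsing starts.
        ContentHandler.setDocumentLocator(self, locator)
        self.locator = locator

    def characters(self, content):
        # Record only non-whitespace text, tagged with the current line.
        if self.locator is not None and content.strip():
            self.lines.append((self.locator.getLineNumber(), content))


SAMPLE = b"<root>\n  <item>one</item>\n  <item>two</item>\n</root>\n"

handler = LocatorCapturingHandler()
xml.sax.parseString(SAMPLE, handler)
for lineno, text in handler.lines:
    print(lineno, text)
```

Because the parser is not required to supply a locator, the handler checks for None before using it.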

    from xml.sax.handler import ContentHandler

    class MyContentHandler(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            # initialize your handler

        def startElement(self, name, attrs):
            # self._locator is set by the inherited setDocumentLocator()
            loc = self._locator
            if loc is not None:
                line, col = loc.getLineNumber(), loc.getColumnNumber()
            else:
                line, col = 'unknown', 'unknown'
            print 'start of {} element at line {}, column {}'.format(name, line, col)
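To try this approach end to end, here is a rough Python 3 sketch that parses a small in-memory document and collects the start position of each element. The handler name, the `positions` list and the sample XML are made up for illustration, and the code still relies on the undocumented _locator attribute.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineTrackingHandler(ContentHandler):
    """Hypothetical handler that records where each element starts."""

    def __init__(self):
        ContentHandler.__init__(self)  # initializes self._locator to None
        self.positions = []

    def startElement(self, name, attrs):
        loc = self._locator  # undocumented; filled in via setDocumentLocator()
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = None, None
        self.positions.append((name, line, col))


SAMPLE = b"<root>\n  <item>one</item>\n  <item>two</item>\n</root>\n"

handler = LineTrackingHandler()
xml.sax.parseString(SAMPLE, handler)
for name, line, col in handler.positions:
    print("start of {} element at line {}, column {}".format(name, line, col))
```

Storing the positions in a list rather than printing them inside the handler keeps the analysis step separate from parsing.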
