pdf extraction - iTextSharp extract each character and getRectangle -


I would like to parse an entire PDF character by character and be able to get the ASCII value, the font, and the rectangle of each character on the PDF document, which I can later use to save it as a bitmap. I tried using PdfTextExtractor.GetTextFromPage, but that gives the entire text in the PDF as a string.

The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy used by default by the PdfTextExtractor.GetTextFromPage overload without a strategy argument) only allow direct access to the collected plain text, not to positions.

Chris Haas' MyLocationTextExtractionStrategy

@Chris Haas in his old answer here presents the following extension of the LocationTextExtractionStrategy

    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
        //Hold each coordinate
        public List<RectAndText> myPoints = new List<RectAndText>();

        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo) {
            base.RenderText(renderInfo);

            //Get the bounding box for the chunk of text
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            //Add this to our main collection
            this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
        }
    }

which makes use of this helper class

    //Helper class that stores our rectangle and text
    public class RectAndText {
        public iTextSharp.text.Rectangle Rect;
        public string Text;
        public RectAndText(iTextSharp.text.Rectangle rect, string text) {
            this.Rect = rect;
            this.Text = text;
        }
    }

This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints, which you can access like this:

    //Create an instance of our strategy
    var t = new MyLocationTextExtractionStrategy();

    //Parse page 1 of the document above
    using (var r = new PdfReader(testFile)) {
        var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
    }

    //Loop through each chunk found
    foreach (var p in t.myPoints) {
        Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
    }
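As an aside, since the question intends to use the rectangles for saving bitmaps: the coordinates are measured in PDF units (1/72 inch in default user space) with the y axis pointing up, whereas bitmap pixels usually have their origin at the top left. The following is only a minimal sketch of that conversion, assuming an unrotated page whose lower-left corner is at the origin. PixelRect, PdfToPixel.Convert, pageHeight and dpi are hypothetical names chosen for illustration and are not part of iTextSharp.

```csharp
using System;

// Hypothetical helper type: a rectangle in bitmap pixel coordinates (origin top-left).
public struct PixelRect
{
    public int X, Y, Width, Height;
}

public static class PdfToPixel
{
    // Converts a rectangle given in PDF points (origin bottom-left, y pointing up)
    // to pixel coordinates (origin top-left, y pointing down) for a bitmap
    // rendered at the given resolution.
    // Assumption: unrotated page whose lower-left corner is at (0, 0).
    public static PixelRect Convert(float left, float bottom, float right, float top,
                                    float pageHeight, float dpi)
    {
        float scale = dpi / 72f; // 72 points per inch

        // Round outwards so the pixel rectangle fully covers the glyph box.
        int x = (int)Math.Floor(left * scale);
        int y = (int)Math.Floor((pageHeight - top) * scale);
        int x2 = (int)Math.Ceiling(right * scale);
        int y2 = (int)Math.Ceiling((pageHeight - bottom) * scale);

        return new PixelRect { X = x, Y = y, Width = x2 - x, Height = y2 - y };
    }
}
```

With the classes above this could be called as PdfToPixel.Convert(p.Rect.Left, p.Rect.Bottom, p.Rect.Right, p.Rect.Top, pageHeight, dpi), where pageHeight might be taken from the reader's GetPageSize(1).Height.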

For your task to parse an entire PDF character by character and be able to get the ASCII value, the font, and the rectangle of each character, two details are wrong with this strategy:

  • the text chunks returned may contain multiple characters
  • the font information is not provided.

Thus, we have to tweak that a bit:

A new CharLocationTextExtractionStrategy

In addition to what the MyLocationTextExtractionStrategy class does, the CharLocationTextExtractionStrategy splits the input by glyph and also provides the font name:

    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        //Hold each coordinate
        public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo wholeRenderInfo)
        {
            base.RenderText(wholeRenderInfo);

            //Split the chunk into one render info per character
            foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
            {
                //Get the bounding box for the single character
                var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
                var topRight = renderInfo.GetAscentLine().GetEndPoint();

                //Create a rectangle from it
                var rect = new iTextSharp.text.Rectangle(
                    bottomLeft[Vector.I1],
                    bottomLeft[Vector.I2],
                    topRight[Vector.I1],
                    topRight[Vector.I2]
                );

                //Add this to our main collection
                this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
            }
        }
    }

    //Helper class that stores our rectangle, text, and font
    public class RectAndTextAndFont
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;
        public string Font;
        public RectAndTextAndFont(iTextSharp.text.Rectangle rect, string text, string font)
        {
            this.Rect = rect;
            this.Text = text;
            this.Font = font;
        }
    }

Using that strategy like this

    CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

    using (var pdfReader = new PdfReader(testFile))
    {
        PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
    }

    foreach (var p in strategy.myPoints)
    {
        Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
    }

you get the information character by character, including the font.
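The question also asks for the ASCII value of each character. The Text field holds a .NET string, so a character code can be derived from it. The following is a minimal sketch, assuming each entry holds the text of one glyph, which may still be empty or consist of more than one char. CharCodes and its methods are hypothetical helper names for illustration, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class CharCodes
{
    // Returns the Unicode code points of the given text, in order.
    // A valid surrogate pair is combined into one code point; an
    // unpaired surrogate is returned as its raw UTF-16 value.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(text)) return result;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                result.Add(char.ConvertToUtf32(c, text[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add((int)c);
            }
        }
        return result;
    }

    // True if the code point lies in the 7-bit ASCII range.
    public static bool IsAscii(int codePoint)
    {
        return codePoint >= 0 && codePoint <= 127;
    }
}
```

Inside the loop above, CharCodes.CodePoints(p.Text) would then give the code points for each entry, and CharCodes.IsAscii tells whether a given value is a plain ASCII code at all.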

