PDF extraction - iTextSharp: extract each character and get its rectangle
I would like to parse an entire PDF character by character and be able to get the ASCII value, the font, and the rectangle of each character on the PDF document, which I can later use to save it as a bitmap. I tried using PdfTextExtractor.GetTextFromPage, but that gives the entire text in the PDF as a string.
The text extraction strategies bundled with iTextSharp (in particular the LocationTextExtractionStrategy, which is used by default by the PdfTextExtractor.GetTextFromPage overload without a strategy argument) only allow direct access to the collected plain text, not to positions.
Chris Haas' MyLocationTextExtractionStrategy
In an older answer, @Chris Haas presents the following extension of LocationTextExtractionStrategy:
    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        // Holds the rectangle and text of each chunk
        public List<RectAndText> myPoints = new List<RectAndText>();

        // Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);

            // Get the bounding box for the chunk of text
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            // Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            // Add this to our main collection
            this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
        }
    }
which makes use of this helper class:
    // Helper class that stores our rectangle and text
    public class RectAndText
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;

        public RectAndText(iTextSharp.text.Rectangle rect, string text)
        {
            this.Rect = rect;
            this.Text = text;
        }
    }
This strategy makes the text chunks and their enclosing rectangles available in the public member List<RectAndText> myPoints.
You can access it like this:
    // Create an instance of our strategy
    var t = new MyLocationTextExtractionStrategy();

    // Parse page 1 of the document (testFile is the path to the PDF)
    using (var r = new PdfReader(testFile))
    {
        var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
    }

    // Loop through each chunk found
    foreach (var p in t.myPoints)
    {
        Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
    }
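Since the question mentions saving each rectangle as a bitmap later, here is a minimal sketch of mapping such a rectangle from PDF coordinates to pixel coordinates. It assumes an unrotated page whose lower-left corner is at (0, 0), default user units of 1/72 inch, and a render resolution of your choosing. PixelRect, PdfToPixels and ToPixels are hypothetical names for illustration, not part of iTextSharp.

```csharp
using System;

// Hypothetical helper type for illustration; not part of iTextSharp.
public struct PixelRect
{
    public int X, Y, Width, Height;
}

public static class PdfToPixels
{
    // Maps a rectangle given in PDF points (origin bottom-left, y pointing up)
    // to pixel coordinates (origin top-left, y pointing down).
    // Assumption: unrotated page, lower-left page corner at (0, 0), 72 points per inch.
    public static PixelRect ToPixels(float left, float bottom, float right, float top,
                                     float pageHeight, float dpi)
    {
        float scale = dpi / 72f;
        return new PixelRect
        {
            X = (int)Math.Floor(left * scale),
            // Flip the y axis: the top edge in PDF space becomes the pixel row.
            Y = (int)Math.Floor((pageHeight - top) * scale),
            Width = (int)Math.Ceiling((right - left) * scale),
            Height = (int)Math.Ceiling((top - bottom) * scale)
        };
    }
}
```

With the strategy above, the arguments would come from p.Rect.Left, p.Rect.Bottom, p.Rect.Right and p.Rect.Top, and the page height from the reader's GetPageSize(1).Height. Pages with a rotation or a shifted crop box need extra handling that this sketch leaves out.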
For your task, to parse an entire PDF character by character and get the ASCII value, the font, and the rectangle of each character, two details of this strategy fall short:
- the text chunks returned may contain multiple characters;
- the font information is not provided.
Thus, we have to tweak it a bit:
A new CharLocationTextExtractionStrategy
In addition to what the MyLocationTextExtractionStrategy class does, the CharLocationTextExtractionStrategy class splits the input by glyph and also provides the font name:
    public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        // Holds the rectangle, text and font name of each character
        public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

        // Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo wholeRenderInfo)
        {
            base.RenderText(wholeRenderInfo);

            // Split the chunk into one TextRenderInfo per character
            foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
            {
                // Get the bounding box for the character
                var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
                var topRight = renderInfo.GetAscentLine().GetEndPoint();

                // Create a rectangle from it
                var rect = new iTextSharp.text.Rectangle(
                    bottomLeft[Vector.I1],
                    bottomLeft[Vector.I2],
                    topRight[Vector.I1],
                    topRight[Vector.I2]
                );

                // Add this to our main collection
                this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
            }
        }
    }

    // Helper class that stores our rectangle, text, and font name
    public class RectAndTextAndFont
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;
        public string Font;

        public RectAndTextAndFont(iTextSharp.text.Rectangle rect, string text, string font)
        {
            this.Rect = rect;
            this.Text = text;
            this.Font = font;
        }
    }
Use this strategy like this:
    CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

    using (var pdfReader = new PdfReader(testFile))
    {
        PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
    }

    foreach (var p in strategy.myPoints)
    {
        Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
    }
This gives you the information character by character, including the font name.
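The question also asks for the ASCII value of each character. The extracted text is a .NET string, so the character code can be read directly from it. The following is only a minimal sketch: CharCodes, FirstCodePoint and IsAscii are hypothetical helper names, not iTextSharp API, and since extracted text is Unicode, the value may well lie outside the ASCII range.

```csharp
using System;

// Hypothetical helpers for illustration; not part of iTextSharp.
public static class CharCodes
{
    // Returns the Unicode code point of the first character in the text,
    // or -1 if the text is null or empty.
    public static int FirstCodePoint(string text)
    {
        if (string.IsNullOrEmpty(text)) return -1;

        // A well-formed surrogate pair encodes a single code point above U+FFFF.
        if (char.IsSurrogatePair(text, 0)) return char.ConvertToUtf32(text, 0);

        // Otherwise the UTF-16 code unit is the value itself.
        return text[0];
    }

    // True when the code point lies in the 7-bit ASCII range.
    public static bool IsAscii(int codePoint)
    {
        return codePoint >= 0 && codePoint <= 127;
    }
}
```

Inside the loop over strategy.myPoints, this could be called as CharCodes.FirstCodePoint(p.Text). Whether the result is meaningful depends on the font's encoding information in the PDF; text that cannot be mapped to Unicode may come back empty or garbled.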