
 




Parsing/Reading a PDF file with C# and Asp.Net to text

by naspinski 3/16/2009 1:19:00 PM

PDFs are a ubiquitous and useful file type, but they can be a pain to work with programmatically.

PDFs are used extensively in my organization, and people always want programs that will extract information from them. It can be very difficult to get the information they want due to the strange format, but sometimes it's a necessity. Here is how to get a PDF into text; from there you are on your own!

download the necessary files

There is always more than one way to skin a cat when it comes to programming, but the easiest way I have found for PDFs is to use the fantastic open-source project PDFBox. The download is good for all sorts of platforms, but you only need a few parts to use it with Asp.Net and C#.

*Now keep in mind the version numbers I show here may change, but the process should stay the same.

what you need

  • Just a few things; pull the following files into your bin folder:
    • FontBox-0.1.0-dev.dll
    • IKVM.GNU.Classpath.dll
    • IKVM.Runtime.dll
    • PDFBox-0.7.3.dll

  • Now make sure you add a couple of references to your project. It is a bit of a strange process, so follow it closely. First, add this reference:
    • IKVM.GNU.Classpath

  • Then build the project, as the next reference requires the previous one to be built. Then add this reference:
    • PDFBox-0.7.3

  • Then build again. There may be a better way to do this, but this is what the documentation said and it worked, so I won't mess with it. Now all you need to do is add the following using statements to any code file that needs to use the parser:
    using org.pdfbox.pdmodel;
    using org.pdfbox.util;

all set, now use it

This quick snippet shows how to use the library to take in a PDF file and output its text to a .txt file. The inputs are the string 'pdf_in', which is the path to the PDF to parse, and the string 'txt_out', which is the path to the output text file. You can easily modify it to take its input from a Stream or a FileUpload, or to send the output somewhere else, but this should get the idea across.
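// Also requires: using System; using System.IO;
public void parsePDF(string pdf_in, string txt_out)
{
    PDDocument doc = null;
    try
    {
        // 'using' closes and disposes the writer even if an exception is thrown
        using (StreamWriter sw = new StreamWriter(txt_out, false))
        {
            // Blank line and timestamp first, then the extracted text
            sw.WriteLine();
            sw.WriteLine(DateTime.Now.ToString());
            doc = PDDocument.load(pdf_in);
            PDFTextStripper stripper = new PDFTextStripper();
            sw.Write(stripper.getText(doc));
        }
    }
    // Response is available because this method lives in an Asp.Net page
    catch (Exception ex) { Response.Write(ex.Message); }
    finally
    {
        // Release the PDF file once the text has been extracted
        if (doc != null) doc.close();
    }
}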

And there you have it: your PDF is now a text file (most likely an ugly, difficult-to-parse one) with your PDF data in it. Now it's up to you to figure out how to use it. As you will see, PDFs can (though not always) come out very strangely as text; tables will often be in an odd order, for example, and it is a new adventure each time to engineer an effective and accurate parsing scheme, which will be very case-specific. Normally I would offer a download, but these files are pretty big, so I will leave it to the guys at SourceForge.
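
As a rough starting point for that case-specific parsing, here is a minimal sketch of pulling one labeled value out of the extracted text. It uses only the standard .NET Regex class and does not touch PDFBox. The class name 'ExtractedTextParser', the method 'FindField' and the "Label: value" line layout are assumptions made for illustration; your own PDFs may come out looking nothing like this.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFBox: looks for a line of the form "Label: value".
public static class ExtractedTextParser
{
    // Returns the value after "label:" on the first matching line, or null if no line matches.
    public static string FindField(string text, string label)
    {
        // Regex.Escape keeps characters such as '#' or '.' in the label literal.
        // [ \t]* only skips spaces and tabs, so a match never runs on to the next line.
        string pattern = @"^[ \t]*" + Regex.Escape(label) + @"[ \t]*:[ \t]*(.+?)[ \t\r]*$";
        Match m = Regex.Match(text, pattern,
            RegexOptions.Multiline | RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

You could read the output file back in with File.ReadAllText(txt_out) and call something like ExtractedTextParser.FindField(text, "Invoice Number"), where "Invoice Number" is just a made-up label. Tables and multi-column layouts will usually need something more involved than a single regular expression.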


c# | tutorials