Parsing/Reading a PDF file with C# and Asp.Net to text

by naspinski 3/16/2009 1:19:00 PM

PDFs are a ubiquitous and useful file type, but they can be a pain to work with programmatically.

PDFs are used extensively in my organization, and people always want programs that will extract information from them. The format is strange enough that getting at that information can be very difficult, but sometimes it's a necessity. Here is how to get a PDF into text; from there you are on your own!

download the necessary files

There is always more than one way to skin a cat when it comes to programming, but the easiest way I have found for PDFs is to use the fantastic open-source project PDFBox. The download covers all sorts of platforms, but you only need a few parts to use it with Asp.Net and C#.

*Keep in mind that the version numbers I show here may change, but the process should stay the same.

what you need

  • You only need a few things. Pull the following files into your bin folder:
    • FontBox-0.1.0-dev.dll
    • IKVM.GNU.Classpath.dll
    • IKVM.Runtime.dll
    • PDFBox-0.7.3.dll

  • Now add a couple of references to your project. It is a bit of a strange process, so follow it closely. First, add this reference:
    • IKVM.GNU.Classpath

  • Then build the project, because the next reference requires the previous one to be built. After that, add this reference:
    • PDFBox-0.7.3

  • Then build again. There may be a better way to do this, but this is what the documentation said and it worked, so I won't mess with it. Now all you need to do is add the following using statements to any code file that needs to use the parser:
    using org.pdfbox.pdmodel;
    using org.pdfbox.util;

all set, now use it

This quick snippet shows how to take in a PDF file and output it to a .txt file. The inputs are the string 'pdf_in', which is the path to the PDF to parse, and the string 'txt_out', which is the path to the output text file. You can easily modify it to take in a Stream, read from a FileUpload, or output some other way, but this should get the idea across.
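// Also needs: using System; using System.IO;
public void parsePDF(string pdf_in, string txt_out)
{
    PDDocument doc = null;
    try
    {
        // The using block flushes and closes the writer, even if parsing fails.
        using (StreamWriter sw = new StreamWriter(txt_out, false))
        {
            // Header: a blank line followed by a timestamp.
            sw.WriteLine();
            sw.WriteLine(DateTime.Now.ToString());
            doc = PDDocument.load(pdf_in);
            PDFTextStripper stripper = new PDFTextStripper();
            sw.Write(stripper.getText(doc));
        }
    }
    // Response is only available inside an Asp.Net page or control.
    catch (Exception ex) { Response.Write(ex.Message); }
    finally
    {
        // Release the PDF document once the text has been extracted.
        if (doc != null) doc.close();
    }
}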
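
As a rough sketch of the Stream/FileUpload variation mentioned above, the same two PDFBox calls can be fed from bytes already in memory instead of a file path. The class and method names here (PdfTextSketch, ExtractText) are made up for illustration. The sketch assumes that PDDocument.load accepts a java.io.InputStream in this PDFBox build and that IKVM exposes java.io.ByteArrayInputStream with a byte[] constructor, so check both against the versions you have.

```csharp
using org.pdfbox.pdmodel;
using org.pdfbox.util;

public static class PdfTextSketch
{
    // Hypothetical helper (not part of PDFBox): returns the text of a PDF
    // held in memory, e.g. the bytes of an uploaded file.
    public static string ExtractText(byte[] pdfBytes)
    {
        PDDocument doc = null;
        try
        {
            // Assumption: IKVM maps a .NET byte[] onto the Java byte[] that
            // ByteArrayInputStream expects.
            doc = PDDocument.load(new java.io.ByteArrayInputStream(pdfBytes));
            return new PDFTextStripper().getText(doc);
        }
        finally
        {
            // Always release the document, whether or not extraction worked.
            if (doc != null) doc.close();
        }
    }
}
```

With a FileUpload control, the call would look something like PdfTextSketch.ExtractText(myUpload.FileBytes), where myUpload is whatever your control is named. Returning a string instead of writing a file leaves the choice of output to the caller.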

And there you have it: your PDF is now a (most likely ugly and difficult to parse) text file with your PDF data in it, and it's up to you to figure out how to use it. As you will see, PDFs can (though not always) come out very strangely as text. Tables will often be in an odd order, and engineering an effective and accurate parsing scheme is a new, very case-specific adventure each time. Normally I would offer a download, but these files are pretty big, so I will leave it to the guys at SourceForge.
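
To give a feel for what a case-specific parsing scheme might look like, here is a minimal sketch that pulls "Label: value" lines out of the extracted text. It assumes your PDFs happen to come out with one labeled field per line, which many will not. ExtractedTextSketch and ReadLabeledFields are names invented for this example, and nothing in it touches PDFBox.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtractedTextSketch
{
    // Hypothetical helper: collects "Label: value" lines from extracted text.
    // Labels are matched case-insensitively; a repeated label keeps its last value.
    public static Dictionary<string, string> ReadLabeledFields(string text)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int colon = line.IndexOf(':');
                if (colon <= 0) continue;                 // no label before a colon

                string label = line.Substring(0, colon).Trim();
                string value = line.Substring(colon + 1).Trim();
                if (label.Length == 0 || value.Length == 0) continue;

                fields[label] = value;
            }
        }
        return fields;
    }
}
```

You could feed it the output file with ExtractedTextSketch.ReadLabeledFields(File.ReadAllText(txt_out)). Note that the timestamp line parsePDF writes also contains colons, so it will show up as a bogus field unless you skip it or tighten the match.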

c# | tutorials

Comments

7/23/2009 11:06:39 AM

gladient
I will check tonight how it works :) Thanks for the article!

gladient pl

8/17/2009 10:31:36 AM

 bookkeeping
Hi,

Very nice article.. I will bookmark, pls provide much more information.....you rock..

bookkeeping us
