Using iTextSharp to make table of contents in combined PDF

A table of contents with internal links helps people navigate your PDF files, so it is a desirable feature in customer-facing PDFs. As part of my job we created a tool that lets users merge PDFs, for instance construction manuals or leaflets, into a combined PDF. For this we were using iTextSharp, and everything was working fine. iTextSharp is a .NET port of iText, the popular library Java developers use for working with PDFs.

Just when we thought everything was as it should be, the customer came forward with a request for us to make a table of contents in the combined PDF so their customers could easily navigate between the merged PDFs.

Making a table of contents would be easy, seeing as iTextSharp gives you the page count of each PDF. The problem comes when you want to make internal links between the PDFs. iTextSharp can also be used to make PDFs from HTML, and as such it has the ability to make links within a single PDF. So the natural way to make a table of contents is to write it in HTML. Unfortunately you can’t make internal links between PDFs like that. You can only make internal links within the PDF you are currently creating via HTML. When you try to make a link to the third page from a one-page HTML table of contents you will get an error, even if you combine it with a larger PDF later. Anyway… As is most often the case, there is a workaround for this particular problem. I personally found part of the solution on a forum somewhere, but it’s been a long time and I forgot where it was, so unfortunately I can’t link to it.

The solution consists of two parts: first the HTML that creates the pseudo links, and then the code that turns those fake links into real links.

First, we want to make sure that when we go through our combined PDF we are able to recognize the links we want to create. If you create your table of contents using HTML you are using tables, and the easiest way is to make sure that this HTML creates only the table of contents and that, whenever a row has three td tags, the third is always the link. I chose to make it like so:

string.Format("<tr><td></td><td><div style='FONT-SIZE: 10pt'>{0}</div></td><td align='left'>|{0}|{1}</td></tr>", currentPage, name);
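To show where currentPage could come from, here is a minimal sketch that builds one row per merged PDF from its page count. The TocBuilder class, the document names and the assumption that the merged content starts on page 3 are made up for the illustration; only the format string is taken from the line above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

namespace TocDemo
{
    public static class TocBuilder
    {
        // Builds one <tr> per merged PDF. Each pair holds a document name and
        // its page count; firstContentPage is the page in the combined PDF
        // where the first merged document starts (an assumption of this sketch).
        public static string BuildRows(IList<KeyValuePair<string, int>> documents, int firstContentPage)
        {
            StringBuilder html = new StringBuilder();
            int currentPage = firstContentPage;
            foreach (KeyValuePair<string, int> doc in documents)
            {
                html.Append(string.Format(
                    "<tr><td></td><td><div style='FONT-SIZE: 10pt'>{0}</div></td><td align='left'>|{0}|{1}</td></tr>",
                    currentPage, doc.Key));
                // The next document starts right after this one ends.
                currentPage += doc.Value;
            }
            return html.ToString();
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            List<KeyValuePair<string, int>> documents = new List<KeyValuePair<string, int>>
            {
                new KeyValuePair<string, int>("Construction manual", 12),
                new KeyValuePair<string, int>("Leaflet", 2)
            };
            Console.WriteLine(TocBuilder.BuildRows(documents, 3));

            // The third cell is what the later code splits on '|':
            string[] parts = "|3|Construction manual".Split('|');
            Console.WriteLine(parts[1]); // the target page number
            Console.WriteLine(parts[2]); // the visible link text
        }
    }
}
```

Note that this sketch does not escape the names, so a name containing a pipe or HTML markup would need extra handling.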

When you run your HTML through the iTextSharp tokenizer you need to make sure that it knows that it needs to interpret our link as an Action. This is accomplished like so:

foreach (IElement item in HTMLWorker.ParseToList(reader, this._Styles))
{
    //If there are problems with some of the chars in the pdf not being rendered, you can set the font and encoding in Font.cs in the itextsharp project. Just search for "Change font here!!!"
    if (item.GetType() == typeof(PdfPTable))
    {
        PdfPTable table = (PdfPTable)item;
        table.SetWidths(new int[] { 70, 15, 250 });
        foreach (PdfPRow row in table.Rows)
        {
            PdfPCell cell = row.GetCells()[2];
            if (cell != null)
            {
                Paragraph p = (Paragraph)cell.CompositeElements[0];
                if (p.Content.Contains("|"))
                {
                    Chunk c = new Chunk(p.Content.Split('|')[2], new Font() { Color = BaseColor.WHITE }).SetAction(new PdfAction(p.Content.Split('|')[1]));
                    p.Clear();
                    p.Add(c);
                }
            }
        }
    }
    document.Add(item);
}

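Here we can see that we create a new Chunk and set an Action on it. Because the PdfAction is built from a string, the page number between the pipes ends up in the PDF as the target of a URI action. In this next piece of code we see why: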

private static void ListPdfLinks(PdfReader reader)
{
    //Get the current page
    PdfDictionary PageDictionary = reader.GetPageN(2);

    //Get all of the annotations for the current page
    PdfArray Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

    //Make sure we have something
    if ((Annots == null) || (Annots.Length == 0))
        return;

    //Loop through each annotation
    foreach (PdfObject A_loopVariable in Annots.ArrayList)
    {
        //Convert the itext-specific object as a generic PDF object
        PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A_loopVariable);

        //Make sure this annotation has a link
        if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
            continue;

        //Make sure this annotation has an ACTION
        if (AnnotationDictionary.Get(PdfName.A) == null)
            continue;

        //Get the ACTION for the current annotation
        PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);

        //Test if it is a URI action. NOTE: URI and not URL
        if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
        {
            //Remove the old action, I don't think this is actually necessary but I do it anyways
            AnnotationAction.Remove(PdfName.S);
            //Add a new action that is a GOTO action
            AnnotationAction.Put(PdfName.S, PdfName.GOTO);
            //The destination is an array containing an indirect reference to the page as well as a fitting option
            PdfArray NewLocalDestination = new PdfArray();
            NewLocalDestination.Add((PdfObject)reader.GetPageOrigRef(Convert.ToInt32(AnnotationAction.Get(PdfName.URI).ToString())));
            NewLocalDestination.Add(PdfName.FIT);
            AnnotationAction.Remove(PdfName.URI);
            //Add the array to the annotation's destination (/D)
            AnnotationAction.Put(PdfName.D, NewLocalDestination);
        }
    }
}

This requires some explanation. Usually you would run this method for all the pages in your PDF to make your Actions into GOTO Actions, but seeing as we only want to do it for our table of contents, which was page 2 in my case, we call

PdfDictionary PageDictionary = reader.GetPageN(2);

Every page in a PDF has a collection of Annotations, which can be a bunch of different things. What we do is run through all the Annotations, and when we encounter a link with a URI Action we turn that Action into a GOTO, which works like an internal link in the PDF. This is basically all there is to it.

There is one thing you need to know though. When making or merging PDFs with iTextSharp you use a PdfWriter. For merging there are two standard ones in iTextSharp, one called PdfCopy and one called PdfSmartCopy. As the name implies, PdfSmartCopy is an advanced version of PdfCopy. What it basically does is make sure that images and other embedded content you have only get embedded once. There is however a “bug” in the smart copy, so whenever you have a GOTO Action it goes into an infinite loop. What it does is check the link’s parent page (the table of contents page) to see if that page has any links, and for any link on that page it runs the same method again. So if you have internal links you need to use PdfCopy.

TransactionScope – Why, How and Why not

TransactionScope. A word that should be in the vocabulary of every .NET programmer working with databases. This post gives a brief description of the wonder that is TransactionScope. But before you get all excited, remember to read the entire post to see the caveats at the bottom.

Why

When working with data stored in databases one often finds oneself in a scenario where a large number of rows must be deleted, inserted or updated at a time, most likely in multiple tables. Imagine you have something called a Category, and for each Category you have a bunch of Documents, Fields and maybe something else. When you delete the Category you want to delete all the Documents, Fields and other related data as well, but if something goes wrong with one of the deletes you want to roll back the changes made so far, so the database is not left in an inconsistent state. In most relational databases this could be solved by using foreign keys with cascading deletes, but that only works for deletes. If you want to avoid that, or your commands involve more than simple deletion, there is another alternative.

Transactions
Most relational databases have the concept of transactions built into them. When using transactions you basically ensure that all your database queries are executed as one unit instead of individually. If you are not that strong in the use of, say, SQL Server, the use of transactions can be daunting, but the .NET Framework has a nice wrapper for database transactions.
Using TransactionScope you can get the desired effect of transactions using only your own .NET code. All queries made to the database inside a TransactionScope will be executed, but they are only committed once you tell the TransactionScope to commit the changes. This means that if one of your queries results in an error, none of the changes will be kept, and you ensure the integrity of your data. Following is a small example of how to use TransactionScope.

How

TransactionScope resides in the System.Transactions namespace, and you need to tell your code that you want to use that:

using System.Transactions;

TransactionScope implements IDisposable, so the prettiest way to use it is in a using-statement like so:

using (TransactionScope scope = new TransactionScope())
{
    //a bunch of SQL queries
    scope.Complete();
}

A nice property of TransactionScope is that if you call other methods, or even other classes, inside the TransactionScope and run additional SQL queries in them, those queries still belong to the TransactionScope and are treated just like the other queries made inside the scope. Note the call to scope.Complete();. This is what tells the TransactionScope that the queries may be committed; the actual commit happens when the scope is disposed at the end of the using block, and if Complete was never called the changes are rolled back instead. The constructor of TransactionScope has a couple of overloads that give you the opportunity to specify certain options. In most simple cases this is not necessary, but as you can see in the following section it can be required sometimes.
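The role of scope.Complete() can be observed without a database. The following is a minimal sketch, not production code: it opens an empty scope, listens to the TransactionCompleted event of the ambient transaction and reports how the transaction ended. The ScopeOutcome class is a name I made up for the illustration.

```csharp
using System;
using System.Transactions;

namespace TransactionDemo
{
    public static class ScopeOutcome
    {
        // Runs an empty TransactionScope and reports how the transaction ended.
        public static TransactionStatus Run(bool callComplete)
        {
            TransactionStatus outcome = TransactionStatus.Active;
            using (TransactionScope scope = new TransactionScope())
            {
                // Raised once the transaction has committed or rolled back.
                Transaction.Current.TransactionCompleted += (sender, e) =>
                {
                    outcome = e.Transaction.TransactionInformation.Status;
                };

                // ... database commands would go here ...

                if (callComplete)
                {
                    scope.Complete();
                }
            } // Dispose commits if Complete was called, otherwise it rolls back.
            return outcome;
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("With Complete:    " + ScopeOutcome.Run(true));
            Console.WriteLine("Without Complete: " + ScopeOutcome.Run(false));
        }
    }
}
```

Only the run that calls Complete should report Committed; the other one should end up as Aborted, which is the same thing that happens when an exception leaves the using block before Complete is reached.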

Why not

While TransactionScope brings along a bunch of nice features, it is not without its faults. Some of them are highlighted in this post. In short, using TransactionScope with the empty constructor (like in my example above) sets the isolation level to SERIALIZABLE, which makes your queries prone to deadlocks. And like, for instance, SqlCommand, TransactionScope comes with a timeout setting. The default is one minute, so if your queries have a combined execution time longer than that, your transaction will time out unless you specify a longer timeout.
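Both settings can be changed through one of the constructor overloads. Here is a small sketch of how that could look; the ScopeFactory helper, the choice of ReadCommitted and the two minute timeout are only examples, so pick the values that fit your own queries.

```csharp
using System;
using System.Transactions;

namespace TransactionDemo
{
    public static class ScopeFactory
    {
        // Creates a scope with a less strict isolation level and a custom timeout.
        public static TransactionScope Create(TimeSpan timeout)
        {
            TransactionOptions options = new TransactionOptions();
            options.IsolationLevel = IsolationLevel.ReadCommitted;
            options.Timeout = timeout;
            return new TransactionScope(TransactionScopeOption.Required, options);
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            using (TransactionScope scope = ScopeFactory.Create(TimeSpan.FromMinutes(2)))
            {
                // The ambient transaction now reports the level we asked for.
                Console.WriteLine(Transaction.Current.IsolationLevel);
                scope.Complete();
            }
        }
    }
}
```

Keep in mind that the timeout you ask for is capped by the machine-wide maximum timeout, so very long values may not have the effect you expect.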

A last problem I want to highlight in this post is that if you use multiple different connections in your TransactionScope, the transaction may be promoted to a distributed transaction and will suffer a large performance penalty. So if performance is important in your case you should use the same database connection for all commands in the TransactionScope.

In summary: TransactionScopes can be really helpful in ensuring data integrity by providing a rollback mechanism when encountering errors, but should not be used blindly.

Running a parameterized method in its own thread

The computers of today often have at least two CPU cores, and can as such work with different processes and threads at the same time. This is a feature we would often like to use in our programs when applicable. In most frameworks this has been made easy for us, and the .NET Framework is no exception.
Let’s say we have a program that runs through all the pages of a PDF file and does something for each of them. In this example there are no side effects from using multithreading, but you should be aware that race conditions can occur. Normally what you would do would look something like this:

using System;
using System.Threading;

namespace ThreadDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            //... Code to find out the number of pages in the pdf.
            int numberOfPages = 10;
            for (int i = 1; i <= numberOfPages; i++)
            {
                Thread t = new Thread(Whatever);
                t.Start();
            }
        }

        public static void Whatever()
        {
            Console.WriteLine("Writing something");
            Console.Read();
        }
    }
}

This works fine as long as the method you are calling doesn’t take any parameters. The reason for this is that the Thread constructor takes a delegate as a parameter, and in this example the method name counts as that delegate. But let’s say we wanted to pass along a string as a parameter. Then we would have to write a delegate that tells the Thread what code to run. It would look like this:

using System;
using System.Threading;

namespace ThreadDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            //... Code to find out the number of pages in the pdf.
            int numberOfPages = 10;
            for (int i = 1; i <= numberOfPages; i++)
            {
                string page = "current pagenumber is " + i;
               
                ThreadStart starter = delegate() { Whatever(page); };
                Thread t = new Thread(starter);
                t.Start();
            }
        }

        public static void Whatever(string str)
        {
            Random random = new Random();
            //Used to show that the threads don't run in FIFO order
            Thread.Sleep(random.Next(0, 1000));
            Console.WriteLine(str);
            Console.Read();
        }
    }
}

The output of the above code could look like this:
[Image: multithreading output example]

The reason we don’t use i directly inside the delegate is that the delegate would capture the variable i itself rather than its current value, and then we would get side effects when the loop changes it before the thread gets to use it.
If we want to, we can also inline the delegate like so:

            for (int i = 1; i <= numberOfPages; i++)
            {
                string page = "current pagenumber is " + i;
                Thread t = new Thread(delegate() { Whatever(page); });
                t.Start();
            }
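The problem with using i directly can be shown without any threads at all. In this small sketch (the Capture class is made up for the illustration) the delegates are stored during the loop and only invoked afterwards, which is roughly what happens when a thread starts late.

```csharp
using System;
using System.Collections.Generic;

namespace CaptureDemo
{
    public static class Capture
    {
        // Captures the loop variable itself; every delegate sees its final value.
        public static List<int> SharedVariable(int count)
        {
            List<Func<int>> readers = new List<Func<int>>();
            for (int i = 1; i <= count; i++)
            {
                readers.Add(delegate() { return i; });
            }
            return Invoke(readers);
        }

        // Copies the value into a variable declared inside the loop body,
        // so each delegate gets its own copy.
        public static List<int> PerIterationCopy(int count)
        {
            List<Func<int>> readers = new List<Func<int>>();
            for (int i = 1; i <= count; i++)
            {
                int page = i;
                readers.Add(delegate() { return page; });
            }
            return Invoke(readers);
        }

        private static List<int> Invoke(List<Func<int>> readers)
        {
            List<int> values = new List<int>();
            foreach (Func<int> reader in readers)
            {
                values.Add(reader());
            }
            return values;
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine(string.Join(", ", Capture.SharedVariable(3)));   // 4, 4, 4
            Console.WriteLine(string.Join(", ", Capture.PerIterationCopy(3))); // 1, 2, 3
        }
    }
}
```

With real threads the result of the first variant is less predictable, since it depends on when each thread happens to read i, but the cause is the same.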

Finally, you can exploit the delegate’s functionality and, instead of calling a method in it, simply write the code you want to run. This should only be done if you don’t plan to run the same piece of code in other places in your program; as a rule you should call an existing method, both for readability and reusability. But if you want to inline the whole thing, it can be done like this:

using System;
using System.Threading;

namespace ThreadDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            //... Code to find out the number of pages in the PDF.
            int numberOfPages = 10;
            for (int i = 1; i <= numberOfPages; i++)
            {
                string page = "current pagenumber is " + i;
                Thread t = new Thread(delegate()
                {
                    Random random = new Random();
                    //Used to show that the threads don't run in FIFO order
                    Thread.Sleep(random.Next(0, 1000));
                    Console.WriteLine(page);
                    Console.Read();
                });
                t.Start();
            }
        }
    }
}
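As an alternative to the anonymous delegates used above, the framework also has a ParameterizedThreadStart delegate, which lets you hand a single object to Thread.Start. The sketch below is one way it could be used; the Join calls are my addition so the program waits for all threads before exiting.

```csharp
using System;
using System.Threading;

namespace ThreadDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            //... Code to find out the number of pages in the PDF.
            int numberOfPages = 10;
            Thread[] threads = new Thread[numberOfPages];
            for (int i = 1; i <= numberOfPages; i++)
            {
                threads[i - 1] = new Thread(new ParameterizedThreadStart(Whatever));
                // The argument is evaluated here, so each thread gets its own string.
                threads[i - 1].Start("current pagenumber is " + i);
            }

            // Wait for all threads to finish.
            foreach (Thread t in threads)
            {
                t.Join();
            }
        }

        // ParameterizedThreadStart requires a single parameter of type object.
        public static void Whatever(object state)
        {
            string str = (string)state;
            Console.WriteLine(str);
        }
    }
}
```

The price is that the parameter is typed as object, so you lose compile-time type checking and have to cast inside the method. That is one reason the delegate approach is often nicer.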

Scraping webpages using WebRequest and regular expressions

If you ever find yourself in a situation where your program needs to be able to obtain some information from a webpage on the interwebs, you don’t have to fret about it. It is fairly simple, and in this post I will show one of the ways to do it.
I have made a small C# console application that reads the input from a user, then makes a Google search and prints out the total number of search results for the given string.
I’ll start by showing the code, and afterwards explain the key points.

using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace ScrapingDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            string search = Console.ReadLine();
            string re = @"http://www.google.com/search?hl=en&q=" + search.Replace(" ", "+");
            HttpWebRequest request = WebRequest.Create(re) as HttpWebRequest;
            try
            {
                StreamReader reader = new StreamReader(request.GetResponse().GetResponseStream());

                string matchString = @"Results .* of about <b>(?<totalNumber>.*)</b> for <b>";

                Match match = Regex.Match(reader.ReadToEnd(), matchString, RegexOptions.IgnoreCase);

                string number = match.Groups["totalNumber"].Value;
                Console.WriteLine(number);
                reader.Dispose();
                request.KeepAlive = false;
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
            }
            Console.Read();
        }
    }
}

As you can see the code is small and fairly easy to read. First we need to include some namespaces:

using System.Net;
using System.IO;
using System.Text.RegularExpressions;

Then we have to make the WebRequest.

string search = Console.ReadLine();
string re = @"http://www.google.com/search?hl=en&q=" + search.Replace(" ", "+");
HttpWebRequest request = WebRequest.Create(re) as HttpWebRequest;

Here we use the WebRequest.Create method, which can take either a string or a Uri as parameter, and store the result as an HttpWebRequest. The reason we have to cast it is that WebRequest.Create is declared to return the abstract class WebRequest, which only exposes the members common to all request types. Fortunately the object it creates for an http URL is an HttpWebRequest, which inherits from WebRequest and adds the HTTP-specific members such as KeepAlive.
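One thing to be aware of is that search.Replace(" ", "+") only takes care of spaces. If the user types characters like & or #, they change the meaning of the URL. A hedged alternative is to let Uri.EscapeDataString encode the whole search string; the SearchUrl helper below is a made-up name for the illustration.

```csharp
using System;

namespace ScrapingDemo
{
    public static class SearchUrl
    {
        // Builds the search URL with every reserved character percent-encoded.
        public static string Build(string search)
        {
            return "http://www.google.com/search?hl=en&q=" + Uri.EscapeDataString(search);
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            // prints http://www.google.com/search?hl=en&q=c%23%20%26%20pdf
            Console.WriteLine(SearchUrl.Build("c# & pdf"));
        }
    }
}
```

Spaces come out as %20 instead of +, which web servers normally treat the same way in a query string.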
We then use a StreamReader to get the response.

StreamReader reader = new StreamReader(request.GetResponse().GetResponseStream());

You should always do this inside a try-catch block, because IO business is very error prone :). If you want to be able to match several regular expressions against the response, you should store the output from reader.ReadToEnd() in a string, but in this example it suffices to inline the use in our Match.
The Regex.Match method finds the first occurrence of the pattern we give as the second parameter, and since we wrote (?<totalNumber>.*), it stores all the chars matched by that .* in a group called totalNumber.

//Be aware that if Google changes the layout of their page this matchString might not be correct anymore
string matchString = @"Results .* of about <b>(?<totalNumber>.*)</b> for <b>";
            
Match match = Regex.Match(reader.ReadToEnd(), matchString, RegexOptions.IgnoreCase);

You can have as many groups as you like in a single match, but of course the groups will only get a value if the entire regular expression matches a piece of the input string. We then take the value of that group and print it to the console.

string number = match.Groups["totalNumber"].Value;
Console.WriteLine(number);
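To try out named groups without going online, you can run the pattern against a hard-coded string. The sample text below is something I made up to resemble the old result line, and the stricter pattern with two extra groups is a variation of mine, not the one used in the program above. The sketch also checks match.Success, since the groups are empty when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ScrapingDemo
{
    public static class ResultParser
    {
        // Returns the text of the totalNumber group, or null when the pattern does not match.
        public static string FindTotal(string html)
        {
            string pattern = @"Results (?<first>\d+) - (?<last>\d+) of about <b>(?<totalNumber>[\d,.]+)</b> for <b>";
            Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
            return match.Success ? match.Groups["totalNumber"].Value : null;
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            // Made-up sample input, not real output from Google.
            string sample = "<div>Results 1 - 10 of about <b>1,230,000</b> for <b>itextsharp</b></div>";
            Console.WriteLine(ResultParser.FindTotal(sample)); // 1,230,000
            Console.WriteLine(ResultParser.FindTotal("something else") ?? "(pattern not found)");
        }
    }
}
```

Using [\d,.]+ instead of .* keeps the group from swallowing more than the number if the page contains several bold tags on the same line.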

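And finally we dispose of our StreamReader and make sure that our WebRequest does not keep an open connection to the web server. Be aware that KeepAlive controls a header sent with the request, so to affect this request it would have to be set before GetResponse is called.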

reader.Dispose();
request.KeepAlive = false;

Instead of using the reader.Dispose method directly, you can use the using statement in the .NET Framework like this:

using (StreamReader reader = new StreamReader(request.GetResponse().GetResponseStream()))
{
    string matchString = @"Results .* of about <b>(?<totalNumber>.*)</b> for <b>";
    Match match = Regex.Match(reader.ReadToEnd(), matchString, RegexOptions.IgnoreCase);

    string number = match.Groups["totalNumber"].Value;

    Console.WriteLine(number);
}

The using statement works as a scope: as soon as the code inside the scope has executed, the object declared in the using statement, here the reader, is disposed automatically.

Using Process class to execute program and reading output

Sometimes you find yourself in a situation where you have to do something with your program, and can’t find a framework that can help you do it, and you know it would be a lot of work writing the code yourself. At times like these it can be beneficial to use a command line tool instead.
I wanted to take a PDF file and make each page into its own swf file. To do this by myself would have taken a hell of a lot of time, so I turned to the tool pdf2swf, which can be found at the Swftools homepage. I then needed to call the program from my C# program. To do that I used the Process class from the System.Diagnostics namespace. The Process class can be used for a bunch of things concerning processes, but I’ll only use it to start a new process in this example. The code for starting pdf2swf with Process looks like this:

First you need to declare that you are using the System.Diagnostics namespace:

using System.Diagnostics;

And then you write the following code in a method:

Process proc = new Process();

//The path to the .exe you want to run.
proc.StartInfo.FileName = @"C:\PATH\TO\EXE\pdf2swf.exe";

//Set to false to prevent the program from using the operating system shell.
proc.StartInfo.UseShellExecute = false;

//The arguments to pass to the program you want to run.
proc.StartInfo.Arguments = pdfPath + " -o " + newFileName + ".swf";

//Start the process.
proc.Start();

//Optional, not used if you wanna use the process multithreaded.
proc.WaitForExit();

//Remember to close the process when you are done.
proc.Close();

Sometimes you want to read the output, to determine different things. Maybe to try to see whether pdf2swf failed and you need to do it again with some other options. I had to make jpg files of the swf files, and for that I needed to know the size of the swf files, to pass to the converter. Fortunately for me, pdf2swf writes, among other things, the resolution of the swf file as output, and from that you can get the height and width. To do that I had to listen to the output. To listen, you need to set some more options for the Process object. The code to do that and get the output in a string looks like this:

//Needs to come before the proc.Start(). Tells the process to redirect the output to the Process.StandardOutput stream.
proc.StartInfo.RedirectStandardOutput = true;

//Saves the output in a string.
string output = proc.StandardOutput.ReadToEnd();

And you can now use regular expressions or whatever you like to get the desired information from the output string.
Just to recap. The full code for the example looks like this:

Process proc = new Process();

//The path to the .exe you want to run.
proc.StartInfo.FileName = @"C:\PATH\TO\EXE\pdf2swf.exe";

//Set to false to prevent the program from using the operating system shell
proc.StartInfo.UseShellExecute = false;

//Tells the process to redirect the output to the Process.StandardOutput stream.
proc.StartInfo.RedirectStandardOutput = true;

//The arguments to pass to the program you want to run.
proc.StartInfo.Arguments = pdfPath + " -o " + newFileName + ".swf";

//Start the process
proc.Start();

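//Saves the output in a string. Read the output before waiting, otherwise a full output buffer can make the two processes wait on each other.
string output = proc.StandardOutput.ReadToEnd();

//Optional, not used if you wanna use the process multithreaded.
proc.WaitForExit();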

//Remember to close the process when you are done.
proc.Close();
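If you don’t have pdf2swf at hand, the same steps can be tried with a command that exists on most machines. The sketch below wraps the steps in a made-up CommandRunner helper and runs the shell’s echo command; the choice between cmd.exe and /bin/sh is an assumption about the machine it runs on. It also returns the exit code, which is often the simplest way to see whether the tool failed.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

namespace ProcessDemo
{
    public static class CommandRunner
    {
        // Runs a program, returns its standard output and reports its exit code.
        public static string Run(string fileName, string arguments, out int exitCode)
        {
            using (Process proc = new Process())
            {
                proc.StartInfo.FileName = fileName;
                proc.StartInfo.Arguments = arguments;
                proc.StartInfo.UseShellExecute = false;
                proc.StartInfo.RedirectStandardOutput = true;
                proc.Start();

                // Read first, then wait: a child that fills the output buffer
                // would otherwise block while we block in WaitForExit.
                string output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit();
                exitCode = proc.ExitCode;
                return output;
            }
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            bool windows = RuntimeInformation.IsOSPlatform(OSPlatform.Windows);
            string shell = windows ? "cmd.exe" : "/bin/sh";
            string arguments = windows ? "/c echo hello" : "-c \"echo hello\"";

            int exitCode;
            string output = CommandRunner.Run(shell, arguments, out exitCode);
            Console.WriteLine(output.Trim() + " (exit code " + exitCode + ")");
        }
    }
}
```

The using block takes the place of the explicit proc.Close() call, since Process implements IDisposable.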

A final thought. Before you try to use a tool in a multithreaded way, make sure that it supports multithreading. And as a bonus, I can divulge that pdf2swf supports multithreading, but Swf to Image does not.

Using Strategy Pattern

The Strategy Pattern can be a huge help in providing maintainability to your code. Often when you write a program it starts out having a small set of features that are easy to manage. But as that project grows, you add more features to the list, and suddenly there’s a place in your program where the behavior varies according to different criteria. If you are lucky you only have two different behaviors, but it is often the case that you don’t really know how many different behaviors your program needs to support.
When that is the case, what often happens is that you write a lot of if-then-else statements, and your code becomes cluttered. To illustrate this I’ve taken some code from an application I wrote some time ago. A quick explanation of the program: It takes a bunch of files (episodes of different series, like “How I Met Your Mother”), and formats the filename according to some pattern you set in the properties. But the important part is that it goes online to find the name of the episode based on the season and episode number. This is where the different behaviors come into play. There is a multitude of different websites which contain the information on what the episode names are for a given series – IMDB.com and epguides.com being two big ones, and sometimes you have to change between different providers to get a correct result.

Let’s say we have a C# program that contains a method which takes as parameter the name of a television series, and finds the data detailing the episodes in that series, and adds that to a dictionary. Having two different providers would make it look something like this:

private string provider = "imdb"; //Changed in some other method when we need it to change.

public void FindResponseCache(string serieName)
{
    try
    {
        if (provider == "imdb")
        {
            //... Code to find the responseCache
        }
        else if (provider == "epguides")
        {
            //... Code to find the responseCache   
        }
    }
    catch (Exception)
    {
        ErrorText.Text += "Could not find series: " + serieName + ". Check for spelling errors.";
    }
}

This could work if we only had the two different providers, but if we added more it would quickly become difficult to keep track of. This is where we can use the strategy pattern.
As we can see, the two branches of the if-statement do the same thing. The only difference is the way they do it. When this happens, we can abstract the responsibility of finding a response cache away from the main program and let some other class handle it. What we do when using the strategy pattern to solve this problem is define an interface that declares the method(s) we need in order to get the expected result. In our case we make an interface called IFindingStrategy, which has one method named GetResponseCache and looks like this:

namespace SerieRenamer
{
    interface IFindingStrategy
    {
        
        //Specifies how to get the responseCache.
        string GetResponseCache(string serieName);
    }
}

Each time we add a new provider, we simply make a new class that implements our IFindingStrategy interface, for example IMDB:

namespace SerieRenamer
{
    class IMDBStrategy : IFindingStrategy
    {
        public string GetResponseCache(string serieName)
        {
            //... Code that gets the job done.
        }
    }
}

The point of all this, is that it gives us the opportunity to simplify our FindResponseCache method like so:

private IFindingStrategy FindingStrategy = new IMDBStrategy(); //Changed in some other method when we need it to change.

public void FindResponseCache(string serieName)
{
    try
    {
        FindingStrategy.GetResponseCache(serieName);
    }
    catch (Exception)
    {
        ErrorText.Text += "Could not find series: " + serieName + ". Check for spelling errors.";
    }
}

In this way we don’t need to scroll through a lot of if-then-else or switch cases to add other providers, or modify the ones we have already, and the added readability is a gift in itself.
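To see the pieces working together, here is a small self-contained sketch. The strategies only return placeholder strings instead of going online, and EpguidesStrategy, EpisodeFinder and UseProvider are names I made up for the illustration. The dictionary is just one possible way to handle the “changed in some other method” part.

```csharp
using System;
using System.Collections.Generic;

namespace SerieRenamer
{
    public interface IFindingStrategy
    {
        //Specifies how to get the responseCache.
        string GetResponseCache(string serieName);
    }

    // Placeholder implementations: the real ones would fetch data from the provider.
    public class IMDBStrategy : IFindingStrategy
    {
        public string GetResponseCache(string serieName)
        {
            return "imdb data for " + serieName;
        }
    }

    public class EpguidesStrategy : IFindingStrategy
    {
        public string GetResponseCache(string serieName)
        {
            return "epguides data for " + serieName;
        }
    }

    public class EpisodeFinder
    {
        // Adding a provider means adding one class and one entry here.
        private readonly Dictionary<string, IFindingStrategy> strategies =
            new Dictionary<string, IFindingStrategy>
            {
                { "imdb", new IMDBStrategy() },
                { "epguides", new EpguidesStrategy() }
            };

        private IFindingStrategy findingStrategy = new IMDBStrategy();

        // Switches provider at runtime; throws KeyNotFoundException for unknown names.
        public void UseProvider(string provider)
        {
            findingStrategy = strategies[provider];
        }

        public string FindResponseCache(string serieName)
        {
            return findingStrategy.GetResponseCache(serieName);
        }
    }

    public class Program
    {
        public static void Main(string[] args)
        {
            EpisodeFinder finder = new EpisodeFinder();
            Console.WriteLine(finder.FindResponseCache("How I Met Your Mother"));

            finder.UseProvider("epguides");
            Console.WriteLine(finder.FindResponseCache("How I Met Your Mother"));
        }
    }
}
```

The calling code never needs to know which provider is active, which is the whole point of the pattern.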