Guide to sharing documents

Introduction

This guide focuses on how to produce a digital document that is worth sharing, namely the technical aspect: the trade-off between quality and file size. It aims to help you get the most out of that trade-off. Too many people share low quality documents that are too big. That wastes not only bandwidth but also time, because others will have to scan the same document again later, or at least spend considerable time cleaning it up. Scanning documents is a lot of work, and it's a pity if it has to happen twice.

This guide can be split into the following parts:

  1. How to scan a document and process the images in Photoshop
  2. How to put the images in a pdf
  3. How to clean up low quality pdfs

Unless you already know the subject well, I recommend reading the whole guide from front to back, because the later parts assume that you have mastered the earlier ones.

Software you will need to complete this guide (there may be open source tools, e.g. the GIMP, that can achieve the same thing, but I'm not up to date on whether they can really do it all, so I will use the professional tools in this guide):

  1. Photoshop
  2. JPEG 2000 plugin for Photoshop, which you should copy to the \Plug-Ins\File Formats\ folder in your Photoshop install directory.
  3. Acrobat Professional
  4. A sheet of black paper (if you want to scan documents, you can get it in art shops)

If you want to scan documents, needless to say you need a scanner. Some basic familiarity with Photoshop will definitely help you through this guide, but don't let that stop you from trying. I try to explain everything in a way that allows beginners to follow, and advanced users to still learn something new.

How to scan documents

Read the documentation of your scanner and get a bit familiar with the scanner software. If it has different modes, you'll want the most advanced mode where you can control the scanning process as much as possible.

Open Photoshop. You should try to get your scanner working from within Photoshop if possible. In the menus, go to File>Import. In my case this has an entry Epson Perfection V200. If all goes well, you will see your scanner here. Try it. If your scanner software pops up, you're set to go.

Types of documents

Figure-01.png
figure 1: A picture which uses halftone to simulate gray.

We will now go over the different settings you can use when scanning different types of documents. This is not to be taken lightly: scanning documents improperly is the main cause of all the low quality material that circulates on the net.

Grab a book off your shelf. We will use that one as a test case. Open it in the middle and have a look. There are 3 main types of documents/pages and each is scanned in a different way. If there are colors, this is a color document/page. If there is gray ink, it is a grayscale document/page. If there is only white paper and black ink, it is a B/W document/page. Note the important difference between grayscale and B/W.

There is also a difference between the type of page vs the type of document. If you have mainly B/W text, but it has illustrations here and there, the document as a whole is mainly B/W, which is better (smaller file size) than grayscale, which is better than color. You don't even have to consider pages as a whole. Many pages are B/W, but have gray or color illustrations on them. We can then achieve the best result by scanning the page twice, keeping the whole page B/W, and sticking the illustration on top of it in Acrobat, but that is described later.

As for illustrations: line art and images that use halftone dither, like figure 1, count as B/W, whereas photographs mostly use gray ink and look bad when scanned as B/W.

Resolution

Now that we know how to recognize the type of document at hand, another thing to decide is at what resolution to scan it. Resolution is expressed as dpi (dots per inch). In general the higher the resolution, the longer it takes to scan, but more importantly, the bigger the file size, and the higher the quality of the resulting image. The exact choice of resolution might differ from document to document. If you are scanning an art photography book you might want to go for some extra quality. Never guess, always try different settings and compare the file sizes and quality until you find a good balance. Next be consistent. If we consider the art photography book again, if you determined that 400dpi grayscale is good for the grayscale photographs, then use it throughout for all the photographs in this book. Or if you scan a book as B/W, don't scan some entire pages as grayscale, because it will look bad. Especially since often in grayscale the white paper tends to come out not entirely white…

The choice of resolution further depends on the type of document. A B/W scan will usually be done at 600dpi. Color and gray already look good at 300dpi. Lean back from the screen a bit (about 1.5 m) and have a look at figure 2, which shows the same character four times, scanned with different settings. Keep in mind that the character you are looking at is merely 4 mm high on paper, so you can't see the fine details at that size, at least not without a magnifying glass.

Figure-02.png
figure 2: from left to right: 300dpi/gray, 300dpi/B/W, 600dpi/B/W, 600dpi/gray

Now I hope we agree on the following: at the same resolution, gray looks better than B/W, and 600dpi is far better than 300dpi regardless of the color setting. From experience I propose this conclusion: 300dpi B/W is a bit low on quality (I'm a perfectionist), whereas 600dpi gray is scandalous in file size. That leaves the other two options as practical choices. Of these, 600dpi B/W looks best. Fortunately, it turns out that it also compresses better than 300dpi gray. This justifies choosing 600dpi B/W for scanning text and line art documents.
Furthermore, B/W has a quality advantage of its own. Scanned paper never looks completely white, so you won't get an even background in grayscale, but you will in B/W. The result matches reality, because there is only white paper and black ink to distinguish. I'd say try it out: print it (obviously you have to print at 600dpi as well) and compare your printout with the original. You will be very pleased with the quality.
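
To get a feel for why bit depth matters as much as resolution, the uncompressed size of a page can be worked out by hand. The sketch below is only an illustration: the A4 page size and the function names are assumptions, and it says nothing about compressed sizes, which depend on the content of the page.

```python
def pixel_dimensions(width_mm, height_mm, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    mm_per_inch = 25.4
    return (round(width_mm / mm_per_inch * dpi),
            round(height_mm / mm_per_inch * dpi))


def raw_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size; rows are padded to whole bytes (an assumption)."""
    width_px, height_px = pixel_dimensions(width_mm, height_mm, dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


A4 = (210, 297)  # assumed page size in millimetres
for dpi, bits, label in [(300, 1, "300dpi B/W"), (300, 8, "300dpi gray"),
                         (600, 1, "600dpi B/W"), (600, 8, "600dpi gray")]:
    size = raw_size_bytes(*A4, dpi, bits)
    print(f"{label}: {size / 1_000_000:.1f} MB uncompressed")
```

Even before compression, a 600dpi B/W page needs about half the raw data of a 300dpi gray page: it has four times the pixels, but stores 1 bit per pixel instead of 8.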

Optical Character Recognition

Isn't OCR the best quality/file size trade-off? Yes, it is. But saving OCR'ed documents has several problems. First of all, OCR software makes mistakes. Most software tackles this with a spelling checker that lets you correct all the dubious words. Needless to say, this is time consuming and some errors always remain, but that's just the beginning. If the document is about a special subject like mathematics, music or programming, with special characters and notation, even the best OCR software is brought to its knees. The best software is also weak when it comes to reconstructing layout and fonts. In practice, you will have to reconstruct the layout in InDesign if you really want it to look like the original. Font Expert is a program that will help you recognize a font, or at least find one that is close. OCR with manual reconstruction might be your only choice if you get a bad quality scan, want to reconstruct a quality document from it, and have no other way of obtaining it, or if you want to learn desktop publishing and need a practice subject.

If your main concern is that you want documents to be searchable, note that Acrobat Pro has an OCR feature that allows you to save the image with the text underneath, which makes a document searchable without the work and the layout problems. Given that a 600dpi B/W page can usually be saved in less than 50 KB (10 MB for a 200-page book), it is usually not worth the hassle to OCR it.
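
The size estimate above is simple arithmetic, sketched here so you can plug in your own numbers. The function name is made up for this example, and 1 MB is taken as 1000 KB.

```python
def document_size_mb(pages, kb_per_page):
    """Total size in MB, using 1 MB = 1000 KB."""
    return pages * kb_per_page / 1000


# 200 pages at the 50 KB per page mentioned in the text.
print(document_size_mb(200, 50), "MB")
```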

The actual scanning

Let's get our hands dirty. Scanning should be done from Photoshop usually, so you can immediately check and correct the results and save them away. It's a bad idea to scan an entire book first to notice afterwards that you haven't been paying attention and three quarters of the pages are rotated, or some color setting was not completely ideal, especially considering how much work scanning is.

Black and White pages

OK, let's start scanning. If you have a book with covers, which are often in color, ignore them for now. We will first do the bulk of the work. I assume the simplest case first: a B/W document. Try to be consistent. Find the best way to position a page of the given format on the scanner, and check both sides, as they will be different. Corners are often the best things to help you align properly. Always try to find a way to align the side of the page with the edge of your scanning surface, so that your pages will be straight. Now remember this, and scan all your pages in the same way. You will have to press the book onto the glass a bit, so the paper is flat against it, otherwise your text will be skewed. Check the software of your scanner for the input and output size, and be consistent here too. If you are scanning an A4 document, make sure that both the input and the output are A4.

Note: in theory it could be handy to have your images without the margins, and then in Acrobat you could print them to a pdf with the desired page size. The problem with this is that I have not found a way to crop the margins automatically, and doing it by hand is too much work. The reason all this is important is for printing out the document again. If you print it from Acrobat using the shrink to printable area setting, and you have scanned with margins, then you will have double margins, and the content ends up reduced in size. If you print with scaling set to none, you need to make sure that all pages are consistent in size, because if some are too big, you have a problem. If the format you print on is different from the original, you have no choice but to fit to the printable area. Consider a Letter document that you print on A4 paper, which is narrower. If you don't fit to the printable area, you might lose content on the sides. So the ideal situation would be to scan without margins and to print with fit to printable area, but scanning without margins might be a problem unless you find a way to automate cropping the margins. If we find one here at the filesharing portal, we will update this guide.
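
The Letter on A4 case can be put into numbers. This is a minimal sketch that ignores the unprintable margins of a real printer; the function name is an assumption for illustration, not an Acrobat setting.

```python
def fit_scale(content_mm, paper_mm):
    """Largest uniform scale at which the content fits on the paper.

    Both arguments are (width, height) in millimetres. Printer margins
    are ignored here, which is a simplification.
    """
    content_w, content_h = content_mm
    paper_w, paper_h = paper_mm
    return min(paper_w / content_w, paper_h / content_h)


LETTER = (215.9, 279.4)
A4 = (210.0, 297.0)

scale = fit_scale(LETTER, A4)
print(f"Letter on A4: scale to {scale:.1%}")
# Without scaling, this much width would fall off the sheet:
print(f"Width lost at 100%: {LETTER[0] - A4[0]:.1f} mm")
```

The width is the limiting side here, so a page scanned with its margins shrinks a little further on top of the margins it already carries.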

Scanning B/W has one additional setting: the threshold. It usually is a number between 0 and 255. Everything darker than your threshold will be black, and anything lighter will be white. Take a representative page from your document with fine details like italic text, and test several settings to find the best result before you start scanning your book. You should not have to change this between pages of the same document.
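
What the threshold does can be sketched in a few lines. The pixel values below are invented for illustration, and treating a value exactly equal to the threshold as white is an assumption; your scanner software may decide differently.

```python
def apply_threshold(gray_values, threshold):
    """Map 8-bit gray values (0 = black, 255 = white) to pure black or white.

    Values below the threshold become black. Treating a value equal to
    the threshold as white is an assumption; scanners may differ.
    """
    return [0 if value < threshold else 255 for value in gray_values]


# A hypothetical row of pixels crossing the thin stroke of an italic letter.
row = [250, 240, 170, 90, 30, 110, 200, 245]
for threshold in (64, 128, 192):
    print(threshold, apply_threshold(row, threshold))
```

A low threshold thins the stroke down to its darkest pixel, while a high one fattens it, which is why fine details are the thing to check when you test settings.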

OK, now scan. Once you get the hang of it and all is going smoothly, you can easily scan several pages to Photoshop before processing them all, but at first go one at a time, until you are confident that you won't make any mistakes while scanning.

Grayscale or color images or pages

When scanning color or grayscale, set your scanner settings to 24 bit color or to 8 bit gray. Higher values are usually overkill, unless you are scanning something really special. Halftone is often used for mixing colors in print. This causes problems when scanning, because the scan lines form a pattern with the printed lines, which shows through the image as in figure 3. This phenomenon is called moiré. If you have this problem, check your scanner software for a setting called descreen. CD covers and newspapers are the worst in this regard, followed by magazines, then books. Real photographs do not suffer from this problem at all. It is advisable to have your scanner software take care of this, because you will most likely not get an equally good result from post-processing in Photoshop. My scanner actually goes back and forth when scanning with descreen. It can be especially challenging to get crystal clear text on CD covers, since the descreen function uses a blur to remove the moiré. There is also a Photoshop plugin for descreening, and its makers claim it is better than most scanner software. Judge for yourself. An alternative might be rotating your original a bit and/or scanning at a higher resolution. Experimenting will be your best bet.

Figure-03.png
figure 3: Part of a CD cover suffering from moiré. Scanned first without any adjustments, and next with descreen.

Another problem that may arise with color and grayscale pages is that text on the back of the page may shine through in your scan. This is because the light passes through the page, is reflected by the next page and comes back into the scanner. To avoid this, put a sheet of black paper behind the page you are scanning.
Also experiment with all the settings of your scanner software regarding contrast, levels and color balance until you get good results (e.g. white paper should really be white).

Processing in Photoshop

Now what will we be doing from Photoshop once the image comes in? If you are unfamiliar with Photoshop, open the help and read a bit about actions. Actions allow you to automate specific tasks. With something as repetitive as scanning all pages of a book, actions will come in handy for sure. For example I use an action that with a function key, opens the save as dialog in the correct folder, letting me choose a filename, after which it will close the file. This may seem silly, but it saves you several mouseclicks on each page, which quickly becomes worth it.

Black and White images

The first thing to do in Photoshop is to get rid of the black stuff around the edges. Activate the select tool, select the areas around the edges and press delete. As long as your background color is set to white, this will erase redundant black pixels. If your pages all have pretty big margins, and you are pretty sure you will position them identically on the scanner, you could create an action to erase for example 1 cm around all edges of the page. Make sure you never delete anything useful. A page that has large white areas will often show dust speckles. It might be worth cleaning those up too.

Sometimes a page is not straight. To produce a decent document, you should rotate it. Rotation doesn't work on bitmap (B/W) images, so you will have to convert to grayscale before rotating and back to bitmap afterwards. Make sure you don't change the resolution, and use 50% threshold when converting back to bitmap. Find a straight line in your page, like a line of text, and zoom in at the beginning of this line. Now select the ruler tool (it tends to hide under the eyedropper). Take a good anchor point, like the bottom of a letter. Drag right until the end of the line and drop it at the bottom of a letter there. The ruler tool has now measured the angle at which your page is rotated. In the menus, go to Image>Rotate canvas>Arbitrary… The dialog should already show the measured angle. Click OK. Your image should now be straight. I use an action for this too, with a stop in the middle to let me measure the angle.
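
The angle the ruler measures is plain trigonometry on the two points you clicked. A minimal sketch, with made-up coordinates and a hypothetical function name; the sign convention (y grows downwards, as in image coordinates) is an assumption and says nothing about which direction Photoshop reports.

```python
import math


def skew_angle_degrees(start, end):
    """Angle of the line from start to end, relative to horizontal.

    Points are (x, y) pixel coordinates with y growing downwards, so a
    positive result means the line drops towards the right.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical anchor points at the bottom of the first and last letter.
angle = skew_angle_degrees((120, 800), (4200, 860))
print(f"The text line is tilted by {angle:.2f} degrees")
```

The longer the line you measure along, the less a one-pixel error in picking the anchor points matters, which is why a full line of text is a good choice.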

One more thing that comes in handy sometimes is centering your image. Activate the move tool and use the arrow keys of your keyboard to move the image. If, due to the way you scan your document, every page needs to be moved, you can count the number of times you press the arrows to keep it all consistent. Or even better, make an action with a function key for it!

Now you are ready to save this page. Save it as a tiff, which is a handy format because it uses lossless compression. The settings that pop up should all be right. LZW and ZIP compression are both fine. It doesn't matter which one is smaller, because we will use yet another compression later. As for filenames, to save yourself some trouble later, choose a naming format that will keep the files in order (alphabetically). I like to use the page numbers that are actually printed on the page, so you can easily find things back if you lose count.
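
One pitfall with page numbers in filenames is that alphabetical order is not numerical order unless the numbers are padded with zeros. The sketch below shows the difference; the "page" prefix and the four digits are arbitrary choices for this example.

```python
def page_filename(page_number, extension="tif", digits=4):
    """Zero-padded file name so alphabetical order matches page order."""
    return f"page{page_number:0{digits}d}.{extension}"


pages = [1, 2, 10, 100]
unpadded = sorted(f"page{n}.tif" for n in pages)
padded = sorted(page_filename(n) for n in pages)
print(unpadded)  # page10 and page100 sort before page2
print(padded)    # same order as the page numbers
```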

Color and grayscale images

Try to get the settings of your scanner right, so that you have to do as little post-processing in Photoshop as possible. You might still want to improve levels, color balance and sharpening. When ready, save the images in JPEG 2000 format. This is an improved compression standard based on JPEG. It is the best compression scheme for color and grayscale, except for line art, because it tends to give ugly artifacts on large areas of a single color and on sharp edges. You don't usually get images like that when scanning, so JPEG 2000 is the best format here. Play with the quality setting until you are satisfied with the quality/size trade-off. In the advanced options, I choose float and tile size 1024 because that gives the smallest file size.

Some covers only really use some line art and a handful of colors. In that situation, it may strike you that jpeg 2000 is not really the best compression format. Ideally you would like to have the large even areas really in 1 color instead of the nuances a scan gives (like paper grain etc). Then you can save them as a png with only a few colors. This will look neater and be a much smaller file.

Here is a trick to get the most out of such an image. Beware before you start: it's quite tricky to get right, but if you do, you will be able to save a high quality cover in a very small file. I assume that you have a background color with some text on it and maybe a picture.

Use the rectangular selection tool to select your picture. Right click and choose layer via cut. Whatever happens, we don't want to change anything in the picture. You might even want to extract it to a separate file and save it as JPEG 2000, as described for composite pages in the next section.

Use select>color range to select your foreground text. It takes some practice to get this right. Don't set the fuzziness too low. It is better to have a pixel too many than a pixel too few on your letters. After you get your selection, use the other selection tools to subtract from or add to the selection until it is just right, especially if the color range has selected some pixels here and there on the page that obviously don't belong to your text. Once you have exactly your text, use layer via cut again and call the layer foreground.

Assuming that there is nothing left on the background layer but the background color, double click the layer to make it editable. Ctrl-click on the thumbnail of the layer to select the pixels it contains. Next use filter>blur>average to turn it into one color. Deselect (ctrl-d). Use the color picker to identify the color and use edit>fill to fill the whole layer with this color. You can do this for any areas of a single color that come on top. Of course you won't fill the whole layer then, but often such areas are rectangular and you can easily make a selection to fill. This helps to get rid of black edges and other irregularities. Just make sure that the order of your layers runs from background to foreground as in your image. You can also use this to make sure that text is a clean 100% black.
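
The idea behind averaging a selection is simply the mean of all its pixels per channel, which removes paper grain and leaves one flat color. A minimal sketch with invented pixel values; how Photoshop rounds internally is not something this example claims to reproduce.

```python
def average_color(pixels):
    """Mean of a list of (r, g, b) pixels, rounded to whole 8-bit values."""
    count = len(pixels)
    if count == 0:
        raise ValueError("no pixels selected")
    return tuple(round(sum(p[channel] for p in pixels) / count)
                 for channel in range(3))


# Hypothetical background pixels: one cover color plus paper grain.
background = [(200, 30, 40), (204, 28, 44), (196, 33, 38), (204, 29, 42)]
flat = average_color(background)
print(flat)
```

A layer filled with this single color is what lets a palette-based format get away with very few colors.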

When you are done, use file>save for web and devices. Here you can pick png, either 8 bit or 24 bit, and play with the settings until you find the best quality/file size trade-off. Often 16 colors is more than enough. The only downside is that a png saved this way carries no resolution information, but we will take care of that in Acrobat.

Composite pages

If you are dealing with a B/W page with some illustrations, scan the page twice. On the B/W page, clear the areas where the illustrations are. In the color or gray scan, keep only your illustrations. Keep each of them as a separate file, and save them with names like “page10a, page10b”. This way they will end up in our pdf document, so we can stick them on top of our B/W page there. We do it this way because it is currently not possible to combine encodings in Photoshop. Keeping the pieces apart allows you to use the best compression format for each of them. It gives a more consistent look than scanning an entire page in color, and the file size will be much smaller.

If you have a book with a lot of illustrations, I found that the best way to scan them is by creating my photoshop actions and scanner settings first for both the B/W and for the illustrations. Next when scanning a page, you can keep your original on the flatbed in place with one hand, and switch scanner settings with the other hand. This way you can scan both times without having to go through the book twice. That saves considerable time.

How to create a pdf from your images

These days, documents are almost invariably stored as pdfs. There is another format called DjVu. You might want to check that out. Personally I have no experience with it, so I won't go into it.

The basic procedure of creating a pdf is really quite simple. Let's assume you only have full pages, so no composite ones, and only tiff and JPEG 2000 files and no png files. Open Acrobat, go to file>create pdf>from multiple files and choose add folder to add the folder where all your images are. Now you should see all your images, and of course they are in order since you named them alphabetically. Select Large file size and click next. Keep merge to one pdf. I always uncheck continue when errors happen, since I want to know if there are errors. Click create and save your pdf. Now go to advanced>pdf optimizer…

On the images pane, turn off downsampling everywhere. For color and gray images, retain the existing compression, but for monochrome (B/W) choose JBIG2. I prefer lossless, because, well, it is lossless, but it might be worth having a look at what difference in file size and quality you get with lossy compression. You can have a look at the other settings in the optimizer, but if you are just saving scans, this will be the most important setting. Save your file and grin. You should now have an awfully small high quality pdf.

Composite pages

If you have color/gray illustrations in your document, they will be in the pdf as separate pages. We still have to put them in the right place. Right-click on the toolbar to enable the advanced editing toolbar. Activate the TouchUp Object tool and click an illustration that needs to be put into place. Next choose edit>copy and open the pages panel at the left. Click on the page that needs to hold the illustration. Now go edit>paste and move your image to the right position. Now you can delete the original page of the illustration. Repeat the process for all other illustrations and save your file. Next see the previous paragraph for how to optimize the pdf.

Using png images

If you created png images in Photoshop, you'll notice that they look really big in your pdf. This is because the png carries no resolution information, so Acrobat does not know how big it really needs to be. To correct this you need to make a pdf document from a blank page. First open edit>preferences and make sure that the default page size for new documents is correct. If your page size is listed, you can use File>Create PDF>From blank page. If it is not listed there, I'm afraid you can't enter a custom one; instead, create an empty page in OpenOffice or another application and print it to a pdf with the right size. Now, from the pages pane of this new document, drag the empty page to the pages pane of your book. You can then copy and paste the image onto it as described in the last paragraph, and resize and move the image with the TouchUp Object tool until it fits on the page and is positioned correctly.
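
When resizing by hand, it helps to know what size the image is supposed to have on paper. That follows directly from the pixel dimensions and the resolution you scanned at. A small sketch, with a hypothetical function name and example numbers:

```python
def physical_size_mm(width_px, height_px, dpi):
    """Intended print size of an image scanned at the given resolution."""
    mm_per_inch = 25.4
    return (width_px / dpi * mm_per_inch, height_px / dpi * mm_per_inch)


# Hypothetical cover scanned at 600dpi, 4961 x 7016 pixels.
width_mm, height_mm = physical_size_mm(4961, 7016, 600)
print(f"Target size: {width_mm:.0f} x {height_mm:.0f} mm")
```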

Fixing existing pdf documents

Note: before reading this section, please read the section on scanning documents. A lot of concepts are introduced there and are assumed here. For example, how to choose whether to save your documents in B/W or in grayscale is explained above.

One thing that's often useful is fixing bad scans that come your way, especially if the content is valuable and you want to continue sharing it with other people, or you want to print it. It is hard to write a consistent guide about this, because most of the work depends on the specific condition of the document that you want to fix. We will assume that the problems are scans in grayscale or color that should have been B/W, too low a resolution, text shining through from the back of the pages, and rotated pages. Another common problem is scans with parts of pages missing, but obviously there is nothing you can do about that except wait for a better scan to come your way.

Open your document in Acrobat and from the menus choose Advanced>Document Processing>Export all images. Make sure you choose to save them as tiff, because that will be lossless. If all is good, you'll have one picture per page. If you get tons of small image fragments, then you are pretty much lost. I think some software saves pdfs in this way, and I don't think there is anything you can do then. If you are only or also fixing rotation, have a look at the section on scanning documents to see how to fix that. For other problems, read on.

Open your folder with images, and open a representative one in Photoshop. This means not the front cover, not a page with a photograph of the author, but a page that has exactly the problems that most of the pages you want to fix have. We will create an action that processes all the pages fully automatically. That is why this one needs to be representative. It's also good if it's a page that has some finer details that risk disappearing in processing, so you can check that before continuing with the rest of your document.

A good start is to have a look at image>mode to see whether you are dealing with color or grayscale. Further down in the image menu you find image size. The part that interests us is document size. Here you can see what resolution was used. Later we can change it here too. The actual size of the document might also be interesting if you want to bring it back to a sensible format like A4 for printing. However, this should be done carefully: you should usually have constrain proportions checked, only adjust the height, and use canvas size from the same menu to adjust the width. An alternative that I recommend more, when all your images are without page margins, is to leave them as they are and, once you have a pdf, print it from Acrobat to a pdf with the desired page settings, choosing center in the print dialog. You will then have a nice pdf in the format of your liking.

Now that we know what we have, close this box and open the actions window. Create a new action. Note that the red dot turns on, indicating that everything we do in Photoshop is being recorded. This means that if you make a mistake somewhere and need to go back or experiment, you should temporarily turn off recording and delete the redundant steps from the action.

Let's start. Open the image size dialog and change the drop-down box at the bottom to Bicubic Smoother and then the resolution to 600dpi. Now we are at the fuzzy part of this guide, because what needs to be done depends entirely on the actual document at hand. Turn off recording and experiment a bit. We will start with the end, because that is the one step that is not optional. Whatever processing you do, you should end with this. Go to the menu image>mode and choose Bitmap. This is B/W. Set the method drop-down box to 50% threshold and the output resolution to 600dpi. Click OK and compare the result to the original grayscale image. If you have the feeling you lost a lot of quality along the way, you might want to try some of the following extra processing steps before converting to Bitmap. I will sum up some of the useful features of Photoshop, but all the actual settings will depend on your image.

Image>adjustments>levels is an interesting tool for improving contrast. For example if the text from the back shines through, you might play with the sliders until the paper is white and the foreground is black. Remember the 50% threshold of the bitmap conversion. Maybe 50% was not the best cutoff, and with this tool you can manually adjust that before converting.
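
Why levels effectively moves the cutoff can be seen from a simplified model of the tool: a linear stretch between a black point and a white point, ignoring the midtone (gamma) slider. The function names and the gray values are assumptions for illustration only.

```python
def apply_levels(value, black_point, white_point):
    """Linear levels adjustment on one 8-bit gray value (no midtone/gamma)."""
    if value <= black_point:
        return 0
    if value >= white_point:
        return 255
    return round((value - black_point) * 255 / (white_point - black_point))


def effective_cutoff(black_point, white_point):
    """Original gray value that lands on the 50% threshold after levels."""
    return black_point + (white_point - black_point) / 2


# Hypothetical page: ink around 30, show-through around 110, paper around 225.
black_point, white_point = 20, 100
for original in (30, 110, 225):
    print(original, "->", apply_levels(original, black_point, white_point))
print("cutoff in original values:", effective_cutoff(black_point, white_point))
```

In this made-up case the show-through at 110 would fall on the black side of a plain 50% threshold, but after the levels adjustment it is pushed to white while the ink stays dark.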

On images which come from a low resolution like 150dpi, there is a powerful tool called smart sharpen in the filter menu. It's not a simple tool, but you might want to have a look in the Photoshop help, which explains the settings somewhat. The biggest thing to watch out for with smart sharpen is that you might lose the finer details if you set the radius too high. Beware, smart sharpen often takes a long time to process, especially on 600dpi images, and even more so if you are processing a 500 page book. You can probably easily go for supper while it's busy.

When you manage to create a good image, use the history window to go back, turn on recording and record all the steps. Choose save as from the file menu, create a new folder and save your file there as tiff, but don't change the filename (when we run the action automatically, it will retain the filenames, which is what we want). The compression type does not matter, and normally you don't have to change any of the settings in the save dialog. Now close the file, and then turn off recording. Go and find that file and delete it.

Now for the powerful stuff. Go to File>Automate>Batch. Here select the action you just recorded. Choose the folder with all the images and keep all the checkboxes unchecked, unless you get color profile warnings, in which case you might choose to suppress these. Keep destination on none. Press OK. Photoshop should now start to process your images. The folder you saved to in the action will now contain all processed files, including those that shouldn't have been processed, like color front and back covers. Delete those, process them manually and put them in the folder. On how to process these other pages and how to make a pdf from all these images, please see the section of this guide on how to create a pdf from your images.
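
To keep track of which files the batch handled wrongly, it can help to write the exceptions down before you start. This is just a bookkeeping sketch outside Photoshop, not part of the batch feature; the function name and the file names are hypothetical.

```python
def split_batch(filenames, manual_names):
    """Separate pages for the automatic action from those handled by hand."""
    manual = set(manual_names)
    automatic = sorted(name for name in filenames if name not in manual)
    by_hand = sorted(name for name in filenames if name in manual)
    return automatic, by_hand


# Hypothetical export: two color covers and three ordinary pages.
exported = ["page0002.tif", "cover_back.tif", "page0001.tif",
            "cover_front.tif", "page0003.tif"]
automatic, by_hand = split_batch(exported, ["cover_front.tif", "cover_back.tif"])
print("keep from batch:", automatic)
print("redo manually:  ", by_hand)
```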

written by najas


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License