Faster Searching in Big PDFs

April 25, 2007 - 2:19pm

Does this sound familiar? You open a huge PDF and need to quickly find the page containing the topic you're interested in. (We'll assume this PDF has no bookmarks or linked TOC that will suffice.)

Our Google-ized instincts immediately reach for the Find (Command/Control-F) field to enter the word or phrase we're looking for. Acrobat (or Reader, doesn't matter) finds the first couple of instances in a reasonable amount of time, but soon it slows to a crawl as we click Find Next one too many times and it hits a dry patch. 

The little read-out says, "Searching 342 of 575 … 343 of 575 … 344 … 345 … 346 …" Two minutes later and we're still staring at the page progression, hypnotized, waiting for a hit: "517 of 575 … 518 … 519 … 520 …"

Agh! Snap out of it, man!

By choosing one little command in Acrobat Pro v8, you can put an end to this misery for yourself and for anyone else who wants instant finding or searching, even in the most massive of PDFs.

——-
Embed an Index
——-
Using Acrobat Pro, you can create a full-text index of the contents of a single PDF, similar to how Google indexes all the text in the pages of a web site, and (new to v8) embed it into the PDF. Then when you Find or Search, Acrobat or Reader searches the *index,* not the PDF. Since the index file is much smaller, operations are lightning-quick. And, since the index knows which page numbers its words appear on, the end result is the same.
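
To make the idea concrete, here is a toy sketch in Python. It is not Acrobat's actual index format (which this article doesn't go into); it only assumes that an index can be modeled as a map from each word to the pages it appears on. The function names and the three sample "pages" are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the sorted list of 1-based page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in tokenize(text):
            index[word].add(page_number)
    return {word: sorted(nums) for word, nums in index.items()}

def find_by_scan(pages, word):
    """Slow path: re-read every page on every search."""
    word = word.lower()
    return [n for n, text in enumerate(pages, start=1) if word in tokenize(text)]

def find_by_index(index, word):
    """Fast path: a single dictionary lookup; no page text is read."""
    return index.get(word.lower(), [])

# Hypothetical three-page document.
pages = [
    "Use the Blend tool to mix colors.",
    "Text frames and threading.",
    "Blend modes affect transparency.",
]
index = build_index(pages)
print(find_by_scan(pages, "blend"))   # [1, 3]
print(find_by_index(index, "blend"))  # [1, 3]
```

Both paths return the same page numbers; the difference is that the scan touches every page each time, while the lookup only touches the index that was built once up front.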

We've been able to create indexes in Acrobat Pro for many versions now, always using the Catalog command. PDF content providers typically index a folder full of PDFs so that a single Search (Command/Control-Shift-F) can hunt down the search text in a whole collection of PDFs. And I suppose you could use Catalog to create an index of a single PDF too, though I never bothered.

All that is still possible in Acrobat Pro 8, and the old ways of associating an index with a particular PDF still work.

But as I mentioned, Acrobat Pro 8 added a new twist: Indexes are embeddable in a PDF. Once they're embedded, you no longer have to keep track of the separate .pdx and .idx files generated for each PDF's index, making sure they always travel with the file. End users don't have to figure out how to tell Reader to use the index during Finds and Searches, since Reader 8 and Acrobat 8 automatically use it if it's embedded. (Earlier versions of Reader and Acrobat ignore the embedded index.)

Cool, huh? Best of all, it's dead-simple to do.

1. Open the PDF in Acrobat Pro 8 and choose Advanced > Document Processing > Manage Embedded Index.

2. The resulting dialog box will tell you that "the document does not contain an embedded index." Ignore that and click the Embed Index button.

3. An alert pops up, saying that Acrobat is about to 1) save and close the document; 2) build a search index for it; 3) embed the index; and 4) reopen the document. Click the OK button if you want to proceed … yes indeedy, you do, so click!

The PDF closes, and after a few seconds of watching a progress bar create the index, it opens right back up again.

——-
Before and After
——-
For my guinea pig test file, I downloaded the InDesign CS3 "full documentation" PDF from Adobe's web site:
http://www.adobe.com/support/documentation/en/indesign_incopy/

This puppy tips the scales at 46.35 MB and 762 pages. Whoa, mama!

Before I indexed it, I ran a search (Edit > Search) for the term "blend" and timed it. On my late-model Compaq, Acrobat Pro 8 took 24 seconds to display the 153 matches in its Results window.

After embedding the index (which added 2.8 MB to the file size) and purging the Search cache (see below) to keep things fair, I ran the same search. This time, Acrobat took about, oh, a nanosecond to display the same 153 matches. I had the same blink-of-an-eye results in Reader 8, on both platforms.
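
For a sense of what the index costs in file size, here is a quick back-of-the-envelope calculation in Python using only the two numbers reported above. Other PDFs will likely show a different ratio, so treat it as one data point rather than a rule.

```python
original_mb = 46.35   # size of the test PDF before indexing
index_mb = 2.8        # size added by the embedded index

total_mb = original_mb + index_mb
overhead_pct = index_mb / original_mb * 100

print(f"{total_mb:.2f} MB total, {overhead_pct:.1f}% larger")
# 49.15 MB total, 6.0% larger
```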

You can bet that from now on, I'll be routinely embedding indexes in all of the larger PDFs on my hard drive, especially all those software documentation ones I keep needing to find things in.

If you post large PDFs for your customers to download, like catalogues or periodicals, you might want to do the same.

——-
About that Search Cache
——-
Both Acrobat and Reader already do something similar when you're repeatedly hunting for terms in the same PDF. They cache the text and save it in a file so that subsequent Finds and Searches are fast. You can adjust the size of the cache, or purge it, in Preferences > Search.

But embedding an index in a PDF ensures that Finds and Searches are always fast in Reader 8 or Acrobat 8, regardless of the state of the user's cache, even if it's the first time they need to find something quickly.
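
The caching behavior described above can be pictured with a small Python sketch. This is only an assumption-laden illustration, not how Acrobat or Reader actually implement their cache: `TextCache` and `fake_extract` are hypothetical names, and the cache here lives in memory rather than in a file.

```python
class TextCache:
    """Toy cache: remembers extracted text per document so it is extracted once."""

    def __init__(self, extract):
        self._extract = extract   # function that does the slow text extraction
        self._store = {}          # document name -> list of page texts
        self.extractions = 0      # how many times the slow path has run

    def pages(self, name):
        if name not in self._store:
            self._store[name] = self._extract(name)
            self.extractions += 1
        return self._store[name]

    def search(self, name, term):
        """Return 1-based page numbers whose text contains the term."""
        term = term.lower()
        return [n for n, text in enumerate(self.pages(name), start=1)
                if term in text.lower()]

    def purge(self):
        """Forget everything, like purging the cache in Preferences."""
        self._store.clear()

def fake_extract(name):
    # Stand-in for real PDF text extraction; returns hypothetical page texts.
    return ["Blend tool basics", "Working with frames", "Blend modes"]

cache = TextCache(fake_extract)
cache.search("manual.pdf", "blend")    # first search: text is extracted
cache.search("manual.pdf", "frames")   # second search: served from the cache
print(cache.extractions)               # 1
cache.purge()
cache.search("manual.pdf", "blend")    # after a purge, extraction runs again
print(cache.extractions)               # 2
```

The sketch shows why the first search in a given PDF is the slow one, and why purging the cache puts you back at square one. An embedded index sidesteps that first-time cost because it travels inside the file.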

Comments

1 February 3, 2011 - 4:46am by Peter (not verified):

Ls,

Can I print an index page with page numbers, built from predefined words, and add that index page to a PDF document?

I'm familiar with the concordance option in Word.

gr.

Peter

2 October 24, 2011 - 9:34am by kodakcollector (not verified):

I have a 360 MB portfolio created with Acrobat 9 that is made up of 74 antique catalogues with some 4,200 pages. When I search my full-text index directly for 'kodak', which has some 16,000 instances, the search time is 6 seconds. When I embed the index into the portfolio and do the same search, we are talking 6 minutes.

Am I missing something? I made the cache bigger. Even after executing the same search several times, the search time does not decrease. Searches with fewer hits do better. I have also set the portfolio to read the created index, which also did not help.

Any ideas? I would be happy to send you the cd.

Thanks - charlie

3 October 28, 2011 - 10:06am by Gerrad (not verified):

I just use some Black Ops mods to modify and tweak some files and it works super fast now searching through my PDFs!

4 April 24, 2012 - 5:28am by Lara (not verified):

My Herbs book is 460 pages. Unfortunately I do not have Acrobat, so instead I used a program called PDF Index Generator to create my book index. It actually did a decent job :)

5 August 10, 2012 - 12:14pm by Brian (not verified):

I am experiencing problems with the index tool in Acrobat 9 Pro. I have created an index for a large catalog of .pdf files (multiple directories with about 700 files with over 20,000 pages indexed). After the index is created and I try to search the catalog, the application hangs and becomes unresponsive. Any suggestions on how to make the index function properly?

6 September 30, 2016 - 8:16am by tmeserve:

I am trying to index a PDF file that is 512,000 KB and 60,000+ pages. When I click Manage Embedded Index, it stalls at 79%. So I tried saving it as a reduced-size PDF, and that stalled too. I need to be able to index the file without chopping it in two. Any suggestions, or another program that could be used?
