Machine Design

PDF searcher finds technical data

A search engine specifically designed to sift through information in PDF catalog pages targets technical users looking for product information.

The DirectIndustry portal search engine lists only manufacturers. A search by keyword returns a list of product images in order of keyword relevance. Company officials say this eases searching because the site prioritizes images of products and logos over pure text.

Developers at DirectIndustry.com say they have devised proprietary algorithms for converting ordinary PDF pages into versions that can be searched online. They use these techniques to field a search site specifically devoted to finding information in industrial catalogs.

According to company officials, three independent programs go into manipulating and pulling out data from catalog PDFs. The first divides a PDF of a catalog into PDFs of its individual pages. Each catalog page is then converted into a JPEG file so it can be viewed in an ordinary Web browser without starting up a PDF viewer. Simultaneously, another program extracts and organizes data from the original catalog PDF. This program tags each word in the resulting database with the number of the catalog page on which it was found. The tagging lets the search engine display the JPEG of the page that holds the information of interest.
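The word-tagging step can be pictured as a small inverted index. The sketch below is only an illustration, not DirectIndustry's proprietary method: it assumes the page text has already been extracted, and the names build_page_index, search, and jpeg_name, the file-naming pattern, and the sample catalog text are all hypothetical.

```python
import re
from collections import defaultdict


def build_page_index(pages):
    """Map each lowercase word to the set of 1-based page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return index


def search(index, keyword):
    """Return the sorted page numbers on which the keyword appears."""
    return sorted(index.get(keyword.lower(), set()))


def jpeg_name(catalog_id, page_number):
    """Hypothetical naming scheme for the JPEG rendered from one catalog page."""
    return f"{catalog_id}_page{page_number:04d}.jpg"


# Hypothetical extracted text, one string per catalog page.
sample_pages = [
    "Ball bearings, series 6200, bore 10 mm",
    "Linear actuators with 24 V motors",
    "Bearings for high-speed spindles",
]

index = build_page_index(sample_pages)
hits = search(index, "bearings")
print(hits)  # [1, 3]
print([jpeg_name("catalog42", page) for page in hits])
```

Because every word carries its page number, a keyword lookup leads straight to the page images to display, with no need to open the catalog PDF at search time.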

Finally, the original single-page PDFs are stored in parallel with their corresponding JPEG images. When a user wants a closer look at an area of the page, the site reverts to the PDF for the zoom-in. The same switch takes place when users select text or save the page locally.
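One way to picture the parallel storage is a record per page that holds both renderings, with the user's action deciding which one is served. This is a minimal sketch under assumed names: PageAssets, PDF_ACTIONS, asset_for, and the file paths are hypothetical and are not taken from the site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAssets:
    """Both stored renderings of one catalog page (paths are hypothetical)."""
    jpeg_path: str
    pdf_path: str


# Actions the article says fall back to the PDF version of the page.
PDF_ACTIONS = {"zoom", "select_text", "save"}


def asset_for(assets, action):
    """Serve the PDF for zooming, text selection, or saving; otherwise the JPEG."""
    if action in PDF_ACTIONS:
        return assets.pdf_path
    return assets.jpeg_path


page = PageAssets(
    jpeg_path="catalog42_page0003.jpg",
    pdf_path="catalog42_page0003.pdf",
)
print(asset_for(page, "view"))  # catalog42_page0003.jpg
print(asset_for(page, "zoom"))  # catalog42_page0003.pdf
```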

DirectIndustry says it takes 2 hr or less to render a typical PDF catalog into its searchable format. The service, called the Virtual Technical Library, now contains over 90,000 pages of technical information from over 3,400 PDFs, the company says. These come from 5,600 companies, about 26% of them based in the U.S.
