Re: [DIYbio] Removing watermarks from pdfs (pdfparanoia)

On Wed, Feb 06, 2013 at 06:12:44PM +0000, Cathal Garvey wrote:
> Very true; MAT only specialises in finding binary metadata like what
> software made the file, etc.; to remove "text" metadata like embedded
> IPs, identifying front pages, etc., you'd need to profile what, if
> anything, a particular publisher does to their PDFs, and have a
> tool that removes this data specifically.
>
> For example, to remove a frontpage, you might need to "explode" the PDF
> into images, discard the first image, and recompress into a new PDF.
With tools like Ghostscript, pdftk and many others, you can just
remove the page without any conversion.
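For example, here is a minimal sketch in Python using the pypdf
library (file names are placeholders) that drops a front page without
rasterizing anything, so text and compression are preserved:

    # drop the identifying front page, keep the remaining pages untouched
    # assumes pypdf is installed; "paper.pdf"/"clean.pdf" are placeholder names
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("paper.pdf")
    writer = PdfWriter()
    for page in reader.pages[1:]:  # skip page 0, the publisher cover sheet
        writer.add_page(page)
    with open("clean.pdf", "wb") as f:
        writer.write(f)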
>
> To remove text/images embedded on the bottom of each PDF page, you could
> do the same except use imagemagick on each image before
> recompression.
You can also convert the PDF to PostScript and use a script to strip
out the offending part.
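Staying in PDF space, a rough sketch of the same idea with pypdf is to
shrink each page's crop box so the bottom strip no longer renders (the
40-point strip height is a guess you would have to profile per
publisher). Beware: this only hides the watermark; the bytes stay in
the content stream, so it is not enough if you are really paranoid:

    # hide a bottom strip on every page by raising the crop box
    # assumes pypdf; the 40 pt strip height is publisher-specific guesswork
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("paper.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        box = page.cropbox
        box.lower_left = (box.left, box.bottom + 40)  # viewers clip this strip
        writer.add_page(page)
    with open("cropped.pdf", "wb") as f:
        writer.write(f)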
>
> Major disadvantage to this route is that it would convert a text +
> images PDF (high compression ratio, easy to extract text for re-use)
> into an images-only PDF (large file size, poor compression, impossible
> to extract text without OCR).
For sure!
>
> If you can extract text of course, you could try extracting text +
> images and perhaps script the creation of an entirely new PDF file. This
> is the opposite approach; instead of blacklisting content ("This bit
> contains IP address info"), you're whitelisting content ("These bits are
> the text and images that form the actual paper").
Hmm, interesting idea! But some publishing software (LaTeX, for
example) does strange things with sentence positioning, so you
couldn't get a good text flow without manual intervention. Maybe the
best way to do this would be to use the HTML version, but those exist
only for recent publications (and the old PDF-only papers are pure
bitmaps, sometimes with an OCRed text overlay).
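Still, as a rough sketch of the whitelisting route (assuming
pdfminer.six for extraction and reportlab for rebuilding; it
deliberately throws away images, fonts and layout, which is exactly
where the text-flow problem bites):

    # extract the raw text and pour it into a brand-new, metadata-free PDF
    # assumes pdfminer.six and reportlab; all layout and images are lost
    from pdfminer.high_level import extract_text
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    text = extract_text("paper.pdf")

    c = canvas.Canvas("rebuilt.pdf", pagesize=A4)
    width, height = A4
    y = height - 40
    for line in text.splitlines():
        if y < 40:              # crude page break
            c.showPage()
            y = height - 40
        c.drawString(40, y, line)
        y -= 12
    c.save()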
