Re: [DIYbio] Removing watermarks from pdfs (pdfparanoia)

On Wed, Feb 6, 2013 at 12:12 PM, Cathal Garvey wrote:
> For example, to remove a frontpage, you might need to "explode" the PDF
> into images, discard the first image, and recompress into a new PDF.

I don't recommend this method, because converting a pdf into images
throws away the text layer in most cases. You can delete entire pages
within the pdf format itself by removing the page's objects (including
its content "stream") and updating the xref table to match.
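
As a rough illustration of staying inside the pdf format, here is a toy
sketch of one part of that edit: dropping a page reference from a page
tree dictionary. It assumes an uncompressed, flat /Pages dictionary
whose /Kids array holds only leaf pages, and the drop_page helper is
hypothetical. A real file would still need the orphaned objects removed
and the xref offsets rebuilt afterwards.

```python
import re


def drop_page(pages_dict: bytes, index: int) -> bytes:
    """Remove the index-th entry from /Kids and decrement /Count.

    Toy assumption: pages_dict is the plain text of a flat /Pages
    dictionary whose /Kids entries are leaf pages like "3 0 R".
    """
    kids_match = re.search(rb"/Kids\s*\[(.*?)\]", pages_dict, re.S)
    count_match = re.search(rb"/Count\s+(\d+)", pages_dict)
    if kids_match is None or count_match is None:
        raise ValueError("not a /Pages dictionary")

    # Split the array into indirect references and drop the chosen one.
    kids = re.findall(rb"\d+\s+\d+\s+R", kids_match.group(1))
    del kids[index]

    new_kids = b"/Kids [" + b" ".join(kids) + b"]"
    new_count = b"/Count " + str(int(count_match.group(1)) - 1).encode()

    # Splice the new array in, then rewrite the page count.
    out = pages_dict[:kids_match.start()] + new_kids + pages_dict[kids_match.end():]
    return re.sub(rb"/Count\s+\d+", new_count, out, count=1)


pages = b"<< /Type /Pages /Kids [3 0 R 5 0 R 7 0 R] /Count 3 >>"
print(drop_page(pages, 0).decode())
```

Nothing here rasterizes the document, so the text on the remaining
pages stays selectable.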

> To remove text/images embedded on the bottom of each PDF page, you could
> do the same except use imagemagick on each image before recompression.

Most text in a pdf document is "semantic": it is stored as text
operators in pdf markup, and that markup can be deleted directly. I
can imagine one or two cases where a publisher adds an image
containing your IP address to the pdf, in which case you can delete
that single image. However, if the page content is itself an image (no
selectable text), the publisher might have drawn the watermark into
the page image, in which case the only way to remove it would be to
use imagemagick, as you say, and draw over the offending region. I
haven't seen this in any of the documents I have read over the years.
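
To make "directly deleted" concrete, here is a minimal sketch that drops
any BT ... ET text block containing a known watermark string. It assumes
the content stream has already been decompressed (most real streams are
Flate-compressed) and that the watermark appears as a literal string in
one text block. The strip_text_blocks name and the sample stream are
made up for illustration.

```python
import re

# A text object in a content stream runs from the BT operator to ET.
TEXT_BLOCK = re.compile(rb"BT\b.*?\bET\b", re.S)


def strip_text_blocks(content: bytes, needle: bytes) -> bytes:
    """Delete every BT ... ET block whose bytes contain needle."""

    def keep_or_drop(match):
        block = match.group(0)
        return b"" if needle in block else block

    return TEXT_BLOCK.sub(keep_or_drop, content)


# Hypothetical decompressed stream: body text plus a footer watermark.
stream = (
    b"BT /F1 12 Tf 72 700 Td (Results) Tj ET\n"
    b"BT /F1 8 Tf 72 20 Td (Downloaded by 10.0.0.1) Tj ET\n"
)
print(strip_text_blocks(stream, b"Downloaded by").decode())
```

This is blacklisting by pattern, so it only works for watermark strings
you already know about, and a string literal that happens to contain
"ET" as a separate word would confuse the regex.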

> Major disadvantage to this route is that it would convert a text +
> images PDF (high compression ratio, easy to extract text for re-use)
> into an images-only PDF (large file size, poor compression, impossible
> to extract text without OCR).

right..

> If you can extract text of course, you could try extracting text +
> images and perhaps script the creation of an entirely new PDF file. This
> is the opposite approach; instead of blacklisting content ("This bit
> contains IP address info"), you're whitelisting content ("These bits are
> the text and images that form the actual paper").

How would you whitelist content you've never seen before?

- Bryan
http://heybryan.org/
1 512 203 0507

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diybio@googlegroups.com. To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
