Re: [DIYbio] Removing watermarks from pdfs (pdfparanoia)

Very true; MAT only specialises in finding binary metadata, such as
which software made the file. To remove "text" metadata like embedded
IPs, identifying front pages, etc., you'd need to profile what, if
anything, a particular publisher does to their PDFs, and have a tool
that removes that data specifically.

For example, to remove a front page, you might need to "explode" the PDF
into images (one per page), discard the first image, and recompress the
rest into a new PDF.
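
A rough sketch of that explode/discard/recompress idea, assuming the
poppler tool pdftoppm and the img2pdf command-line tool are both
installed (neither is part of MAT; the function names and the default
DPI are made up for illustration):

```python
import glob
import re
import subprocess


def explode_cmd(src_pdf, prefix, first_page=2, dpi=150):
    """Build a pdftoppm command that rasterises pages from first_page onward."""
    return ["pdftoppm", "-f", str(first_page), "-r", str(dpi), "-png",
            src_pdf, prefix]


def rebuild_cmd(images, out_pdf):
    """Build an img2pdf command that packs the page images into one PDF."""
    return ["img2pdf", *images, "-o", out_pdf]


def page_number(path):
    """Extract the page number pdftoppm put just before the .png suffix."""
    return int(re.search(r"-(\d+)\.png$", path).group(1))


def drop_front_page(src_pdf, out_pdf, prefix="page"):
    # Starting at page 2 means the front page is never rendered at all.
    subprocess.run(explode_cmd(src_pdf, prefix), check=True)
    # Sort numerically so page 10 does not land before page 2.
    images = sorted(glob.glob(prefix + "-*.png"), key=page_number)
    subprocess.run(rebuild_cmd(images, out_pdf), check=True)
```

Something like drop_front_page("paper.pdf", "paper-clean.pdf") would
then write an images-only copy without the first page.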

To remove text/images embedded at the bottom of each PDF page, you could
do the same, except use ImageMagick on each image (to crop or blank out
that region) before recompression.
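
As a sketch of the per-image step, assuming ImageMagick is installed
and the watermark sits in a strip of known height at the bottom of each
page image (the 120-pixel default is a guess you would measure per
publisher; on ImageMagick 7 the binary may be "magick" rather than
"convert"):

```python
import subprocess


def chop_bottom_cmd(src_image, dst_image, rows=120):
    """Build an ImageMagick command that cuts `rows` pixel rows off the bottom."""
    return ["convert", src_image, "-gravity", "South",
            "-chop", "0x{}".format(rows), dst_image]


def chop_all(images, rows=120):
    """Trim the bottom strip from every page image, overwriting in place."""
    for path in images:
        subprocess.run(chop_bottom_cmd(path, path, rows), check=True)
```

Chopping makes every page slightly shorter; painting a white rectangle
over the strip instead would keep the page size but needs coordinates.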

The major disadvantage of this route is that it converts a text +
images PDF (high compression ratio, easy to extract text for re-use)
into an images-only PDF (large file size, poor compression, impossible
to extract text without OCR).

Of course, if you can extract the text, you could try extracting text +
images and scripting the creation of an entirely new PDF file. This is
the opposite approach: instead of blacklisting content ("This bit
contains IP address info"), you're whitelisting content ("These bits are
the text and images that form the actual paper").
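
The extraction half of that could look roughly like this, assuming the
poppler tools pdftotext and pdfimages are installed (the output names
are placeholders). It only pulls the raw material out; deciding which
pieces belong to the paper and building the new PDF from them is left
open, and any watermark text that is ordinary page text will still show
up in the extracted text for you to weed out:

```python
import subprocess


def text_cmd(src_pdf, out_txt):
    """Build a pdftotext command that keeps the rough page layout."""
    return ["pdftotext", "-layout", src_pdf, out_txt]


def images_cmd(src_pdf, prefix):
    """Build a pdfimages command that writes embedded images as PNG files."""
    return ["pdfimages", "-png", src_pdf, prefix]


def extract(src_pdf, out_txt="paper.txt", prefix="fig"):
    """Dump text and embedded images as inputs for a rebuilt PDF."""
    subprocess.run(text_cmd(src_pdf, out_txt), check=True)
    subprocess.run(images_cmd(src_pdf, prefix), check=True)
```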

On 06/02/13 14:29, Bjonnh wrote:
> On Wed, Feb 06, 2013 at 10:56:32AM +0000, Cathal Garvey wrote:
>> Check this one out: https://mat.boum.org/
>>
> This is not enough to remove metadata: there are sometimes white-text
> watermarks, classic text watermarks (like "downloaded from…"), and
> comment watermarks (PDF format comments, not the ones you see in Adobe
> products), which are invisible except when you open the file with
> a text/hex editor…
>
> I think the best way to find which kind of watermark you have is to
> compare two files from two different providers if it's possible.
>
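
Bjonnh's comparison idea can be sketched roughly as follows, assuming
you have the raw bytes of two copies of the same paper. It reports only
the lines that differ, which is where a per-download watermark would
have to live. The function name is made up, and since most PDFs
compress their page streams, a watermark inside a stream will show up
as a blob of differing binary rather than readable text:

```python
import difflib


def differing_lines(copy_a, copy_b):
    """Return (only_in_a, only_in_b): byte lines that differ between two copies."""
    lines_a = copy_a.splitlines()
    lines_b = copy_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    only_a, only_b = [], []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            # Collect the lines from each side of every non-matching region.
            only_a.extend(lines_a[a0:a1])
            only_b.extend(lines_b[b0:b1])
    return only_a, only_b
```

You would feed it open(path, "rb").read() for each copy and then eyeball
whatever comes back.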

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diybio@googlegroups.com. To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
