Re: [DIYbio] Removing watermarks from pdfs (pdfparanoia)

> Hmmm, interesting idea! But some publishing software (LaTeX works well
> for this) does strange things with sentence positions and so on. So
> you couldn't get a good text flow without manual intervention. Maybe
> the best way to do this would be to use the HTML version, but those
> exist only for recent publications (and old PDFs are pure bitmaps,
> sometimes with an OCRed text overlay).

Using the HTML version is something I hadn't even considered; excellent
idea. Far more malleable than PDF!

WRT old bitmapped PDFs, there's less to lose by converting them to images
and recompressing after a pass through imagemagick.

One reason I suggested exploding/recompressing is that by doing so, you
will naturally destroy lots of metadata that you might not otherwise have
realised was there. If you directly edit the file in a format with
compatible metadata, like postscript (is it compatible?), then the tools
might blindly copy metadata back and forth unless you know to say "No,
delete that", whereas the 'stupid' way, of just bitmapping, blanking out
a region if necessary, and recompressing, gives you an apparently "brand
new" PDF consisting only of dumb images.

It's bloated and ugly, but it's only going to have the sort of watermark
that you can see with your naked eye; very easy to see if something is
slipping through your net!
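
For what it's worth, here is a rough Python sketch of that explode / blank /
recompress route. It only builds the ImageMagick command lines and does not
run them. The function names, the output file names, the 8% footer strip and
the page size in pixels are all my own assumptions for illustration; you
would have to measure the real footer on a given publisher's pages, and
"convert" needs Ghostscript installed to read PDFs at all.

```python
def bottom_strip(width, height, fraction=0.08):
    """Pixel rectangle (x0, y0, x1, y1) covering the bottom `fraction` of a page.

    The 8% default is a guess, not a measured footer height.
    """
    y0 = height - int(height * fraction)
    return (0, y0, width - 1, height - 1)


def scrub_commands(src_pdf, n_pages, width, height, out_pdf="clean.pdf", dpi=150):
    """Build ImageMagick argv lists for the whole pipeline; nothing is executed."""
    # 1. Explode: one PNG per page, numbered page-000.png, page-001.png, ...
    cmds = [["convert", "-density", str(dpi), src_pdf, "page-%03d.png"]]

    x0, y0, x1, y1 = bottom_strip(width, height)
    kept = []
    # 2. Skip page 0 (assumed to be the publisher's front page) and paint a
    #    white rectangle over the footer strip of every remaining page.
    for i in range(1, n_pages):
        src, dst = f"page-{i:03d}.png", f"masked-{i:03d}.png"
        cmds.append(["convert", src, "-fill", "white",
                     "-draw", f"rectangle {x0},{y0} {x1},{y1}", dst])
        kept.append(dst)

    # 3. Recompress the masked images into a new, images-only PDF.
    cmds.append(["convert", *kept, out_pdf])
    return cmds


if __name__ == "__main__":
    # Hypothetical 3-page paper rasterised to 1240x1754 pixels per page.
    for argv in scrub_commands("paper.pdf", 3, 1240, 1754):
        print(" ".join(argv))
```

Each list can be handed to subprocess.run() once you are happy with the
rectangle. Worth checking the result with a metadata tool afterwards, since
the tools that write the new PDF will add fields of their own.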

On 06/02/13 18:15, Bjonnh wrote:
> On Wed, Feb 06, 2013 at 06:12:44PM +0000, Cathal Garvey wrote:
>> Very true; MAT only specialises in finding binary metadata like what
>> software made the file, etc.: to remove "text" metadata like embedded
>> IPs, identifying front-pages, etc, you'd need to profile what if
>> anything is done by a particular publisher to their PDFs, and have a
>> tool that removes this data specially.
>>
>> For example, to remove a frontpage, you might need to "explode" the PDF
>> into images, discard the first image, and recompress into a new PDF.
> With software like ghostscript, pdftext and many others, you can just
> remove the page without any conversion.
>>
>> To remove text/images embedded on the bottom of each PDF page, you could
>> do the same except use imagemagick on each image before
>> recompression.
> You can also do a conversion to postscript and use a script to remove the
> nasty part.
>>
>> Major disadvantage to this route is that it would convert a text +
>> images PDF (high compression ratio, easy to extract text for re-use)
>> into an images-only PDF (large file size, poor compression, impossible
>> to extract text without OCR).
> For sure !
>>
>> If you can extract text of course, you could try extracting text +
>> images and perhaps script the creation of an entirely new PDF file. This
>> is the opposite approach; instead of blacklisting content ("This bit
>> contains IP address info"), you're whitelisting content ("These bits are
>> the text and images that form the actual paper").
> Hmmm, interesting idea! But some publishing software (LaTeX works well
> for this) does strange things with sentence positions and so on. So
> you couldn't get a good text flow without manual intervention. Maybe
> the best way to do this would be to use the HTML version, but those
> exist only for recent publications (and old PDFs are pure bitmaps,
> sometimes with an OCRed text overlay).
>

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diybio@googlegroups.com. To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
