A case study in PDF forensics: The Epstein PDFs (pdfa.org)
407 points by DuffJohnson 5 days ago | 237 comments




I found this part interesting:

There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.


GNOME Desktop users can put this in an executable Bash script in ~/.local/share/nautilus/scripts/ for more convincing-looking fake PDF scans, accessible from the right-click menu. I don't recall where I originally copied it from, so I can't give proper credit; thanks, random internet person (probably on Stack Exchange). It works perfectly.

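  # Meant to be a random sign plus a random angle; note that -rotate below
  # uses a fixed value, so this variable is never actually applied.
  ROTATION=$(shuf -n 1 -e '-' '')$(shuf -n 1 -e $(seq 0.05 .5))

  for pdf in "$@"; do
    # Rasterize at 150 DPI, stretch contrast, skew slightly, add faint
    # noise, convert to grayscale, and write <name>-fakescan.<ext>.
    magick -density 150 "$pdf" \
           -linear-stretch '1.5%x2%' \
           -rotate 0.4 \
           -attenuate '0.01' \
           +noise Multiplicative \
           -colorspace 'gray' \
           "${pdf%.*}-fakescan.${pdf##*.}"
  done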

That seq is probably supposed to be $(seq 0.05 0.05 0.5). With only two arguments the step defaults to 1, so right now it always yields 0.05.
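
To make the difference concrete, here is a rough sketch comparing the two forms of seq, assuming GNU coreutils (seq and shuf). The variable names two_arg and three_arg are just for illustration, and the C locale is forced so the decimal separator is a dot.

```bash
#!/usr/bin/env bash
# Force the C locale so seq prints a dot as the decimal separator.
export LC_ALL=C

# Two arguments mean FIRST and LAST with an implicit step of 1,
# so only the first value fits inside the range.
two_arg=$(seq 0.05 .5)

# Three arguments mean FIRST, INCREMENT, LAST.
three_arg=$(seq 0.05 0.05 0.5)

# Random sign plus a random magnitude drawn from the three-argument list.
# $three_arg is left unquoted on purpose so each value becomes its own argument.
ROTATION=$(shuf -n 1 -e '-' '')$(shuf -n 1 -e $three_arg)

echo "two-argument form:   $two_arg"
echo "three-argument form: $(echo $three_arg)"
echo "picked rotation:     $ROTATION"
```

The picked value only affects the output if it is actually passed to the rotate step, for example as -rotate "$ROTATION".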

Note that you can get random numbers straight from bash with $RANDOM. It's 15-bit (0 to 32767) but good enough here; this would give a value between 0.05 and 0.5: $(printf "0.%.4d\n" $((500 + RANDOM % 4501)))
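
Wrapped up as a function, the $RANDOM approach might look like the sketch below. The name random_rotation is made up for this example, and the random sign is my assumption about what the original shuf call with '-' and '' was going for.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a rotation angle between 0.0500 and 0.5000
# degrees with a random sign, using only bash builtins.
random_rotation() {
  local sign=""
  # RANDOM % 2 is 0 or 1; use it as a coin flip for the sign.
  if (( RANDOM % 2 )); then
    sign="-"
  fi
  # 500..5000 ten-thousandths of a degree, zero-padded to four digits.
  printf '%s0.%04d\n' "$sign" $(( 500 + RANDOM % 4501 ))
}

# Example: print a few candidate angles.
for _ in 1 2 3; do
  random_rotation
done
```

Calling it inside the loop, e.g. -rotate "$(random_rotation)", would give each PDF its own skew. The modulo introduces a slight bias toward smaller values, which should not matter for this purpose.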