Making PDFs searchable with ocrmypdf

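The PDF materials bundled with some courses I bought were built from page images to deter piracy, so their text can be neither copied nor searched. That makes looking up keywords in them painful, so I wanted to turn them into searchable PDFs. I found a tool called ocrmypdf that does exactly this and also supports Chinese. These are my notes on using it. For full details, see the official OCRmyPDF documentation.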

Installation

sudo apt install ocrmypdf
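
To confirm that the install worked, one option is to check that the command is on PATH and print its version. This is a minimal sketch: have_cmd is a helper name made up for this note, not something ocrmypdf ships.

```shell
# have_cmd NAME: succeed if NAME resolves to a command in the current shell.
# (hypothetical helper, not part of ocrmypdf)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd ocrmypdf; then
    # Print the installed version
    ocrmypdf --version
else
    echo "ocrmypdf not found on PATH" >&2
fi
```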

Usage

Specifying the OCR language

Install the language pack

sudo apt install tesseract-ocr-chi-sim

Check that it installed correctly

$ tesseract --list-langs
List of available languages (3):
chi_sim
eng
osd
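
The same check can be scripted instead of reading the list by eye. The sketch below assumes the listing format shown above (one language code per line) and uses a helper name, has_lang, invented for this note. It reads the listing from stdin, so the example feeds it the sample output rather than calling tesseract.

```shell
# has_lang LANG: succeed if LANG appears as its own line on stdin.
# stdin is expected to be the output of `tesseract --list-langs`.
# (hypothetical helper, not part of tesseract or ocrmypdf)
has_lang() {
    # -x matches whole lines only, -F treats the name literally, -q stays silent
    grep -qxF -- "$1"
}

# Sample listing copied from the output above; in real use, pipe in
# the output of `tesseract --list-langs` instead.
sample='List of available languages (3):
chi_sim
eng
osd'

if printf '%s\n' "$sample" | has_lang chi_sim; then
    echo "chi_sim is installed"
else
    echo "chi_sim is missing"
fi
```

In real use this would look like `tesseract --list-langs 2>&1 | has_lang chi_sim`. The stderr redirect is a precaution, on the assumption that some older Tesseract versions print the list there.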

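Note that the language name passed to -l uses an underscore (chi_sim), not the hyphen used in the apt package name (tesseract-ocr-chi-sim).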
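
The underscore-versus-hyphen difference can be captured in a small helper. This sketch assumes the package follows the tesseract-ocr-<lang> naming seen above, which may not hold for every package. pkg_to_lang is a name made up for illustration.

```shell
# pkg_to_lang PACKAGE: print the -l language code for a tesseract-ocr-* package.
# (hypothetical helper; assumes the tesseract-ocr-<lang> naming pattern)
pkg_to_lang() {
    # Drop the package prefix, then turn the remaining hyphens into underscores
    printf '%s\n' "${1#tesseract-ocr-}" | sed 's/-/_/g'
}

pkg_to_lang tesseract-ocr-chi-sim   # prints chi_sim
```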

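ocrmypdf -l chi_sim input.pdf output.pdf

The run below adds --redo-ocr, which tells ocrmypdf to redo OCR on pages that already carry an OCR text layer.

$ ocrmypdf -l chi_sim --redo-ocr input.pdf output.pdf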
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [00:14<00:00, 51.36page/s]
Start processing 24 pages concurrently
   33 redoing OCR                                                                                                                                                 
   26 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
   54 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
   88 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  119 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  203 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  256 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  265 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  347 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  376 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  383 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  386 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  402 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  404 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  403 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  412 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  415 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  410 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  439 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  519 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  526 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  587 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  591 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  595 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  607 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  644 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  661 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  682 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  720 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  742 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 752.0/752.0 [03:41<00:00,  3.40page/s]
Postprocessing...
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                              
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [01:09<00:00, 10.80page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 756/756 [00:00<00:00, 920.21image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.20 savings: 17.0%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

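The results are quite good: the page layout is left exactly as it was, though when searching you may need to put spaces between the characters of the search term.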