Making PDFs searchable with ocrmypdf

2023-09-19 | Category: Sharpen your tools before the job (工欲善其事必先利其器) | Tags: Efficiency, OCR, ocrmypdf

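Some of the course materials I bought come as PDFs. To deter piracy, the pages were converted to images before being assembled into the PDF, so the content can be neither copied nor searched, which makes looking up keywords in the material inconvenient. I wanted to turn these files into searchable PDFs and found a tool called ocrmypdf that does exactly that and also supports Chinese. This post records how I use it. For full details, see the official OCRmyPDF documentation.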

Installation

sudo apt install ocrmypdf

Usage

Specifying the OCR language

Install the language pack

sudo apt install tesseract-ocr-chi-sim

Check that it installed correctly

$ tesseract --list-langs
List of available languages (3):
chi_sim
eng
osd
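The same check can be scripted instead of read by eye. The following is a minimal sketch, not part of tesseract itself: has_lang is a hypothetical helper name, and it assumes the listing prints one language code per line as shown above.

```shell
# has_lang is a hypothetical helper: it succeeds if the language code given
# as $1 appears as a complete line on standard input.
has_lang() {
    grep -qxF -- "$1"
}

# Intended use (requires tesseract; 2>&1 is there because some versions
# print the listing on stderr):
#   tesseract --list-langs 2>&1 | has_lang chi_sim && echo "chi_sim is available"
```

Matching whole lines (-x) with a fixed string (-F) keeps chi_sim from matching a longer code that merely contains it.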

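Note that the language code passed to -l uses an underscore (chi_sim), whereas the apt package name uses hyphens (tesseract-ocr-chi-sim).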

ocrmypdf -l chi_sim input.pdf output.pdf
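The naming difference between the package and the -l argument can be written down as a small conversion. This is only a sketch: pkg_to_lang is a hypothetical helper, and the rule it encodes (strip the tesseract-ocr- prefix, turn hyphens into underscores) is an assumption that holds for the packages used in this post; other packages may not follow it.

```shell
# Hypothetical helper: map a Tesseract apt package name to the code passed to -l.
# Assumes the "tesseract-ocr-<lang>" naming seen above.
pkg_to_lang() {
    name=${1#tesseract-ocr-}              # drop the package prefix
    printf '%s\n' "$name" | tr '-' '_'    # hyphens become underscores
}

pkg_to_lang tesseract-ocr-chi-sim   # prints chi_sim
```

The result should still be checked against the output of tesseract --list-langs before it is passed to ocrmypdf.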

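A run on a 752-page file, this time with --redo-ocr, which replaces any OCR text layer already present in the input:

$ ocrmypdf -l chi_sim --redo-ocr input.pdf output.pdf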
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [00:14<00:00, 51.36page/s]
Start processing 24 pages concurrently
   33 redoing OCR                                                                                                                                                 
   26 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
   54 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
   88 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  119 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  203 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  256 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  265 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  347 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  376 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  383 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  386 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  402 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  404 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  403 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  412 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  415 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  410 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  439 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  519 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  526 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  587 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  591 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  595 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  607 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  644 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  661 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  682 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  720 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
  742 [tesseract] lots of diacritics - possibly poor OCR                                                                                                          
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 752.0/752.0 [03:41<00:00,  3.40page/s]
Postprocessing...
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                              
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [01:09<00:00, 10.80page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 756/756 [00:00<00:00, 920.21image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.20 savings: 17.0%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

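The result is quite good: the page layout is preserved exactly as in the original. When searching, though, you may need to separate the characters of the search term with spaces.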
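Since course materials usually arrive as several PDFs, the same command could be applied to a whole directory. The following is a rough sketch under stated assumptions: ocr_output_name and ocr_all_pdfs are hypothetical helper names, the -ocr.pdf suffix is an arbitrary choice, and every file is assumed to need the chi_sim language.

```shell
# Hypothetical helper: derive an output name such as "notes-ocr.pdf" from "notes.pdf".
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run ocrmypdf on every PDF in the directory given as $1
# (defaults to the current directory).
ocr_all_pdfs() {
    dir=${1:-.}
    for input in "$dir"/*.pdf; do
        [ -e "$input" ] || continue                    # the glob matched nothing
        case $input in *-ocr.pdf) continue ;; esac     # skip output of an earlier run
        ocrmypdf -l chi_sim "$input" "$(ocr_output_name "$input")"
    done
}

# Intended use (requires ocrmypdf and the chi_sim language pack):
#   ocr_all_pdfs ~/courses
```

Writing to a new file name instead of overwriting the input keeps the original PDF available in case the OCR result is poor.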

Permalink: https://lifeislife.cn/2023/09/19/ocrmypdf-让PDF可搜索/

Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0 CN. Please credit the source when republishing.

Author: 夜云泊, software engineer