Making PDFs searchable with ocrmypdf

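Some of the course materials I bought come as PDFs that, to deter piracy, were assembled from page images. Text in such a PDF can be neither copied nor searched, which makes looking up keywords in the material tedious, so I wanted to convert these files into searchable PDFs. I found a tool called ocrmypdf that does this and also supports Chinese; this post records how I use it. For details, see the official OCRmyPDF documentation.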

Installation

sudo apt install ocrmypdf
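
After installing, it can be worth confirming that the commands are actually on PATH before going further. The snippet below is a minimal sketch: missing_tools is a helper name made up for this post, and checking for tesseract as well assumes the OCR engine that ocrmypdf drives was installed alongside it.

```shell
# missing_tools CMD...: print each command that is not found on PATH.
# (hypothetical helper, not part of ocrmypdf)
missing_tools() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

missing=$(missing_tools ocrmypdf tesseract)
if [ -n "$missing" ]; then
    echo "not on PATH: $missing"
else
    # Both tools were found, so print the installed ocrmypdf version.
    ocrmypdf --version
fi
```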

Usage

Specifying the OCR language

Install the language pack

sudo apt install tesseract-ocr-chi-sim

Check that the installation succeeded

$ tesseract --list-langs
List of available languages (3):
chi_sim
eng
osd
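
The same check can be scripted rather than read by eye. This is a rough sketch, assuming the list looks like the output above with one language code per line; has_lang is a made-up helper name, and the 2>&1 is there only in case the list is written to stderr.

```shell
# has_lang CODE: read a language list on stdin and succeed only if CODE
# appears as a complete line (so chi_sim does not match chi_sim_vert).
# (hypothetical helper, not part of tesseract)
has_lang() {
    grep -qxF -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang chi_sim; then
        echo "chi_sim is available"
    else
        echo "chi_sim is missing; install tesseract-ocr-chi-sim"
    fi
fi
```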

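Note that the language name passed to -l is written with an underscore (chi_sim), not with the hyphen used in the package name (tesseract-ocr-chi-sim).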

ocrmypdf -l chi_sim input.pdf output.pdf
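
For the pack used here, the difference between the two names is mechanical: drop the tesseract-ocr- prefix from the package name and turn the remaining hyphens into underscores. The sketch below spells that out. pkg_to_lang is a made-up helper, and the rule is an assumption that fits tesseract-ocr-chi-sim; it is not guaranteed to hold for every package, so the tesseract --list-langs output remains the authority.

```shell
# pkg_to_lang PACKAGE: derive the -l language code from an apt package name.
# Assumption: the code is the package name minus "tesseract-ocr-",
# with hyphens replaced by underscores.
pkg_to_lang() {
    printf '%s\n' "${1#tesseract-ocr-}" | tr '-' '_'
}

pkg_to_lang tesseract-ocr-chi-sim   # prints chi_sim
```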

Output of a run on a 752-page PDF, this time with --redo-ocr added:

$ ocrmypdf -l chi_sim --redo-ocr input.pdf output.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [00:14<00:00, 51.36page/s]
Start processing 24 pages concurrently
33 redoing OCR
26 [tesseract] lots of diacritics - possibly poor OCR
54 [tesseract] lots of diacritics - possibly poor OCR
88 [tesseract] lots of diacritics - possibly poor OCR
119 [tesseract] lots of diacritics - possibly poor OCR
203 [tesseract] lots of diacritics - possibly poor OCR
256 [tesseract] lots of diacritics - possibly poor OCR
265 [tesseract] lots of diacritics - possibly poor OCR
347 [tesseract] lots of diacritics - possibly poor OCR
376 [tesseract] lots of diacritics - possibly poor OCR
383 [tesseract] lots of diacritics - possibly poor OCR
386 [tesseract] lots of diacritics - possibly poor OCR
402 [tesseract] lots of diacritics - possibly poor OCR
404 [tesseract] lots of diacritics - possibly poor OCR
403 [tesseract] lots of diacritics - possibly poor OCR
412 [tesseract] lots of diacritics - possibly poor OCR
415 [tesseract] lots of diacritics - possibly poor OCR
410 [tesseract] lots of diacritics - possibly poor OCR
439 [tesseract] lots of diacritics - possibly poor OCR
519 [tesseract] lots of diacritics - possibly poor OCR
526 [tesseract] lots of diacritics - possibly poor OCR
587 [tesseract] lots of diacritics - possibly poor OCR
591 [tesseract] lots of diacritics - possibly poor OCR
595 [tesseract] lots of diacritics - possibly poor OCR
607 [tesseract] lots of diacritics - possibly poor OCR
644 [tesseract] lots of diacritics - possibly poor OCR
661 [tesseract] lots of diacritics - possibly poor OCR
682 [tesseract] lots of diacritics - possibly poor OCR
720 [tesseract] lots of diacritics - possibly poor OCR
742 [tesseract] lots of diacritics - possibly poor OCR
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 752.0/752.0 [03:41<00:00, 3.40page/s]
Postprocessing...
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 752/752 [01:09<00:00, 10.80page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 756/756 [00:00<00:00, 920.21image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.20 savings: 17.0%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)
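
With this many warnings it helps to pull out just the affected pages so they can be spot-checked in the output file. The following is a small sketch under two assumptions: that the number at the start of each warning line is the page number, and that the log was saved to a file first. flagged_pages and ocr.log are names invented for this example.

```shell
# flagged_pages: read an ocrmypdf log on stdin and print, in ascending order,
# the leading number of every line that carries the diacritics warning.
# (hypothetical helper; assumes the leading number is the page number)
flagged_pages() {
    awk '/lots of diacritics/ { print $1 }' | sort -n
}

# Possible usage, capturing both output streams into ocr.log:
#   ocrmypdf -l chi_sim --redo-ocr input.pdf output.pdf 2>&1 | tee ocr.log
#   flagged_pages < ocr.log
```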

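The result is quite good: the page layout is preserved exactly as in the original. When searching, however, you may need to separate the characters of the query with spaces.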
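
Since there are several course PDFs to convert, the same command can be wrapped in a loop. This is only a sketch: ocr_name and ocr_all are helper names made up here, the _ocr.pdf suffix is an arbitrary convention, and chi_sim is hard-coded on the assumption that every file is in Chinese.

```shell
# ocr_name FILE.pdf: derive the output name by appending _ocr before .pdf.
ocr_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# ocr_all [DIR]: run ocrmypdf on every PDF in DIR (default: current directory),
# skipping files that already look like outputs of an earlier run.
ocr_all() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                     # the glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac    # skip earlier outputs
        ocrmypdf -l chi_sim "$f" "$(ocr_name "$f")" || echo "failed: $f" >&2
    done
}
```

Calling ocr_all with a directory path then processes the files one by one, reports any file that ocrmypdf rejects, and leaves the originals untouched.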