DTP-related

Extracting text from a PDF that has embedded text

This is done in the terminal (macOS Sierra).

--------------------------------------------------------------
pdftotext

Usage: pdftotext [options] <PDF-file> [<text-file>]
 -f <int>             : first page to convert
 -l <int>             : last page to convert
 -r <fp>              : resolution, in DPI (default is 72)
 -x <int>             : x-coordinate of the crop area top left corner
 -y <int>             : y-coordinate of the crop area top left corner
 -W <int>             : width of crop area in pixels (default is 0)
 -H <int>             : height of crop area in pixels (default is 0)
 -layout              : maintain original physical layout
 -fixed <fp>          : assume fixed-pitch (or tabular) text
 -raw                 : keep strings in content stream order
 -htmlmeta            : generate a simple HTML file, including the meta information
 -enc <string>        : output text encoding name
 -listenc             : list available encodings
 -eol <string>        : output end-of-line convention (unix, dos, or mac)
 -nopgbrk             : don't insert page breaks between pages
 -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
 -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
 -opw <string>        : owner password (for encrypted files)
 -upw <string>        : user password (for encrypted files)
 -q                   : don't print any messages or errors
 -v                   : print copyright and version info
 -h                   : print usage information
 -help                : print usage information
 --help               : print usage information
 -?                   : print usage information
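
Several of the options listed above can be combined in a single call. The following is only a minimal sketch: convert_layout is a hypothetical wrapper name (not part of pdftotext), the PDFTOTEXT variable is an illustration device so the command can be previewed with echo, and UTF-8 is assumed to appear in the output of pdftotext -listenc.

```shell
# convert_layout is a hypothetical helper, not part of pdftotext.
# Set PDFTOTEXT=echo to print the arguments instead of running the tool.
convert_layout() {
    # $1 = source PDF, $2 = output text file
    # -layout  : maintain the original physical layout
    # -enc     : output text encoding (assumed: UTF-8; check with -listenc)
    # -eol     : end-of-line convention (unix, dos, or mac)
    # -nopgbrk : do not insert page breaks between pages
    "${PDFTOTEXT:-pdftotext}" -layout -enc UTF-8 -eol unix -nopgbrk "$1" "$2"
}

# Example call (file names are placeholders):
# convert_layout inf.pdf out.txt
```

-layout and -raw select different text orderings, so this sketch uses only -layout; which one gives better results depends on the PDF.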

Convert only the first page to text

pdftotext -raw -f 1 -l 1 inf.pdf out

inf.pdf: the source PDF
out: the output text file; if omitted, the text is written to inf.txt
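
The single-page command above can be generalized to any page range. This is a rough sketch under stated assumptions: extract_pages is a hypothetical helper name, the PDFTOTEXT override exists only so the command can be previewed with echo, and the default output name imitates the inf.pdf to inf.txt behavior described above.

```shell
# extract_pages is a hypothetical helper, not part of pdftotext.
# Set PDFTOTEXT=echo to print the arguments instead of running the tool.
extract_pages() {
    # $1 = source PDF, $2 = first page, $3 = last page, $4 = optional output file
    pdf=$1
    first=$2
    last=$3
    # When no output file is given, swap the .pdf suffix for .txt
    out=${4:-"${pdf%.pdf}.txt"}
    "${PDFTOTEXT:-pdftotext}" -raw -f "$first" -l "$last" "$pdf" "$out"
}

# First page only, equivalent to the command above with the default output name:
# extract_pages inf.pdf 1 1
# Pages 2 through 3 into a named file:
# extract_pages inf.pdf 2 3 pages2-3.txt
```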

Last-modified: 2023-10-31 (Tue) 09:13:35