Search string using pdfgrep and format output


I'm using pdfgrep to search for a name in multiple PDFs stored in a directory, and I'm storing the results in a file:

pdfgrep -r 'my string' > ../output-file

It prints the following output:

./file1.pdf:     91   string                               just_another_string                   75              53            49            30              57               48                74             69
./file2.pdf:     8    string                                just_another_string                                                              40
./file3.pdf:     92 string                                  just_another_string                   64              62            76             50           76            88             80             148

I'm getting many unnecessary whitespaces between the columns on each line of the output. I'd like to reformat the output so that these multiple whitespaces are reduced to a single whitespace between each column.

Is there a way to do this? Thanks in advance.

Quick and dirty way: use awk. Assuming the format looks like that (and assuming the original command is correct):

pdfgrep -r 'my string' | awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9}' > ../output-file

Edit based on comments:

@Inian's answer is better (since it handles arbitrary numbers of columns), but in a nutshell, what mine is doing is telling awk to split the input on whitespace and print out a single space between each column (so you could, for example, skip the first column by not including $1, or swap the 3rd and 4th columns by printing $4 $3).
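For illustration only (this is not part of either answer), the same idea can be sketched in Python reading the pdfgrep output from standard input; the column indexes and the script name used later are made up:

import sys

for line in sys.stdin:
    fields = line.split()                  # split on any run of whitespace, like awk does
    print(" ".join(fields))                # every column, with a single space between each
    # skip the first column (the awk equivalent of leaving out $1):
    #   print(" ".join(fields[1:]))
    # swap the 3rd and 4th columns (the awk equivalent of printing $4 $3):
    #   fields[2], fields[3] = fields[3], fields[2]

You would use it in place of the awk stage, e.g. pdfgrep -r 'my string' | python3 squeeze.py > ../output-file (where squeeze.py is a made-up name for the script above).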

For efficiency, if you want to shove this into a database, you'll want to use Python (or Perl or PHP; a quick check of my profile should show my preference) for the SQL importing. 500 PDFs doesn't faze me... I'd expect to get away with something like:

pdfgrep -r 'my string' > ../output-file 

and then run a Python program that looks something like:

import sys

with open("output-file", "rt") as f:
    for line in f:
        data = line.split()            # now we have an array split on whitespace
        cleanline = " ".join(data)     # now each element has a single space between it and the next
        # or stick the data directly into a database; details omitted because there are way too many variables here
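Since the database step is left out above, here is a minimal sketch of it using Python's built-in sqlite3 module; the database file name, table name, and column layout are all hypothetical and would need to match your actual schema:

import sqlite3

conn = sqlite3.connect("results.db")            # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS matches (filename TEXT, line TEXT)")
with open("output-file", "rt") as f:
    for line in f:
        data = line.split()                     # split on any run of whitespace
        if not data:
            continue                            # skip blank lines
        filename = data[0].rstrip(":")          # pdfgrep prefixes each match with "file.pdf:"
        rest = " ".join(data[1:])               # the matched text, single-spaced
        conn.execute("INSERT INTO matches (filename, line) VALUES (?, ?)", (filename, rest))
conn.commit()
conn.close()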
