I'm using pdfgrep to search for a string in multiple PDFs stored in a directory, and I'm storing the results in a file:
pdfgrep -r 'my string' > ../output-file
It prints the following output:
./file1.pdf: 91 string just_another_string 75 53 49 30 57 48 74 69
./file2.pdf: 8 string just_another_string 40
./file3.pdf: 92 string just_another_string 64 62 76 50 76 88 80 148
I'm getting many unnecessary whitespaces between the columns in each line of the output. I'd like to reformat the output so that these runs of whitespace are reduced to a single space between each column.
Is there a way to do this? Thanks in advance.
A quick and dirty way: use awk. Assuming the format looks like the sample above (and assuming your original command is correct):
pdfgrep -r 'my string' | awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9}' > ../output-file
Edit based on the comments:
@Inian's answer is better (since it handles arbitrary numbers of columns). In a nutshell, mine works by telling awk to split the input on whitespace and print it back out with a single space between each column. You can, for example, skip the first column by not including $1, or swap the 3rd and 4th columns by printing $4 $3.
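To make the field-selection trick concrete, here is a small sketch using a sample line shaped like the pdfgrep output above (the filename and numbers are just stand-ins):

```shell
# A sample line with the same shape as the pdfgrep output above
line='./file1.pdf:   91   string   just_another_string   75'

# Reprint all five fields with single spaces between them
echo "$line" | awk '{print $1, $2, $3, $4, $5}'

# Skip the first column by leaving out $1
echo "$line" | awk '{print $2, $3, $4, $5}'

# Swap the 3rd and 4th columns by printing $4 before $3
echo "$line" | awk '{print $1, $2, $4, $3, $5}'
```

The commas between fields matter: awk joins comma-separated print arguments with its output field separator (a single space by default), which is what collapses the whitespace.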
For efficiency, if you want to shove this into a database, you'll want to use Python (or Perl or PHP; a quick check of your profile should show your preference) for the SQL import. 500 PDFs doesn't faze me... I'd expect to get away with something like:
pdfgrep -r 'my string' > ../output-file
and then run a Python program that looks like this:
with open("output-file", "rt") as f:
    for line in f:
        data = line.split()          # now we have an array split on whitespace
        cleanline = " ".join(data)   # now each element has a single space between it and the next
        # or stick the data directly into a database; details omitted because there are way too many variables here
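The whitespace-squeezing step itself can be tried out standalone. A minimal sketch, using a sample line shaped like the pdfgrep output above (the helper name and sample values are my own, not from the thread):

```python
def squeeze_whitespace(line):
    """Collapse runs of whitespace into single spaces and strip the ends."""
    # str.split() with no argument splits on any run of whitespace,
    # so rejoining with " " yields exactly one space between columns.
    return " ".join(line.split())

# A sample line with the same shape as the pdfgrep output above
sample = "./file1.pdf:   91   string   just_another_string   75"
print(squeeze_whitespace(sample))
```

This also strips leading and trailing whitespace, which is usually what you want before inserting fields into a database.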