I am using Python 3, urllib3, and tika-server-1.13 in order to get text from different types of files. This is my Python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = contenttype.get_mime_type(content_type)
    if mime_type != '':
        headers['content-type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("tika", "retriescount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            # Strip bracketed fragments, then collapse runs of blank lines.
            text = re.sub(r"[\[][^\]]+[\]]", "", data)
            final_text = re.sub(r"(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            # Give up once the last allowed retry has failed.
            if retry_count == (int(config.get("tika", "retriescount")) - 1):
                return False
            retry_count += 1
    return True
This code works for HTML files, but when I try to parse text from DOCX files it doesn't work.
The server returns HTTP error code 422: Unprocessable Entity.
Following the tika-server documentation, I tried using curl to check whether it works with it:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
The tika-server docs say:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct mime-type (I also checked it with Tika's detect system), it is supported, and the file is not encrypted.
I believe this is related to how I upload the file to the Tika server. What am I doing wrong?
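You're not uploading the data in the same way. --data-binary in curl uploads the binary data as is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by overriding the header yourself you're preventing urllib3 from setting the multipart content type on the request, so Tika cannot understand the body: it is told the body is a DOCX file but receives a multipart envelope. Either stop updating headers['content-type'] or pass body=input_file.read() instead of fields.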
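As a minimal sketch of the second option: the helper below sends the file bytes as the raw request body, which is what curl's --data-binary does. The function name put_raw_file is made up for illustration, and pool is assumed to be a urllib3 connection pool pointed at the Tika server (for example urllib3.HTTPConnectionPool("localhost", 9998)); the retry loop and text clean-up from the question are left out.

```python
def put_raw_file(pool, input_file_path, mime_type, url="/tika"):
    """PUT a file as a raw request body, like `curl --data-binary`.

    `pool` is assumed to be a urllib3 connection pool (hypothetical setup:
    urllib3.HTTPConnectionPool("localhost", 9998)).
    """
    headers = {}
    if mime_type:
        # Safe to set here: the body really is this type, not a multipart envelope.
        headers["Content-Type"] = mime_type

    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()

    # body= sends the bytes unchanged; fields= would wrap them in multipart/form-data.
    return pool.request("PUT", url, headers=headers, body=payload)
```

With this in place, the response can be handled as in the question: check response.status == 200 and decode response.data.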