Python - urllib3: get text from docx using Tika server


I am using Python 3, urllib3, and tika-server-1.13 to extract text from different types of files. This is my Python code:

    import os
    import re

    def get_text(self, input_file_path, text_output_path, content_type):
        global config

        headers = util.make_headers()
        mime_type = contenttype.get_mime_type(content_type)
        if mime_type != '':
            headers['content-type'] = mime_type

        # Read the whole file and wrap it as a multipart form field.
        with open(input_file_path, "rb") as input_file:
            fields = {
                'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
            }

        retry_count = 0
        while retry_count < int(config.get("tika", "retriescount")):
            response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
            if response.status == 200:
                data = response.data.decode('utf-8')
                # Strip bracketed spans, then collapse runs of blank lines.
                text = re.sub(r"[\[][^\]]+[\]]", "", data)
                final_text = re.sub(r"(\n(\t\r )*\n)+", "\n\n", text)
                with open(text_output_path, "w+") as output_file:
                    output_file.write(final_text)
                break
            else:
                # Give up once the last allowed attempt has failed.
                if retry_count == (int(config.get("tika", "retriescount")) - 1):
                    return False
                retry_count += 1
        return True

This code works for HTML files, but when I try to parse text from docx files it doesn't work.

I get this error back from the server: HTTP error code 422: Unprocessable Entity

Following the tika-server documentation, I tried using curl to check whether it works that way:

    curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

And that worked.

The Tika server docs say:

    422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc

This is the correct mime-type (I also checked it with Tika's detect system), it's supported, and the file is not encrypted.

I believe this is related to how I upload the file to the Tika server. What am I doing wrong?

You're not uploading the data in the same way. --data-binary in curl uploads the binary data as is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data encoded message. On top of that, by setting the header yourself you're preventing urllib3 from setting the Content-Type header on the request that would let Tika understand it. Either stop updating headers['content-type'], or pass body=input_file.read() instead of fields.
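The body= option could look roughly like the sketch below. It assumes `pool` is a urllib3 connection pool pointed at the Tika server (for example `urllib3.HTTPConnectionPool("localhost", port=9998)`). The function name `extract_text` and the choice to raise on a non-200 status are illustrative, not taken from the original code.

```python
def extract_text(pool, input_file_path, mime_type):
    """PUT the raw file bytes to Tika's /tika endpoint and return the text.

    `pool` is assumed to be a urllib3 connection pool for the Tika server.
    """
    with open(input_file_path, "rb") as input_file:
        data = input_file.read()

    # Only send Content-Type when a mime type is known.
    headers = {"Content-Type": mime_type} if mime_type else {}

    # body= sends the bytes unmodified, like curl --data-binary;
    # fields= would wrap them in a multipart/form-data message instead.
    response = pool.request("PUT", "/tika", body=data, headers=headers)
    if response.status != 200:
        raise RuntimeError("Tika returned HTTP %d" % response.status)
    return response.data.decode("utf-8")

# Hypothetical usage, assuming a Tika server on localhost:9998:
#   import urllib3
#   pool = urllib3.HTTPConnectionPool("localhost", port=9998)
#   text = extract_text(pool, "test.docx",
#       "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
```

The retry loop and the regex cleanup from the question can be wrapped around this call unchanged. Only the way the request body is built differs.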

