Applications of the Pytesseract Library in Natural Language Processing, with Examples
Pytesseract is a Python library built on the Tesseract OCR engine that extracts and recognizes text in images. In natural language processing, Pytesseract supports a variety of applications, including text extraction, optical character recognition, and image-to-text conversion. This article walks through these applications with example code.
1. Install Pytesseract and the Tesseract OCR engine
To use Pytesseract, you first need to install the Tesseract OCR engine. Windows users can download and install the appropriate version from https://github.com/ub-mannheim/tesseract/wiki and then add its installation path to the PATH environment variable. Linux users can install it through their package manager; on Ubuntu, for example:
sudo apt-get install tesseract-ocr
After the installation is complete, use pip to install Pytesseract:
pip install pytesseract
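Before running any OCR code, it can help to confirm that both pieces are in place. The following is a minimal sketch, not part of the original article: the helper names `find_tesseract` and `describe_setup` are hypothetical, and the Windows path in the comment is only an assumed default install location.

```python
import shutil


def find_tesseract():
    """Return the path of the `tesseract` executable on PATH, or None."""
    return shutil.which("tesseract")


def describe_setup():
    """Summarize the OCR setup as a short, human-readable string."""
    binary = find_tesseract()
    if binary is None:
        # On Windows the binary is often installed but not on PATH. In that
        # case pytesseract can be pointed at it explicitly (assumed path):
        # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
        return "tesseract executable not found on PATH"
    try:
        import pytesseract
    except ImportError:
        return f"tesseract found at {binary}, but pytesseract is not installed"
    # get_tesseract_version() asks the engine itself for its version
    return f"tesseract {pytesseract.get_tesseract_version()} at {binary}"


print(describe_setup())
```

If the message reports that the executable is missing, revisit the PATH step above before continuing.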
2. Text extraction
Below is a simple example showing how to use Pytesseract to extract the text from an image:
```python
import cv2
import pytesseract

# Read the image (cv2.imread returns None if the file cannot be read)
image = cv2.imread('example.png')

# Preprocess: convert to grayscale, then binarize with Otsu's threshold
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Recognize the text; 'chi_sim' selects the Simplified Chinese model
text = pytesseract.image_to_string(gray, lang='chi_sim')

# Print the result
print(text)
```
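This code first reads an image with the OpenCV library, converts it to grayscale, and binarizes it with Otsu's thresholding. It then calls pytesseract's `image_to_string()` function to recognize the text in the image, and finally prints the extracted text. Note that `lang='chi_sim'` selects the Simplified Chinese model, whose language data must be installed separately (for example, the `tesseract-ocr-chi-sim` package on Ubuntu).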
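Raw OCR output usually needs light cleanup before it is passed to downstream text processing. The sketch below is an illustration rather than part of the original workflow: `clean_ocr_text` is a hypothetical helper, and it assumes the output may contain stray whitespace, blank lines, and a trailing form-feed page separator, which Tesseract commonly appends.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Normalize whitespace in OCR output and drop empty lines."""
    cleaned_lines = []
    # Remove form-feed page separators before splitting into lines
    for line in raw.replace("\x0c", "").splitlines():
        # Collapse runs of spaces/tabs and trim both ends
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


sample = "Hello   world \n\n  foo\n\x0c"
print(repr(clean_ocr_text(sample)))  # prints 'Hello world\nfoo'
```

In practice the function would be applied to the string returned by `image_to_string()`, for example `clean_ocr_text(text)`.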
3. Optical character recognition
In addition to extracting blocks of text, Pytesseract can recognize individual characters in an image. The following example recognizes the characters in a CAPTCHA image:
```python
import cv2
import pytesseract

# Read the CAPTCHA image
image = cv2.imread('captcha.png')

# Preprocess: convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Recognize characters; --psm 10 treats the image as a single character
# (use --psm 7 or --psm 8 for a CAPTCHA with several characters)
text = pytesseract.image_to_string(gray, config='--psm 10')

# Print the result
print(text)
```
The code above first reads a CAPTCHA image and converts it to grayscale. It then calls pytesseract's `image_to_string()` function for character recognition. Here the `--psm 10` configuration option tells the Tesseract OCR engine to treat the image as a single character; for a CAPTCHA containing several characters, a page segmentation mode such as `--psm 7` (single text line) or `--psm 8` (single word) is usually more appropriate. Finally, the recognition result is printed.
4. Image-to-text conversion
Pytesseract can also be used to convert the text in an image into an editable text file. The following is an example:
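The `config` string can carry more than the page segmentation mode. As a hedged sketch, the hypothetical helper `build_config` below assembles a `--psm` value (Tesseract defines modes 0 through 13) together with the `tessedit_char_whitelist` variable, which restricts recognition to a given character set. How strictly the whitelist is honored depends on the Tesseract version and engine mode, so treat the result as a starting point.

```python
def build_config(psm, whitelist=None):
    """Build a Tesseract config string for pytesseract's `config` argument."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    parts = [f"--psm {psm}"]
    if whitelist:
        # -c sets a Tesseract config variable on the command line
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_config(7, "0123456789"))
# prints: --psm 7 -c tessedit_char_whitelist=0123456789
```

For a digits-only CAPTCHA, the string could then be passed as `pytesseract.image_to_string(gray, config=build_config(7, "0123456789"))`.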
```python
import cv2
import pytesseract

# Read the image
image = cv2.imread('example.png')

# Preprocess: convert to grayscale, then binarize with Otsu's threshold
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Recognize the text
text = pytesseract.image_to_string(gray, lang='chi_sim')

# Save the recognition result as a UTF-8 text file
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(text)
```
As in the text extraction example, this code preprocesses the image first and then uses Pytesseract to recognize the text. The difference is that the recognition result is saved to a text file instead of being printed.
These are several applications of the Pytesseract library in natural language processing. With this library, we can extract and recognize text and characters from images to support a range of text processing and analysis tasks.
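The same idea extends to a batch of images. The following sketch is an assumption-laden illustration, not the article's method: `ocr_to_text_files` is a hypothetical helper that takes the OCR step as a callable, so the file-handling logic stays independent of how recognition is performed.

```python
from pathlib import Path


def ocr_to_text_files(image_paths, ocr, out_dir):
    """Run `ocr` on each image path and write one UTF-8 .txt file per image.

    `ocr` is any callable that takes a Path and returns the recognized text.
    Returns the list of text files that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in image_paths:
        path = Path(path)
        # page1.png -> <out_dir>/page1.txt
        target = out / (path.stem + ".txt")
        target.write_text(ocr(path), encoding="utf-8")
        written.append(target)
    return written


# Example usage (requires pytesseract and the Tesseract engine):
# import pytesseract
# ocr_to_text_files(
#     Path("scans").glob("*.png"),
#     lambda p: pytesseract.image_to_string(str(p), lang="chi_sim"),
#     "ocr_output",
# )
```

The directory names `scans` and `ocr_output` are placeholders. Images that share a file stem would overwrite each other's output, so a real pipeline may need a more careful naming scheme.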