When processing text, words and sentences are the building blocks. However, raw text is often messy, with abbreviations, mixed languages, extra spaces, and encoding issues.
To make text usable for NLP models, we need sentence segmentation, abbreviation handling, language detection, encoding fixes, and whitespace cleanup.
Let's explore these step by step! 🚀
1️⃣ Sentence Segmentation – Splitting Text into Sentences
📌 Problem:
A paragraph is one big chunk of text, but NLP models work better when they understand individual sentences.
🔹 Example:
"Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!"
💡 Solution: Use NLTK or spaCy to split text into sentences (an NLTK example follows; a spaCy sketch appears at the end of this section).
import nltk
nltk.download("punkt")  # sentence-tokenizer models (on NLTK 3.9+, download "punkt_tab" instead)

from nltk.tokenize import sent_tokenize

text = "Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!"
sentences = sent_tokenize(text)
print(sentences)
🔹 Output:
['Dr. Smith is a great doctor.', 'He works at AI Labs.', 'NLP is amazing!']
👉 Why is this useful?
- Helps text summarization and question-answering systems.
- Makes NLP more structured for better analysis.
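Since spaCy was mentioned as an alternative, here is a minimal equivalent sketch. It assumes the small English model is installed (python -m spacy download en_core_web_sm):
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded
doc = nlp("Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!")
print([sent.text for sent in doc.sents])  # doc.sents yields one Span per sentence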
2️⃣ Handling Abbreviations – Expanding Short Forms
📌 Problem:
Abbreviations can cause confusion for NLP models:
- "Dr." → "Doctor"
- "AI" → "Artificial Intelligence"
- "e.g." → "for example"
💡 Solution: Use custom dictionaries to expand abbreviations.
abbr_dict = {
    "Dr.": "Doctor",
    "AI": "Artificial Intelligence",
    "e.g.": "for example",
}

text = "Dr. Smith is an AI expert. e.g., he works on NLP."

# Naive substring replacement: simple, but it can also match inside longer
# words (e.g., "AI" inside "SAID"); a safer regex variant follows this section.
for abbr, full_form in abbr_dict.items():
    text = text.replace(abbr, full_form)

print(text)
🔹 Output:
Doctor Smith is an Artificial Intelligence expert. for example, he works on NLP.
👉 Why is this useful?
- Makes text more understandable for models.
- Essential for machine translation, chatbots, and search engines.
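Because the replace() loop above can match abbreviations inside longer words, a regex with word-boundary lookarounds is safer when that matters for your data. A minimal sketch, reusing the same abbr_dict:
import re

abbr_dict = {"Dr.": "Doctor", "AI": "Artificial Intelligence", "e.g.": "for example"}
text = "Dr. Smith is an AI expert. e.g., he works on NLP."

# The lookarounds stop "AI" from matching inside longer tokens such as "SAID".
pattern = re.compile(
    "(?<!\\w)(" + "|".join(re.escape(a) for a in abbr_dict) + ")(?!\\w)"
)
print(pattern.sub(lambda m: abbr_dict[m.group(1)], text))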
3️⃣ Language Detection – Identifying the Language
📌 Problem:
Social media and web text often mix languages. An NLP pipeline needs to know which language it is dealing with before it can preprocess the text correctly.
🔹 Example:
"Bonjour, comment ça va?"
→ Detect as French (fr)
💡 Solution: Use langdetect to identify the language.
from langdetect import detect
text = "Bonjour, comment ça va?"
language = detect(text)
print(language) # Output: 'fr' (French)
👉 Why is this useful?
- Helps multilingual chatbots and translation systems.
- Allows models to apply correct NLP preprocessing per language.
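One caveat worth knowing: langdetect is non-deterministic on short or ambiguous text unless you pin its seed. A small sketch that also surfaces candidate probabilities via detect_langs:
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # pin the seed so results are reproducible across runs

text = "Bonjour, comment ça va?"
print(detect_langs(text))  # e.g. [fr:0.9999...] (a ranked list of candidates)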
4️⃣ Text Encoding – Converting Text into Machine-Readable Format
📌 Problem:
Text files come in different encodings (e.g., UTF-8 vs. ISO-8859-1). If the encoding is not handled correctly, special characters get garbled or raise errors in NLP pipelines.
💡 Solution: Decode bytes with the codec they were actually written in, then work with UTF-8 everywhere.
# Bytes as they might arrive from a file saved as Latin-1 (ISO-8859-1)
raw = "Café, naïve, résumé, coöperate".encode("latin-1")
text = raw.decode("latin-1")  # decode with the matching codec, not a guess
print(text)
🔹 Output:
Café, naïve, résumé, coöperate
👉 Why is this useful?
- Prevents encoding errors in web scraping, PDFs, and multilingual text.
- Ensures text is consistent across NLP pipelines.
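In practice, encoding problems usually surface when reading files. A minimal sketch (the file name is a placeholder): passing an explicit encoding avoids platform-dependent defaults, and errors="replace" substitutes undecodable bytes instead of crashing.
# "reviews.txt" is a hypothetical file for illustration.
with open("reviews.txt", encoding="utf-8", errors="replace") as f:
    text = f.read()  # undecodable bytes become U+FFFD (�) rather than raising an error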
5️⃣ Handling Whitespace Tokens – Removing Extra Spaces
📌 Problem:
Text data may have extra spaces, tabs, and newlines, making processing harder.
🔹 Example:
" NLP is amazing! "
⬇️
"NLP is amazing!"
💡 Solution: Use regex or strip() to clean spaces.
import re

text = "   NLP   is amazing!   "
# Collapse every run of whitespace (spaces, tabs, newlines) to one space,
# then strip the leading/trailing space.
clean_text = re.sub(r"\s+", " ", text).strip()
print(clean_text)
🔹 Output:
"NLP is amazing!"
👉 Why is this useful?
- Ensures consistent formatting for tokenization.
- Improves accuracy of NLP models.
🚀 Wrapping Up: Sentence & Word-Level Processing for NLP
We’ve covered five essential preprocessing techniques to structure messy text:
✅ Sentence Segmentation → Splits text into sentences.
✅ Abbreviation Handling → Expands short forms for better understanding.
✅ Language Detection → Identifies which language the text is in.
✅ Text Encoding → Fixes encoding issues for smooth processing.
✅ Whitespace Handling → Cleans up extra spaces for better tokenization.
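To see how the pieces fit together, here is a minimal end-to-end sketch under the same assumptions as the snippets above (punkt downloaded, langdetect installed; encoding is handled when the file is read). The abbreviation list is a stand-in for your own:
import re
from langdetect import detect
from nltk.tokenize import sent_tokenize

ABBREVIATIONS = {"Dr.": "Doctor", "AI": "Artificial Intelligence"}  # extend as needed

def preprocess(text):
    text = re.sub(r"\s+", " ", text).strip()      # 5. whitespace cleanup
    for abbr, full in ABBREVIATIONS.items():      # 2. expand abbreviations
        text = text.replace(abbr, full)
    return detect(text), sent_tokenize(text)      # 3. detect language, 1. split sentences

lang, sentences = preprocess("  Dr. Smith works at AI Labs.  NLP is amazing!  ")
print(lang, sentences)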
🔹 Next Up: Advanced NLP—Topic Modeling, Named Entity Recognition (NER), and Dependency Parsing! 🚀