Mayank Gupta
Sentence & Word-Level Processing in NLP

When processing text, words and sentences are the building blocks. However, raw text is often messy, with abbreviations, mixed languages, extra spaces, and encoding issues.

To make text usable for NLP models, we need sentence segmentation, abbreviation handling, language detection, encoding fixes, and whitespace cleanup.

Let's explore these step by step! 🚀


1️⃣ Sentence Segmentation – Splitting Text into Sentences

📌 Problem:

A paragraph is one big chunk of text, but NLP models work better when they understand individual sentences.

🔹 Example:

"Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!"

💡 Solution: Use spaCy or NLTK to split text into sentences (a spaCy sketch follows the NLTK example below).

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence tokenizer models

text = "Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!"
sentences = sent_tokenize(text)

print(sentences)

🔹 Output:

['Dr. Smith is a great doctor.', 'He works at AI Labs.', 'NLP is amazing!']

👉 Why is this useful?

  • Helps text summarization and question-answering systems.
  • Makes NLP more structured for better analysis.
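
Since spaCy was mentioned as an alternative, here's a minimal sketch using its small English pipeline. This assumes the model has been downloaded once with python -m spacy download en_core_web_sm; the model name is just the common default, not a requirement.

import spacy

# Load the small English pipeline (downloaded separately with
# `python -m spacy download en_core_web_sm`).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr. Smith is a great doctor. He works at AI Labs. NLP is amazing!")
sentences = [sent.text for sent in doc.sents]

print(sentences)

Like NLTK's Punkt tokenizer, spaCy's statistical pipeline should recognize that the period in "Dr." does not end a sentence.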

2️⃣ Handling Abbreviations – Expanding Short Forms

📌 Problem:

Abbreviations can cause confusion for NLP models:

  • "Dr." → "Doctor"
  • "AI" → "Artificial Intelligence"
  • "e.g." → "for example"

💡 Solution: Use custom dictionaries to expand abbreviations.

# Small custom dictionary mapping abbreviations to their full forms.
abbr_dict = {
    "Dr.": "Doctor",
    "AI": "Artificial Intelligence",
    "e.g.": "for example"
}

text = "Dr. Smith is an AI expert. e.g., he works on NLP."
# Plain substring replacement: quick for a demo, but see the regex variant
# further below for safer matching on word boundaries.
for abbr, full_form in abbr_dict.items():
    text = text.replace(abbr, full_form)

print(text)

🔹 Output:

Doctor Smith is an Artificial Intelligence expert. for example, he works on NLP.

👉 Why is this useful?

  • Makes text more understandable for models.
  • Essential for machine translation, chatbots, and search engines.
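
The simple loop above does plain substring replacement, so it can also rewrite matches inside longer words (for instance the "AI" at the start of "AIRBUS"). A slightly safer sketch builds one regex with word boundaries; the pattern and helper name are illustrative, not a standard recipe:

import re

abbr_dict = {
    "Dr.": "Doctor",
    "AI": "Artificial Intelligence",
    "e.g.": "for example"
}

# One alternation pattern: \b requires a word start before the abbreviation,
# and (?!\w) stops it from matching inside a longer word like "AIRBUS".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in abbr_dict) + r")(?!\w)"
)

def expand_abbreviations(text):
    # Single pass over the text, so an expansion is never expanded again.
    return pattern.sub(lambda m: abbr_dict[m.group(1)], text)

print(expand_abbreviations("Dr. Smith is an AI expert. e.g., he works on NLP."))
# Doctor Smith is an Artificial Intelligence expert. for example, he works on NLP.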

3️⃣ Language Detection – Identifying the Language

📌 Problem:

Social media and web text often mix multiple languages. An NLP pipeline needs to detect the language before it can apply the right processing.

🔹 Example:

"Bonjour, comment ça va?" → Detect as French (fr)

💡 Solution: Use langdetect to identify the language.

from langdetect import detect  # install with: pip install langdetect

text = "Bonjour, comment ça va?"
language = detect(text)

print(language)  # Output: 'fr' (French)

👉 Why is this useful?

  • Helps multilingual chatbots and translation systems.
  • Allows models to apply correct NLP preprocessing per language.
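
langdetect is probabilistic, so results on very short strings can vary between runs. Here's a small sketch that fixes the random seed and inspects the candidate probabilities (again assuming the package is installed with pip install langdetect):

from langdetect import DetectorFactory, detect_langs

# Fixing the seed makes the (probabilistic) detector reproducible.
DetectorFactory.seed = 0

for text in ["Bonjour, comment ça va?", "NLP is amazing!"]:
    candidates = detect_langs(text)   # e.g. [fr:0.999...]
    best = candidates[0]
    print(text, "->", best.lang, round(best.prob, 2))

For very short or mixed-language snippets, it can help to fall back to a default language when the top probability is low.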

4️⃣ Text Encoding – Converting Text into Machine-Readable Format

📌 Problem:

Text files arrive in different encodings (e.g., UTF-8 vs. ISO-8859-1). Special characters decoded with the wrong encoding turn into garbled text or raise errors later in the NLP pipeline.

💡 Solution: Convert text to UTF-8 encoding to handle special characters.

text = "Café, naïve, résumé, coöperate".encode("utf-8").decode("utf-8")
print(text)

🔹 Output:

Café, naïve, résumé, coöperate

👉 Why is this useful?

  • Prevents encoding errors in web scraping, PDFs, and multilingual text.
  • Ensures text is consistent across NLP pipelines.
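
The snippet above round-trips a string that is already valid Unicode. The trickier case is receiving raw bytes written with a different encoding. Here's a minimal decode-with-fallback sketch; the Latin-1 fallback is an assumption about the source, not a universal fix:

# Hypothetical byte stream that was actually written as ISO-8859-1 (Latin-1).
raw = "Café, naïve, résumé".encode("latin-1")

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Fall back to Latin-1, then standardize on UTF-8 for the rest of the pipeline.
    text = raw.decode("latin-1")

print(text)                   # Café, naïve, résumé
print(text.encode("utf-8"))   # consistent UTF-8 bytes for downstream storage

When the source encoding is unknown, libraries such as chardet or charset-normalizer can help guess it before decoding.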

5️⃣ Handling Whitespace Tokens – Removing Extra Spaces

📌 Problem:

Text data may have extra spaces, tabs, and newlines, making processing harder.

🔹 Example:

" NLP is amazing! "

⬇️

"NLP is amazing!"

💡 Solution: Use regex or strip() to clean spaces.

import re

text = "   NLP   is   amazing!    "
# Collapse runs of spaces, tabs, and newlines into a single space, then trim the ends.
clean_text = re.sub(r"\s+", " ", text).strip()

print(clean_text)

🔹 Output:

"NLP is amazing!"

👉 Why is this useful?

  • Ensures consistent formatting for tokenization.
  • Improves accuracy of NLP models.
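
To see why this matters for tokenization, compare naive splitting on a single space with splitting after cleanup (a tiny illustration, not a real tokenizer):

messy = "   NLP   is   amazing!    "

# Splitting on a single space keeps empty-string "tokens" from the extra spaces.
print(messy.split(" "))   # ['', '', '', 'NLP', '', '', 'is', ...]

# After the regex cleanup (or with .split(), which collapses whitespace itself),
# the tokens come out clean.
print(messy.split())      # ['NLP', 'is', 'amazing!']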

🚀 Wrapping Up: Sentence & Word-Level Processing for NLP

We’ve covered five essential preprocessing techniques to structure messy text:

Sentence Segmentation → Splits text into sentences.

Abbreviation Handling → Expands short forms for better understanding.

Language Detection → Identifies which language the text is in.

Text Encoding → Fixes encoding issues for smooth processing.

Whitespace Handling → Cleans up extra spaces for better tokenization.
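
As a rough sketch of how these steps might fit together in one helper (the function name, the ordering, and the error handling are illustrative; it assumes langdetect is installed and the NLTK punkt data from the first example has been downloaded):

import re
from langdetect import detect
from nltk.tokenize import sent_tokenize

abbr_dict = {"Dr.": "Doctor", "AI": "Artificial Intelligence", "e.g.": "for example"}

def preprocess(raw_bytes):
    # 1. Encoding: decode to Unicode, replacing undecodable bytes.
    text = raw_bytes.decode("utf-8", errors="replace")
    # 2. Whitespace: collapse runs of spaces, tabs, and newlines.
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Abbreviations: naive dictionary expansion (see the regex variant above).
    for abbr, full in abbr_dict.items():
        text = text.replace(abbr, full)
    # 4. Language detection: pick the right downstream models.
    lang = detect(text)
    # 5. Sentence segmentation.
    return lang, sent_tokenize(text)

raw = "  Dr. Smith is an AI expert.   NLP is amazing!  ".encode("utf-8")
print(preprocess(raw))
# e.g. ('en', ['Doctor Smith is an Artificial Intelligence expert.', 'NLP is amazing!'])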

🔹 Next Up: Advanced NLP—Topic Modeling, Named Entity Recognition (NER), and Dependency Parsing! 🚀
