How to Apply Sentiment Analysis to Longer Text

#python #datascience #machinelearning #tutorial

Sentiment Analysis With Long Sequences | Towards Data Science

James Briggs ・ Mar 10, 2021 ・ 10 min read
towardsdatascience.com

The de facto standard in many natural language processing (NLP) tasks nowadays is to use a transformer. Text generation? Transformer. Question answering? Transformer. Language classification? Transformer!

However, many of these models share a problem, and it is not restricted to transformers: they cannot process long pieces of text.

Almost every article I write on Medium contains 1000+ words, which, when tokenized for a transformer model like BERT, will produce 1000+ tokens. BERT (and many other transformer models) will consume 512 tokens at most, so anything beyond this length is truncated.
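To make the limit concrete, here is a minimal sketch in plain Python. It does not call a real tokenizer: the token IDs are placeholder integers, and `MAX_LEN`, `truncate`, and the two reserved slots for special tokens are assumptions made for illustration.

```python
import math

MAX_LEN = 512  # BERT's maximum input length, as stated in the text


def truncate(token_ids, max_len=MAX_LEN):
    """Keep only the first max_len tokens, as a truncating pipeline would."""
    return token_ids[:max_len]


# Stand-in for a tokenized 1000+ word article (placeholder integers, not real IDs).
token_ids = list(range(1200))

kept = truncate(token_ids)
dropped = len(token_ids) - len(kept)
print(f"kept {len(kept)} tokens, dropped {dropped}")  # kept 512 tokens, dropped 688

# Windows needed to cover the whole text if each window reserves two slots
# for special tokens (an assumption about the chunking scheme).
windows = math.ceil(len(token_ids) / (MAX_LEN - 2))
print(f"windows needed: {windows}")  # windows needed: 3
```

With these stand-in numbers, more than half of the text never reaches the model when it is simply truncated.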

Although I think you may struggle to find value in processing my Medium articles, the same applies to many useful data sources, such as news articles or Reddit posts.

We will take a look at how we can work around this limitation. In this article, we will find the sentiment for long posts from the /r/investing subreddit. This article will cover:

High-Level Approach
Getting Started
  Data
  Initialization
Tokenization
Preparing The Chunks
  Split
  CLS and SEP
  Padding
  Reshaping For BERT
Making Predictions
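
The chunk-preparation steps named in the outline (split, CLS and SEP, padding, reshaping) can be sketched roughly as follows. This is not the linked article's code, only one plausible reading of those steps in plain Python. The function name `prepare_chunks` is hypothetical, and the special-token IDs are assumed to be those of the `bert-base-uncased` vocabulary; a different model would need its own values.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed: IDs from the bert-base-uncased vocabulary
MAX_LEN = 512
BODY_LEN = MAX_LEN - 2  # room left once [CLS] and [SEP] are added


def prepare_chunks(token_ids):
    """Split token IDs into BERT-sized windows with special tokens and padding.

    Returns (input_ids, attention_mask), each a list of rows of length MAX_LEN.
    """
    input_ids, attention_mask = [], []
    for start in range(0, len(token_ids), BODY_LEN):
        # Split: take up to BODY_LEN tokens for this window.
        body = token_ids[start:start + BODY_LEN]
        # CLS and SEP: wrap the window in the special tokens.
        row = [CLS_ID] + body + [SEP_ID]
        mask = [1] * len(row)
        # Padding: fill a shorter window up to MAX_LEN and mask the padding out.
        pad = MAX_LEN - len(row)
        input_ids.append(row + [PAD_ID] * pad)
        attention_mask.append(mask + [0] * pad)
    return input_ids, attention_mask


# Stand-in token IDs for a long post (placeholder integers, not real tokenizer output).
ids, mask = prepare_chunks(list(range(1000, 2200)))
print(len(ids), len(ids[0]), sum(mask[-1]))  # 3 512 182
```

Every row has the same length, so the nested lists already have the (number of chunks, 512) shape of a batch and could be converted to tensors and passed to the model as one batch.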

Top comments (1)

alibehery • Nov 13 '22

where is the article ?
