Homework 7. Sentiment Analysis: Twitter Dataset
Introduction
This assignment is based on the Twitter dataset hosted by UW. The file contains 1,600,000 tweets collected from Twitter, each labeled with a sentiment score ranging from 0 (most negative) to 4 (most positive). The objective is to predict the sentiment of each tweet.
Data Preprocessing
Loading the Data
import pandas as pd

url = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/twitter_data.csv"
df = pd.read_csv(url, sep=",")
df.columns = ["sentiment_label", "tweet_text"]
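As a quick sanity check, we can peek at the first few rows and the label distribution (a minimal sketch; the column names are the ones assigned above):
# Inspect the first rows and how many tweets carry each sentiment label
print(df.head())
print(df['sentiment_label'].value_counts())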
Process Text
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS

# The stopword list and WordNet data are downloaded separately from nltk;
# uncomment these lines on a first run if they are not installed yet.
# nltk.download('stopwords')
# nltk.download('wordnet')
def preprocess(text, steps):
    """
    Process text by applying a list of cleaning steps in order.

    Args:
    - text (str): The input string to be processed.
    - steps (list): The names of the operations to apply to the text.

    Returns:
    - text (str): Processed text.
    """
    # Remove non-ASCII characters
    if 'remove_non_ascii' in steps:
        text = ''.join([x for x in text if ord(x) < 128])
    # Convert to lowercase
    if 'lowercase' in steps:
        text = text.lower()
    # Remove punctuation
    if 'remove_punctuation' in steps:
        text = ''.join(char for char in text if char not in string.punctuation)
    # Remove numbers (raw string avoids an invalid escape sequence warning)
    if 'remove_numbers' in steps:
        text = re.sub(r"\d+", "", text)
    # Collapse repeated whitespace
    if 'strip_whitespace' in steps:
        text = ' '.join(text.split())
    # Remove stopwords
    if 'remove_stopwords' in steps:
        stops = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stops])
    # Lemmatize words (despite the step name, WordNetLemmatizer
    # lemmatizes rather than stems)
    if 'stem_words' in steps:
        lmtzr = WordNetLemmatizer()
        text = ' '.join([lmtzr.lemmatize(word) for word in text.split()])
    return text
steps = ['remove_non_ascii', 'lowercase', 'remove_punctuation', 'remove_numbers',
'strip_whitespace', 'remove_stopwords', 'stem_words']
# Apply preprocessing to each tweet and store the result in a new column
df['clean_tweet'] = df['tweet_text'].apply(lambda s: preprocess(s, steps))
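To see the effect of the pipeline, it helps to compare one raw tweet against its cleaned version (a minimal sketch; row index 0 is arbitrary):
# Compare a raw tweet with its cleaned counterpart
print("raw:  ", df['tweet_text'].iloc[0])
print("clean:", df['clean_tweet'].iloc[0])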
# Generate wordcloud for positive sentiment tweets
pos_clean_string = ','.join(df[df['sentiment_label'] == 4]['clean_tweet'])
wordcloud = WordCloud(
max_words=50,
width=2500,
height=1500,
background_color='black',
stopwords=STOPWORDS
).generate(pos_clean_string)
We can plot the word cloud to visualize the most common words in positive sentiment tweets.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20, 10), facecolor='k', edgecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
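The same recipe applies to the negative class by filtering on label 0 (a sketch, assuming, as stated in the introduction, that 0 marks the most negative tweets; neg_clean_string and neg_wordcloud are illustrative names):
# Word cloud for negative-sentiment tweets, mirroring the positive one above
neg_clean_string = ','.join(df[df['sentiment_label'] == 0]['clean_tweet'])
neg_wordcloud = WordCloud(max_words=50, width=2500, height=1500,
                          background_color='black', stopwords=STOPWORDS).generate(neg_clean_string)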
Split Data into Training and Test Sets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

# Declare the TF-IDF vectorizer.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=6228, stop_words='english')

# Fit the vectorizer over the cleaned tweets and transform them into a
# sparse TF-IDF feature matrix.
clean_texts = df['clean_tweet']
tf_idf_tweets = vectorizer.fit_transform(clean_texts)

# Split the data, holding out 40,000 tweets for testing.
y_targets = np.array(df['sentiment_label'])
X_train, X_test, y_train, y_test = train_test_split(tf_idf_tweets, y_targets, test_size=40000, random_state=42)
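A quick shape check confirms the split (a small sketch; the exact training size depends on how many rows the file contains):
# Confirm the dimensions of the train/test feature matrices and label vectors
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)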
Classification Models
We train a linear classifier with stochastic gradient descent, using a modified Huber loss and elastic-net regularization.

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='modified_huber', learning_rate='adaptive', penalty='elasticnet',
                      alpha=2.9e-05, eta0=0.00164)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Model Score:", score)
Model Score: 0.755
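Accuracy alone can hide per-class behavior; a per-class report and confusion matrix give a fuller picture. A minimal sketch using scikit-learn's metrics module (not part of the original assignment output):
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 and the confusion matrix on the held-out set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))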