Homework 7. Sentiment Analysis: Twitter Dataset
Introduction
This assignment is based on the Twitter dataset hosted by UW. The file contains 1,600,000 tweets collected from Twitter, each labeled with a sentiment score ranging from 0 (most negative) to 4 (most positive). The objective is to predict the sentiment of each tweet.
Data Preprocessing
Loading the Data
import pandas as pd

url = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/twitter_data.csv"
df = pd.read_csv(url, sep=",")
df.columns = ["sentiment_label", "tweet_text"]
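As a quick sanity check, we can peek at the first few rows and the label distribution (a minimal sketch; the column names are the ones assigned above):
# Inspect the first rows and how many tweets carry each sentiment label
print(df.head())
print(df['sentiment_label'].value_counts())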
Process Text
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS

# The stopword list and WordNet data are downloaded separately from nltk;
# uncomment these lines on a first run if they are not installed yet.
# nltk.download('stopwords')
# nltk.download('wordnet')
def preprocess(text, steps):
    """
    Process text by applying a list of cleaning steps in order.

    Args:
    - text (str): The input string to be processed.
    - steps (list): The names of the operations to apply to the text.

    Returns:
    - text (str): Processed text.
    """
    # Remove non-ASCII characters
    if 'remove_non_ascii' in steps:
        text = ''.join([x for x in text if ord(x) < 128])
    # Convert to lowercase
    if 'lowercase' in steps:
        text = text.lower()
    # Remove punctuation
    if 'remove_punctuation' in steps:
        text = ''.join(char for char in text if char not in string.punctuation)
    # Remove numbers (raw string avoids an invalid escape sequence warning)
    if 'remove_numbers' in steps:
        text = re.sub(r"\d+", "", text)
    # Collapse repeated whitespace
    if 'strip_whitespace' in steps:
        text = ' '.join(text.split())
    # Remove stopwords
    if 'remove_stopwords' in steps:
        stops = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stops])
    # Lemmatize words (despite the step name, WordNetLemmatizer
    # lemmatizes rather than stems)
    if 'stem_words' in steps:
        lmtzr = WordNetLemmatizer()
        text = ' '.join([lmtzr.lemmatize(word) for word in text.split()])
    return text
steps = ['remove_non_ascii', 'lowercase', 'remove_punctuation', 'remove_numbers',
'strip_whitespace', 'remove_stopwords', 'stem_words']
# Apply preprocessing to each tweet and store the result in a new column
df['clean_tweet'] = df['tweet_text'].apply(lambda s: preprocess(s, steps))
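To see the effect of the pipeline, it helps to compare one raw tweet against its cleaned version (a minimal sketch; row index 0 is arbitrary):
# Compare a raw tweet with its cleaned counterpart
print("raw:  ", df['tweet_text'].iloc[0])
print("clean:", df['clean_tweet'].iloc[0])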
# Generate wordcloud for positive sentiment tweets
pos_clean_string = ','.join(df[df['sentiment_label'] == 4]['clean_tweet'])
wordcloud = WordCloud(
max_words=50,
width=2500,
height=1500,
background_color='black',
stopwords=STOPWORDS
).generate(pos_clean_string)
We can plot the word cloud to visualize the most common words in positive sentiment tweets.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20, 10), facecolor='k', edgecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
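The same recipe applies to the negative class by filtering on label 0 (a sketch, assuming, as stated in the introduction, that 0 marks the most negative tweets; neg_clean_string and neg_wordcloud are illustrative names):
# Word cloud for negative-sentiment tweets, mirroring the positive one above
neg_clean_string = ','.join(df[df['sentiment_label'] == 0]['clean_tweet'])
neg_wordcloud = WordCloud(max_words=50, width=2500, height=1500,
                          background_color='black', stopwords=STOPWORDS).generate(neg_clean_string)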
Split Data into Training and Test Sets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

# Declare the TF-IDF vectorizer.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=6228, stop_words='english')

# Fit the vectorizer over the cleaned tweets and transform them into a
# sparse TF-IDF feature matrix.
clean_texts = df['clean_tweet']
tf_idf_tweets = vectorizer.fit_transform(clean_texts)

# Split the data, holding out 40,000 tweets for testing.
y_targets = np.array(df['sentiment_label'])
X_train, X_test, y_train, y_test = train_test_split(tf_idf_tweets, y_targets, test_size=40000, random_state=42)
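A quick shape check confirms the split (a small sketch; the exact training size depends on how many rows the file contains):
# Confirm the dimensions of the train/test feature matrices and label vectors
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)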
Classification Models
We train a linear classifier with stochastic gradient descent, using a modified Huber loss and elastic-net regularization.

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='modified_huber', learning_rate='adaptive', penalty='elasticnet',
                      alpha=2.9e-05, eta0=0.00164)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Model Score:", score)
Model Score: 0.755
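Accuracy alone can hide per-class behavior; a per-class report and confusion matrix give a fuller picture. A minimal sketch using scikit-learn's metrics module (not part of the original assignment output):
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 and the confusion matrix on the held-out set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))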