
Sentiment analysis using statistical machine learning

Posted by Sridhar S | Jul 04, 2012

Sentiment analysis is the task of identifying whether the opinion expressed in a text is positive or negative about a given topic. The classification can be refined further into nine human sentiments: love, laughter, compassion, anger, valor, fear, disgust, wonder and peace. In the context of product reviews, sentiment analysis involves automatically classifying whether users have expressed positive or negative opinions about a product. However, a company usually wants to assess popular sentiment towards a product early in its life cycle, when little labeled review data for that product is available.

Under such circumstances, the only option available is to manually label a large number of product reviews to generate training data, which is a costly endeavor. We have explored some solutions to this problem. In particular, we analyze the extent to which classification models trained on one set of products can be used to analyze reviews of a different product, and we explore strategies for combining multiple classification models trained on different products.

To predict the sentiment polarity of a new product review using training data available for a different set of products, we focus on a statistical machine learning technique: Support Vector Machine (SVM) classifiers. Combining an ensemble of SVM classifiers with a vocabulary intersection heuristic helps build a better-performing SVM-based sentiment analysis tool.

We define the problem of product sentiment analysis as follows:

Given a significant number of labeled reviews for products P1 to Pn, we want to learn a classification model for a new product Pn+1 for which only a small number of labeled reviews are provided. Let Di represent the set of labeled product reviews for product Pi. Then each element of Di is a two-tuple (rij, lij), where rij is the jth review for product Pi and lij is its sentiment label (either positive: 1 or negative: 0).

Our goal is to use the available data in sets D1…Dn to achieve a high prediction accuracy on the test set of reviews Dn+1 for product Pn+1.
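As a rough illustration of this setup, the labeled sets D1 through Dn and Dn+1 could be held as lists of (review, label) tuples. This is only a sketch: the product names, the review texts and the helper `label_counts` are invented for illustration and are not taken from our dataset.

```python
from typing import Dict, List, Tuple

# One element of Di: (review text rij, label lij), with 1 = positive, 0 = negative.
LabeledReview = Tuple[str, int]

# Hypothetical source products P1..Pn with their labeled review sets D1..Dn.
source_reviews: Dict[str, List[LabeledReview]] = {
    "camera": [("sharp pictures and great battery", 1),
               ("blurry lens and poor battery", 0)],
    "phone": [("great screen and fast", 1),
              ("slow and poor screen", 0)],
}

# Hypothetical target product Pn+1 with its small labeled set Dn+1.
target_reviews: List[LabeledReview] = [("great sound and sharp bass", 1),
                                       ("poor sound", 0)]


def label_counts(reviews: List[LabeledReview]) -> Tuple[int, int]:
    """Return (number of positive reviews, number of negative reviews)."""
    positives = sum(label for _, label in reviews)
    return positives, len(reviews) - positives
```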

Leveraging SVM

Support vector machines have been widely successful for various text classification tasks in the past. We therefore start with SVM as our baseline model. Our SVM approach is based on the following intuition.

Given two different product domains, one can estimate their similarity by looking at how well the feature vectors of their reviews, or in our case the vocabularies of their reviews, overlap. When predicting the sentiment polarity of a review for a novel product, we would prefer to rely on classifiers that were trained on similar products rather than on those trained on very different ones. Thus, once domain similarity is calculated, one can assign each product classifier a weight proportional to its similarity score.

Mathematically, we formalize this idea as follows. Given sets of labeled product-specific reviews D1 through Dn, we train n SVM classifiers C1 through Cn, one per product. Let the prediction score of classifier Ci on some review r for the target product Pn+1 (r ∈ Dn+1) be given by Ci(r), where:

Ci(r) > 0 : if the review sentiment is positive

Ci(r) < 0 : if the review sentiment is negative

 

Also, let the vocabulary of product Pi be given by Vi. We define the vocabulary overlap between products Pi and Pn+1 to be:

VO(i, n+1) = |Vi ∩ Vn+1| ÷ |Vi|, where i varies from 1 through n.

We further define the normalized vocabulary overlap to be:

NormVO(i, n+1) = VO(i, n+1) ÷ min(j = 1 to n) { VO(j, n+1) }

Intuitively, the vocabulary overlap between two products is the proportion of the training-domain product's words that also occur in the target product's reviews. Note that this makes the metric non-symmetric, i.e. VO(i, j) ≠ VO(j, i) in general. The normalized vocabulary overlap, on the other hand, divides each overlap by the minimum vocabulary overlap among the products P1 through Pn, so the least similar product receives a weight of 1 and more similar products receive proportionally larger weights.
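The two definitions can be sketched in a few lines of Python. This is a minimal illustration under assumptions of our own choosing: the regex tokenizer, the function names and the toy reviews are hypothetical, and the post does not prescribe how a vocabulary is extracted or what to do when the minimum overlap is zero (the sketch simply raises an error in that case).

```python
import re
from typing import Dict, Iterable, Set


def vocabulary(reviews: Iterable[str]) -> Set[str]:
    """Collect the set of lowercase word tokens (assumed tokenizer: a simple regex)."""
    vocab: Set[str] = set()
    for text in reviews:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab


def vocab_overlap(source_vocab: Set[str], target_vocab: Set[str]) -> float:
    """VO(i, n+1): share of the source product's words that the target also uses."""
    if not source_vocab:
        return 0.0
    return len(source_vocab & target_vocab) / len(source_vocab)


def normalized_overlaps(source_vocabs: Dict[str, Set[str]],
                        target_vocab: Set[str]) -> Dict[str, float]:
    """NormVO(i, n+1): each overlap divided by the smallest overlap among the sources."""
    overlaps = {name: vocab_overlap(vocab, target_vocab)
                for name, vocab in source_vocabs.items()}
    smallest = min(overlaps.values())
    if smallest == 0:
        # Assumption: the post does not cover this case, so the sketch refuses to guess.
        raise ValueError("minimum vocabulary overlap is zero; NormVO is undefined")
    return {name: overlap / smallest for name, overlap in overlaps.items()}


# Toy reviews, invented for illustration.
camera_vocab = vocabulary(["great battery life", "poor zoom"])  # 5 distinct words
phone_vocab = vocabulary(["great screen", "poor battery"])      # 4 distinct words
target_vocab = vocabulary(["great sound", "poor battery"])      # 4 distinct words

# VO(camera, target) = 3/5 and VO(phone, target) = 3/4, so the camera classifier
# gets weight 1 and the phone classifier gets a weight of about 1.25.
weights = normalized_overlaps({"camera": camera_vocab, "phone": phone_vocab},
                              target_vocab)
```

In this toy example the metric is visibly non-symmetric: three of the camera's five words are shared with the target, but three of the target's four words are shared with the camera.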

Finally we define the ensemble SVM classifier as follows:

∑(i=1 to n) NormVO(i, n+1) · Ci(r) > 0 : classify the reviewer's sentiment as positive

∑(i=1 to n) NormVO(i, n+1) · Ci(r) < 0 : classify the reviewer's sentiment as negative

Overall, the ensemble SVM classifier combines the predictions of all n SVM classifiers and weights them by their domain's similarity to the target product. If the resulting prediction score is greater than 0, we classify the review as positive; otherwise we classify it as negative. In the subsequent section we describe the dataset used and present our results.
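The weighted vote can be sketched as follows. The two scoring functions are hypothetical stand-ins for trained per-product SVMs and exist only to make the example runnable; in practice Ci(r) would be the signed decision value of a trained classifier (for example, the output of `decision_function` in scikit-learn's `LinearSVC`, if that library were used). The weights are assumed NormVO values, not results from our experiment.

```python
from typing import Callable, Dict

# A per-product classifier Ci: maps a review to a signed score
# (> 0 leans positive, < 0 leans negative).
Scorer = Callable[[str], float]


def ensemble_score(review: str,
                   classifiers: Dict[str, Scorer],
                   weights: Dict[str, float]) -> float:
    """Sum over all source products of NormVO(i, n+1) * Ci(r)."""
    return sum(weights[name] * scorer(review)
               for name, scorer in classifiers.items())


def classify(review: str,
             classifiers: Dict[str, Scorer],
             weights: Dict[str, float]) -> int:
    """Return 1 (positive) if the weighted score is above 0, else 0 (negative)."""
    return 1 if ensemble_score(review, classifiers, weights) > 0 else 0


# Hypothetical stand-ins for trained SVMs (keyword rules, for illustration only).
def camera_scorer(review: str) -> float:
    return 0.7 if "sharp" in review else -0.7


def phone_scorer(review: str) -> float:
    return -0.6 if "slow" in review else 0.6


classifiers = {"camera": camera_scorer, "phone": phone_scorer}
norm_vo = {"camera": 1.0, "phone": 1.25}   # assumed NormVO weights
uniform = {"camera": 1.0, "phone": 1.0}    # unweighted vote, for comparison

review = "sharp display but slow"
weighted_label = classify(review, classifiers, norm_vo)
unweighted_label = classify(review, classifiers, uniform)
```

Here the two classifiers disagree. With uniform weights the camera classifier's slightly larger score wins and the review is labeled positive, whereas with the assumed NormVO weights the more similar phone domain dominates and the review is labeled negative. A score of exactly 0 is treated as negative, following the rule stated above.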

As the evaluation criterion we use classification accuracy: the percentage of reviews in the test set whose sentiment polarity was correctly predicted by the classifier. Mathematically, accuracy is the ratio of the sum of the true positives and the true negatives to the total number of reviews in the test set, i.e. the sum of all positive and all negative reviews in the test set.

Accuracy = ((True Positives + True Negatives) * 100) / Total Test Reviews
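The accuracy formula translates directly into code. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative), as in the problem definition; the function name and the example labels are ours.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Percentage of test reviews whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the test set must not be empty")
    true_positives = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    true_negatives = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return (true_positives + true_negatives) * 100 / len(actual)


# Invented example: 2 true positives and 1 true negative out of 4 test reviews.
example_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```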

In this post, we have shared an approach to cross-product sentiment classification using SVMs. An experiment on a newly crawled dataset of nearly 25K reviews across four different products showed that the proposed method, an ensemble of SVM classifiers combined with a vocabulary intersection heuristic, is useful for improving cross-product classification performance. In the future, we will extend this work by experimenting on a bigger dataset and by developing novel classification models, based on another statistical classifier, tuned specifically towards cross-product sentiment classification.
