Personality Analysis Using Machine Learning

The goal of this project is to develop software capable of performing sentiment analysis on bodies of text with a given author. This tool will mimic IBM Watson's Personality Insights Service, a popular online resource and application programming interface (API) for personality classification.

Background
Many sentiment analysis tools are available for public use; however, many have significant drawbacks, are poorly documented, or classify with relatively poor accuracy. We plan to produce the initial framework needed to create a new sentiment analysis tool that remedies many of these problems. Specifically, we will focus on mining tweets from Twitter, storing them in a database with relevant statistical information, and training a machine on the collected data. In time, the results of this project may develop into a mobile app.

Deliverables

 * Twitter mining tool capable of collecting and combining numerous tweets from verified users/authors into text samples
 * Text samples passed through IBM Watson Personality Insights Service
 * Database with large collection of authors, indexed IBM Watson output, and statistics
 * Machine trained on collected data

Database
For this project we are using a MySQL database. Each entry in the database contains the author’s name, the original text file, the URL of the source, the JSON objects from our scraper (Tweepy), and the JSON objects from IBM Watson’s output. Eventually we will add the Five Factor Model results from our machine. Until then, we will continue collecting data to train our machine.
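As a rough illustration of how one database entry could be laid out, the sketch below stores the fields listed above in a single table. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name, column names, and helper functions are hypothetical and are not taken from the project's actual schema.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS authors (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    source_url  TEXT NOT NULL,
    text_sample TEXT NOT NULL,  -- combined tweet text for this author
    tweepy_json TEXT NOT NULL,  -- raw scraper output, stored as a JSON string
    watson_json TEXT            -- Watson output; NULL until the sample is scored
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_author(conn, name, source_url, text_sample, tweets, watson=None):
    """Insert one author entry and return its row id."""
    cur = conn.execute(
        "INSERT INTO authors "
        "(name, source_url, text_sample, tweepy_json, watson_json) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            name,
            source_url,
            text_sample,
            json.dumps(tweets),
            json.dumps(watson) if watson is not None else None,
        ),
    )
    conn.commit()
    return cur.lastrowid


def load_watson(conn, author_id):
    """Return the stored Watson output as a dict, or None if not yet scored."""
    row = conn.execute(
        "SELECT watson_json FROM authors WHERE id = ?", (author_id,)
    ).fetchone()
    return json.loads(row[0]) if row and row[0] else None
```

Keeping the JSON objects as text columns leaves the raw scraper and Watson output intact, so new statistics can be derived later without collecting the data again.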

Twitter Scraping
Samples are based on Twitter user profiles. A Python package called Tweepy was used to extract data from Twitter. Twitter restricts the number of API calls that can be made, so the scraper is set up to run unattended. Tweepy API information can be found in the Tweepy documentation (see Important Links).
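The following is a minimal sketch of how tweets from one profile might be combined into a single text sample. It is not the project's scraper. The function name is hypothetical, and skipping retweets is an assumption made here so that the sample reflects the author's own writing. The `api` argument is expected to behave like a `tweepy.API` object; the docstring shows one way such an object could be created so that Tweepy waits out rate limits on its own.

```python
def collect_text_sample(api, screen_name, max_tweets=200):
    """Combine one user's recent tweets into a single text sample.

    `api` is assumed to behave like a tweepy.API instance, for example:

        auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
        auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
        api = tweepy.API(auth, wait_on_rate_limit=True)

    The credential names above are placeholders.
    """
    statuses = api.user_timeline(
        screen_name=screen_name,
        count=max_tweets,        # Twitter caps a single timeline request at 200
        tweet_mode="extended",   # return untruncated text in `full_text`
    )
    texts = []
    for status in statuses:
        # Assumption: retweets are skipped because the author did not write them.
        if hasattr(status, "retweeted_status"):
            continue
        # Collapse internal whitespace so each tweet occupies one line.
        texts.append(" ".join(status.full_text.split()))
    return "\n".join(texts)
```

Passing the API object in as an argument keeps the combining logic separate from authentication, which also makes it possible to try the function with a stub object before using real credentials.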

Text Preprocessing
Input text will be processed using an algorithm called GloVe [8]. GloVe finds relationships between words by analyzing how often they co-occur. Specifically, GloVe fits a global log-bilinear model with a weighted least squares objective that drives the difference between the dot product of any two word feature vectors and the logarithm of their co-occurrence count toward zero. The resulting trained word vectors can be used as features in various machine learning algorithms.
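To make the objective concrete, here is a small pure-Python sketch of the GloVe loss for a toy co-occurrence table. It only evaluates the loss and does not train anything. The function names and the data layout (a dict of counts plus lists of vectors and biases) are choices made for this illustration; the weighting constants `x_max=100` and `alpha=0.75` are the defaults reported in the GloVe paper.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, w, w_ctx, b, b_ctx):
    """Weighted least squares loss over all nonzero co-occurrence counts.

    cooc  maps (i, j) -> co-occurrence count X_ij
    w     word vectors, w_ctx context vectors (lists of lists of floats)
    b     word biases,  b_ctx context biases  (lists of floats)
    """
    total = 0.0
    for (i, j), x in cooc.items():
        if x <= 0:
            continue  # log(0) is undefined; zero counts contribute nothing
        dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
        diff = dot + b[i] + b_ctx[j] - math.log(x)
        total += glove_weight(x) * diff * diff
    return total
```

The loss is zero exactly when every dot product plus its two bias terms equals the logarithm of the corresponding count, which is the condition described above.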

Machine Learning
Two machine learning (ML) approaches, Gaussian Processes (GP) and Convolutional Neural Networks (CNN), were considered and compared to determine which would best suit the needs of the project. Due to the statistical complexity of GPs [9], we decided to implement a CNN.

A CNN personality classifier based on Majumder et al. [6], as implemented by SenticNet, was chosen. The package includes its own text preprocessing, which can take the place of GloVe. Tests will have to be conducted to determine whether GloVe provides any meaningful benefit.

Training
The CNN was trained on Watson's output, which includes both percentile and raw scores for each personality trait. Mean absolute error (MAE) was used to measure how far our machine's predictions were from Watson's. Below are the MAE values for each trait based on percentile and raw scores and data set sizes of 1000 and 6000 text samples. For comparison, IBM Watson achieved an MAE of 0.12 averaged across all five traits.
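As a minimal sketch of the metric, the code below computes MAE per trait between a model's predictions and Watson's scores. The function names and the dict-per-sample layout are assumptions made for illustration, and the trait keys follow the standard Five Factor Model names rather than any particular field names in Watson's JSON output.

```python
# Five Factor Model traits; key names here are illustrative.
TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def mean_absolute_error(predicted, target):
    """Average of |prediction - target| over paired values."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def per_trait_mae(predictions, watson_scores):
    """MAE for each trait.

    Both arguments are lists with one dict per text sample,
    each dict mapping a trait name to a score.
    """
    return {
        trait: mean_absolute_error(
            [p[trait] for p in predictions],
            [w[trait] for w in watson_scores],
        )
        for trait in TRAITS
    }
```

The same function applies to either percentile or raw scores, as long as the predictions and the Watson values being compared are on the same scale.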



Important Links
 * Tweepy Documentation
 * Watson Personality Insights
 * GloVe & GitHub Repository
 * Gaussian Processes
 * WEKA
 * Intro to CNNs
 * Mairesse Feature-Based Personality Recognizer

Document Archive

 * [[Media:2018_Personality_Analysis_Using_Machine_Learning Minutes.pdf|Meeting Minutes]]
 * [[Media:2018_Personality_Analysis_Using_Machine_Learning_Design_Review.pdf|Final Design Review]]