Personality Analysis Using Machine Learning

The goal of this project is to develop software capable of performing sentiment analysis on bodies of text with a given author. This tool will mimic IBM Watson's Personality Insights Service, a popular online resource and application programming interface (API) for personality classification.

Background
There are many sentiment analysis tools available for public use. However, many have significant drawbacks, are not well documented, or perform classification with relatively poor accuracy. We plan to produce the initial framework needed to create a new sentiment analysis tool that remedies many of these problems. Specifically, we will focus on mining tweets from Twitter, storing them in a database with relevant statistical information, and training a machine learning model on the collected data. In time, the results of this project may develop into a mobile app.

Deliverables

 * Twitter mining tool capable of collecting and combining numerous tweets from verified users/authors into text samples
 * Text samples passed through IBM Watson Personality Insights Service
 * Database with large collection of authors, indexed IBM Watson output, and statistics
 * Machine learning model trained on collected data

Database
For this project we are using a MySQL database. Each database entry contains the author’s name, the original text file, the URL of the source, the JSON objects from our scraper (Tweepy), and the JSON objects from IBM Watson’s output. Eventually we will add the Five Factor Model results produced by our own model. Until then, we will collect data to train that model.
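As a rough illustration of the entry layout described above, the sketch below creates one table and inserts one record. It uses Python's built-in sqlite3 module purely as a stand-in so it runs without a MySQL server. The table name, column names, and helper functions are hypothetical and not taken from the project.

```python
import json
import sqlite3

# Hypothetical schema mirroring the fields listed above.
# SQLite stands in for MySQL here; the column types would differ in MySQL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS authors (
    id          INTEGER PRIMARY KEY,
    author_name TEXT NOT NULL,
    source_text TEXT NOT NULL,   -- original text sample
    source_url  TEXT,            -- URL of the source
    tweepy_json TEXT,            -- raw JSON from the scraper
    watson_json TEXT,            -- raw JSON from IBM Watson
    model_json  TEXT             -- Five Factor Model results, filled in later
);
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_entry(conn, author_name, source_text, source_url, tweepy_data, watson_data):
    """Insert one author entry and return its row id."""
    cur = conn.execute(
        "INSERT INTO authors "
        "(author_name, source_text, source_url, tweepy_json, watson_json) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            author_name,
            source_text,
            source_url,
            json.dumps(tweepy_data),
            json.dumps(watson_data),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

The model_json column is left empty on insert, matching the plan to add our own model's results once it has been trained.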

Twitter Scraping
Sample data will be based on Twitter user profiles. A Python package called Tweepy will be used to extract data from Twitter. Twitter restricts the number of API calls that can be made in a given time window, so the scraper will be set up to run unattended and wait out rate limits without intervention. Tweepy API information can be found in the Tweepy documentation.
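A minimal sketch of the collect-and-combine step might look like the following. It assumes `api` is a `tweepy.API` object for Twitter's v1.1 API, for example one created with `tweepy.API(auth, wait_on_rate_limit=True)` so that Tweepy sleeps when the rate limit is reached. The function names and the URL-stripping step are illustrative assumptions, not the project's actual scraper.

```python
import re

# Matches http/https links so they can be dropped from the text sample.
URL_PATTERN = re.compile(r"https?://\S+")


def fetch_tweets(api, screen_name, count=200):
    """Return the text of an author's most recent tweets.

    `api` is assumed to be a tweepy.API instance; tweet_mode="extended"
    asks for untruncated text, exposed as `full_text` on each status.
    """
    statuses = api.user_timeline(
        screen_name=screen_name, count=count, tweet_mode="extended"
    )
    return [status.full_text for status in statuses]


def combine_tweets(tweets):
    """Merge individual tweets into a single text sample, one tweet per line."""
    cleaned = []
    for text in tweets:
        text = URL_PATTERN.sub("", text)   # remove links
        text = " ".join(text.split())      # collapse whitespace
        if text:                           # skip tweets that are now empty
            cleaned.append(text)
    return "\n".join(cleaned)
```

The combined string is the kind of text sample that would then be stored and passed to IBM Watson.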

Text Processing
Input text will be processed using an algorithm called GloVe*. This algorithm counts how often each word co-occurs with other words in its surrounding context across the whole corpus, and learns a vector for each word from those counts. The resulting word vectors can be used as features in machine learning models.

* Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
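To make the idea of co-occurrence concrete, the sketch below counts word pairs that appear within a fixed window of each other, weighting a pair that is d words apart by 1/d as described in the GloVe paper. It only builds the co-occurrence counts and does not train word vectors. The function name, the whitespace tokenization, and the window size are assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count symmetric, distance-weighted word co-occurrences.

    Returns a dict mapping (word, context_word) to a weighted count,
    where a pair d positions apart contributes 1/d.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at up to `window` preceding tokens.
        for j in range(max(0, i - window), i):
            distance = i - j
            context = tokens[j]
            counts[(word, context)] += 1.0 / distance
            counts[(context, word)] += 1.0 / distance
    return dict(counts)


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)
```

In this toy sentence, adjacent words such as "cat" and "the" receive a larger count than words two positions apart such as "mat" and "on".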

Machine Learning
Two machine learning (ML) approaches, Gaussian Process (GP) and Convolutional Neural Network (CNN), are currently being considered. A comparison will be made to determine the better option (work in progress).