
How Big Data Can Help the Third Sector

06 Apr 2012 / 0 Comments / in Uncategorized/by alexdavies

McKinsey Global Institute ranks Big Data as one of the largest sectors of growth in the coming years, describing “a tremendous wave of innovation productivity and growth … all driven by Big Data”. The Economist’s World in 2012 proclaims that “2012 will be the year in which the Big Data trend gets noticed”. Big Data is already here, and one of the key challenges now is working out how to bring it to those who stand to benefit from it most.

Big Data – the ability to process, analyse, and understand vast amounts of data – has long been the purview of insurance firms, cashed-up hedge funds, and tech juggernauts such as Google, Yahoo and Microsoft. These companies were, for a time, the only ones that had access to such large amounts of data and knew that they could benefit greatly from analysing it.

But this is no longer true. Most companies now realise they have huge quantities of data and records, sometimes only in paper form, that do little more than take up room in archives. In a world of data-driven decisions, this is a lot of value that is just left on the shelf.

This is especially tragic in the charitable sector. Social enterprises and NGOs are generating piles of medical data, quality-of-life surveys, and micro-lending transaction data faster than they can make proper use of them. But with a looming talent gap of 100,000 people in the US alone*, there will be far more big-data-related roles than experts to fill them, and charities and NGOs risk being left out in the cold. The ability to process big data could make all the difference for charities, increasing efficiency and deploying resources where they can have the greatest impact.

However, this doesn’t have to be the case. Data scientist Jeff Hammerbacher (co-founder and chief scientist of Cloudera and one of the first 100 Facebook employees) famously said, upon leaving the social media giant: “The best minds of my generation are thinking about how to make people click ads. That sucks.” This sentiment is widely shared in the Big Data community: talent is increasingly drawn to where the most money is (online advertising) and away from where it could have the most social impact.

There is a rapidly growing Big Data community, and many of its members want to donate their time and skills to help charities and NGOs deal with their data. However, the gap between the charities with the data and the analysts who can unlock its potential remains dishearteningly wide. With little widespread knowledge of Big Data and few existing links between the two communities, there have been few avenues for productive collaboration. This partnership between analysts and charities is where the key investment needs to take place if the social sector is to ride the wave of big data, bringing together the analysts who want to assist the third sector with the third-sector organisations that could benefit immensely. While there is still a lot of work to be done, some groups are already taking on this challenge, and two of the most promising are Data without Borders and Kaggle.

Data without Borders was founded last year by New York Times Data Scientist Jake Porway in response to exactly this problem. He saw the chasm between charities and the Big Data community and started working to form real links between the two groups. Data without Borders has already organised multiple events that brought charities together with Big Data scientists who could help them unlock the potential of their data. Initial projects were run with charities such as UN Global Pulse and The Microfinance Information Exchange Market. If Data without Borders can reach a critical mass where many charities are aware of it and willing to share their data, it could easily become the global force for charitable data analysis.

Kaggle is a less obvious vehicle for bringing data expertise to social enterprise. Kaggle does not focus on charities or NGOs. Rather, it runs competitions in which thousands of data scientists compete to provide the best solutions to data problems. Kaggle has already achieved fame for helping NASA to map dark matter in the universe. Within only weeks, competitors unfamiliar with astronomy were able to significantly outperform the most cutting-edge models for this cosmology problem. The competition platform Kaggle has created has the potential to overcome the problem of bringing the two communities together.

While Kaggle has yet to run a competition on behalf of an NGO or charity, the platform shows great promise for rallying data scientists behind the larger data problems that these organisations face. While other competitions have traditionally offered cash prizes for the winner, charities and NGOs can appeal instead to the unsatisfied desire of data scientists to use their skills for the greater good, and that desire is real. Once charities and NGOs gain a clearer understanding of what they want to extract from their data, this will provide a powerful means to bridge the gap between data-laden charities and analysts with an appetite for charity.

This is the beginning of a new era in the use of information and it is up to us to make sure that the third sector isn’t left behind.

*McKinsey Global Institute report on Big Data

What your emoticons mean on Twitter

03 Oct 2011 / 1 Comment / in Uncategorized/by alexdavies

Did you ever wonder what words are associated with the emoticons that you use on Twitter? After collecting word lists for sentiment analysis based on the classic ‘:(’ and ‘:)’, I had a look at a few other common emoticons, with some surprising results. There are some Wordle visualizations of the results below. It is interesting to see that emoticons that ostensibly represent the same emotion (such as :) and ^_^) are used in markedly different contexts. If you are easily offended, please ignore the last image. I take no responsibility for what people choose to say on Twitter.

 

A quick refresher (of traditional meanings):

:) – Happy (two eyes and a smiling mouth on its side)

:( – Sad (two eyes and a frowning mouth on its side)

:/ – Frustrated/Unamused (two eyes and a slanting mouth on its side)

<3 – Love (a heart on its side)

^_^ – Happy (East Asian style, two happy eyes and a flat mouth)

-_- – Frustrated/Unamused/Upset (East Asian style, two flat eyes and a flat mouth)

And more emoticons than you ever wanted to know.

 

Feel free to use these wherever and however you like; credit back where possible.


[Wordle visualizations appeared here, one for each of: :), :(, <3, ^_^, :/ and -_-]


 

A word list for sentiment analysis of Twitter

02 Oct 2011 / 3 Comments / in Uncategorized/by alexdavies

After presenting my poster at the Social Web Mining Workshop at KDD, I had a number of requests for a word list that can be used for sentiment analysis of Twitter. It is worth pointing out at the outset that you will almost always want to generate your own list and not rely on a pre-generated one like this (see point 1). However, a pre-generated list can be useful for quick testing of ideas/applications where the accuracy of the sentiment analysis is not critical. If you just want the list, get it here: Twitter sentiment analysis word list. For an explanation, read below.

1. When to use these lists

If you want to very quickly test a system that involves sentiment analysis, this is the place to start. Once you have verified that your system is feasible, you’ll almost certainly want to train your own set of words. This is because you get a lot of value from considering what exactly a happy sentiment means in your context. These lists are generated by considering only single emoticons to represent emotions. This approach is very general and can be applied to any language, but it doesn’t capture the wide array of prior knowledge about your domain that you can usually incorporate.
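If you do want to train your own list, the rough idea can be sketched as follows. This is only a minimal sketch under my own assumptions: the post does not spell out the exact estimation procedure, so the counting scheme and the names `label_by_emoticon` and `joint_log_probs` are hypothetical, and there is no smoothing or spam filtering here.

```python
import math
from collections import Counter

def label_by_emoticon(text):
    """Label a tweet 'happy' or 'sad' from a single emoticon, else None."""
    has_happy = ":)" in text
    has_sad = ":(" in text
    if has_happy == has_sad:
        # Neither emoticon, or both: too ambiguous to use as a label.
        return None
    return "happy" if has_happy else "sad"

def joint_log_probs(labelled_tweets):
    """Estimate log p(w, s) by counting (word, sentiment) pairs.

    labelled_tweets is an iterable of (tokens, sentiment) pairs.
    """
    counts = Counter()
    total = 0
    for tokens, sentiment in labelled_tweets:
        for word in tokens:
            counts[(word, sentiment)] += 1
            total += 1
    return {pair: math.log(n / total) for pair, n in counts.items()}
```

To adapt this to your own domain, you would swap `label_by_emoticon` for whatever signal best captures a happy or sad sentiment in your context.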

2. How to use these lists

The file contains a list of ~5000 common words, each with its joint log probability of appearing in a happy tweet and in a sad tweet, i.e. log p(w, happy) and log p(w, sad).
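Loading such a file into a lookup table might look something like the sketch below. The layout is an assumption on my part (one word per line, followed by the happy and then the sad log probability, separated by tabs), so check it against the actual file before relying on it; `parse_word_list` is a hypothetical helper name.

```python
def parse_word_list(lines):
    """Build {word: (happy_log_prob, sad_log_prob)} from an iterable of lines.

    Assumed layout (not confirmed by the post): word<TAB>happy<TAB>sad
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, happy_lp, sad_lp = line.split("\t")
        table[word] = (float(happy_lp), float(sad_lp))
    return table

# The values for "happy" are the ones quoted in this post.
sample_lines = ["happy\t-5.63706\t-8.43618\n"]
table = parse_word_list(sample_lines)
```

With a real file you would pass an open file object instead of `sample_lines`, since iterating over a file yields its lines.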

Example: Deciding if a tweet is happy or sad.

This is by far the most common type of sentiment analysis performed on Twitter. For every tweet t that you have, you first need to tokenize it into a list of words w. I would recommend twokenize.py by Brendan O’Connor, which I used for my work.

To make this concrete we will use the example tweet:

t = "I am happy about something".

After tokenization, we should have:

w = ["i", "am", "happy", "about", "something"].
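If you just want to experiment before setting up a proper tokenizer, a crude stand-in could look like this. To be clear, this is not twokenize.py and is nowhere near as careful with URLs, hashtags, and the like; the regular expression and the name `simple_tokenize` are my own illustrative assumptions.

```python
import re

# Emoticons are tried first so ":)" or "<3" survive as single tokens,
# then runs of letters, digits, and apostrophes are taken as words.
TOKEN_RE = re.compile(r"[:;][-']?[()/\\]|<3|\^_\^|-_-|[a-z0-9']+")

def simple_tokenize(text):
    """Lowercase the text and split it into word and emoticon tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = simple_tokenize("I am happy about something")
```

Lowercasing matters here because the lookup in the next step is a plain string match against the word list.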

For every word, we simply look up its log-probability in our sentiment list. If it doesn’t turn up in the file, it probably wasn’t important anyway, and we ignore it. In this case we find the probabilities for the words in our tweet.

log( p(w, happy) ) = [<no entry>, <no entry>, -5.63706, ...]

log( p(w, sad) ) = [<no entry>, <no entry>, -8.43618, ...]

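To get the log probability of the entire tweet, we simply add up all the elements of the array. This treats the words as independent given the sentiment, so the probability of the tweet is the product of the word probabilities, and log(p(x)) + log(p(y)) = log(p(x)p(y)).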
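The lookup-and-sum step can be sketched in a few lines. In this sketch the table holds only the entry for "happy", because that is the only value quoted in full above; the remaining values are elided in this post and I have not filled them in. The names `tweet_log_prob`, `HAPPY`, and `SAD` are hypothetical.

```python
HAPPY, SAD = 0, 1  # column indices into each (happy, sad) pair

def tweet_log_prob(tokens, table, column):
    """Sum the log-probabilities of the tokens that have an entry.

    Words missing from the table are ignored, as described above.
    """
    return sum(table[word][column] for word in tokens if word in table)

# Only "happy" is taken from the post; every other word has no entry here.
table = {"happy": (-5.63706, -8.43618)}
tokens = ["i", "am", "happy", "about", "something"]

happy_score = tweet_log_prob(tokens, table, HAPPY)
sad_score = tweet_log_prob(tokens, table, SAD)
```

Note that a tweet with no known words gets a score of 0 for both sentiments, which simply means the list has nothing to say about it.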

The reason we are working in log-probabilities here is that these probabilities will often be very small, and if we were to multiply them they would quickly become too small for our machine’s floating-point numbers to represent. If we only want to make a decision about what the most likely sentiment is, we can simply choose the one with the highest log probability. Because the log function is monotonically increasing, if one has a higher log-probability, it also has a higher probability.
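Both points are easy to see in a small experiment. The sketch below uses a made-up per-word probability of 1e-5 purely to show the underflow; `most_likely_sentiment` is a hypothetical helper name.

```python
import math

# Multiplying many small probabilities underflows to zero...
tiny = 1e-5   # made-up per-word probability, for illustration only
n_words = 100
product = 1.0
for _ in range(n_words):
    product *= tiny

# ...while the equivalent sum of logs stays comfortably representable.
log_sum = n_words * math.log(tiny)

def most_likely_sentiment(log_probs):
    """Return the sentiment whose total log-probability is highest."""
    return max(log_probs, key=log_probs.get)

decision = most_likely_sentiment({"happy": -5.63706, "sad": -8.43618})
```

Here `product` ends up as exactly 0.0, which would make every sentiment look equally impossible, whereas `log_sum` is just a moderately large negative number.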

If we want to get the probabilities that a tweet is happy or sad, then we do a small amount of math and come up with:

 

p(s|\bar{w}) = \frac{p(\bar{w}|s)p(s)}{\sum_{s'} p(\bar{w}|s')p(s')}

Where s is the sentiment and \bar{w} is the set of words in the tweet.

p(s|\bar{w}) = \frac{p(\bar{w}|s)}{\sum_{s'} p(\bar{w}|s')}

This assumes that the prior probabilities of each sentiment are equal, p(s) = p(s’), so that the priors cancel.

p(s|\bar{w}) = \left(\frac{\sum_{s'}{p(\bar{w}|s')}}{p(\bar{w}|s)}\right)^{-1}

p(s|\bar{w}) = \left(\sum_{s'} {\frac{ p(\bar{w}|s')}{p(\bar{w}|s)}} \right)^{-1}

p(s|\bar{w}) = \left(\sum_{s'} {\frac{ e^{log(p(\bar{w}|s'))}}{e^{log(p(\bar{w}|s))}}} \right)^{-1}

p(s|\bar{w}) = \left(\sum_{s'} {e^{log (p(\bar{w}|s')) - log(p(\bar{w}|s))}} \right)^{-1}

p(s|\bar{w}) = \left(\sum_{s'} {e^{\sum_{w \in \bar{w}} {log (p(w|s')) - log(p(w|s))}}} \right)^{-1}

p(s|\bar{w}) = \left(\sum_{s'} {e^{\sum_{w \in \bar{w}} {log (p(w,s')) - log(p(w,s)) - log(p(s')) + log(p(s))}}} \right)^{-1}

Again assuming the prior probabilities of each sentiment are equal:

p(s|\bar{w}) = \left(\sum_{s'} {e^{\sum_{w \in \bar{w}} {log (p(w,s')) - log(p(w,s))}}} \right)^{-1}

Therefore, since we only have two sentiments in this case:

p(happy|\bar{w}) = \left(e^{\sum_{w \in \bar{w}} {log (p(w,sad)) - log(p(w,happy))}} + 1 \right)^{-1}

 

The attached Python code does exactly this; playing with it should give you a good idea of how to extend it.
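For reference, the final two-sentiment formula is small enough to sketch independently of the attached code. This is my own minimal illustration, assuming equal priors as in the derivation; `p_happy` is a hypothetical name, and the overflow guard is a practical detail that is not part of the derivation.

```python
import math

def p_happy(happy_log_prob, sad_log_prob):
    """p(happy | words) = 1 / (exp(sad - happy) + 1), assuming equal priors.

    The arguments are the summed log p(w, happy) and log p(w, sad).
    """
    diff = sad_log_prob - happy_log_prob
    if diff > 700:
        # math.exp would overflow; the probability is effectively zero.
        return 0.0
    return 1.0 / (math.exp(diff) + 1.0)

# Using only the entry for "happy" quoted earlier in this post.
probability = p_happy(-5.63706, -8.43618)
```

Because there are only two sentiments, the probability of sad is just one minus this value, and equal scores give exactly 0.5.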

3. Important considerations

  • Sentiment analysis is hard. You will probably be disappointed with the results of any current sentiment analysis technique on an individual tweet, but when looking at aggregated data, you can still provide meaningful insights and find meaningful correlations.
  • These word lists are from a general sample of Twitter, with no attempt to normalize for language. This means the list should work well for the current distribution of languages on Twitter, which is predominantly English, with a lot of Spanish and Indonesian.
  • The list will still be very noisy. This list was trained from a set of ~5 million tweets, with nearly all spam removed (to the best of my ability). However, there is still a lot of garbage on Twitter so don’t be surprised if you see entries that don’t seem to make any sense.
  • The list is ordered by the happy sentiment log probability, which is not the same as ordering by which words are happiest. That quantity is hard to define, but it is closer to the difference between the happy and sad log probabilities.
  • I don’t claim that this list will be useful for any particular application and it is provided as is, with no support.
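The point above about ordering can be illustrated with a toy table. Apart from the entry for "happy", which is quoted earlier in this post, the numbers below are invented purely for illustration, and `rank_by_happiness` is a hypothetical helper.

```python
def rank_by_happiness(table):
    """Order words by log p(w, happy) - log p(w, sad), happiest first."""
    return sorted(table, key=lambda w: table[w][0] - table[w][1], reverse=True)

table = {
    "happy": (-5.63706, -8.43618),  # values quoted in this post
    "the": (-3.0, -3.0),            # invented: common but neutral
    "ugh": (-9.0, -6.5),            # invented: rarer and mostly sad
}

by_difference = rank_by_happiness(table)
by_happy_only = sorted(table, key=lambda w: table[w][0], reverse=True)
```

Sorting by the happy log probability alone puts the very common word "the" first, while sorting by the difference puts "happy" first, which is usually closer to what you want.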

4. The list

Once again, the list and Python code are available here: Twitter sentiment analysis word list.

 

About me

My name is Alex Davies and I’m a PhD student studying in the Machine Learning Group at Cambridge University under Zoubin Ghahramani.

My Huffington Post blog is here.


