Tracing and Analysing Mobility with Twitter and Neo4j

A warm welcome to our new collaborator, Fabio

Fabio Lamanna is a civil transportation engineer with a Ph.D., in love with all things related to mobility, data and network dynamics. He has worked as a consultant for public administrations and private companies on transportation network analysis, urban planning and traffic science. He spent two years at IFISC in Palma de Mallorca as a post-doc researcher, playing and working with data coming from social networks, cities and transportation systems.

He joined LARUS a couple of months ago, and we would love to introduce him with a first post about his work on Twitter data, mobility and Neo4j.

Introduction

For a couple of years I have been exploring new data sources in the field of transportation engineering that may complement and consolidate traditional information about the mobility of people. Twitter has emerged as a very good data source for estimating users’ displacements, comparable to census data. Such datasets are usually “big” enough to require new technologies for (among other things) storage and management. Thanks to LARUS’s support, I became deeply involved in “graph database” technology, focusing on and getting certified in Neo4j, the world leader in the field.

In this article I present a case study on Twitter users “overlapping” the 25 busiest airports in Europe over the last three years. The dataset is built from users who passed through an airport area at least once while emitting a tweet. Each user is then traced in their movements back and forth in time.

Modelling and Importing Data in Neo4j

The original data stream from the Twitter API comes as a .json file, which is reduced to a single .csv with selected fields/types:

  • user_ID – user id;
  • tweet_ID – tweet id;
  • datetime – date and time of the emission of each tweet;
  • longitude – coordinate;
  • latitude – coordinate;
  • twitter_string – string of the tweet.
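
As a rough illustration of this reduction step, the sketch below flattens tweet objects into rows with the six fields listed above. It assumes the classic Twitter API v1.1 tweet layout (id, text, created_at, user.id and a GeoJSON coordinates point ordered longitude, latitude) and one JSON object per line; the function names are hypothetical, and this is not the script used to build the original dataset.

```python
import csv
import io
import json

FIELDS = ["user_ID", "tweet_ID", "datetime", "longitude", "latitude", "twitter_string"]


def tweet_to_row(tweet):
    """Flatten one tweet dict into a CSV row, or return None if it has no coordinates."""
    point = tweet.get("coordinates")
    if not point:
        return None  # only geolocated tweets are kept (assumption)
    longitude, latitude = point["coordinates"]  # GeoJSON order: longitude first
    return {
        "user_ID": tweet["user"]["id"],
        "tweet_ID": tweet["id"],
        "datetime": tweet["created_at"],
        "longitude": longitude,
        "latitude": latitude,
        "twitter_string": tweet["text"],
    }


def json_lines_to_csv(lines, out):
    """Write one CSV row per geolocated tweet found in an iterable of JSON lines."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        row = tweet_to_row(json.loads(line))
        if row is not None:
            writer.writerow(row)
            written += 1
    return written
```

Passing an open .json file as lines and an open .csv file as out would produce the single reduced file; the function returns the number of rows written.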

Another database records which users/tweets were emitted within the airport areas only (fields user_ID and tweet_ID). Here we are interested in the text of the tweet, because of a possible correlation between the “sentiment” and keywords used by users and events such as strikes, delays, etc., and in order to determine a sort of quality of service when airline and/or route names are found in the tweets.

The .csv data were then imported into Neo4j in batch with Cypher’s LOAD CSV clause.

The data model is based on the following nodes and relationships:

(:User)-[:VISITED]->(:Loc)
(:User)-[:WRITES]->(:Tweet)
(:Tweet)-[:EMITTED_IN]->(:Loc)
(:Loc)-[:IS_WITHIN]->(:Airport)
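
A minimal sketch of how an import for the User, Tweet and Loc part of this model could look is given below, as a Python function that returns a LOAD CSV statement. The file name tweets.csv, the property names other than user_id and twitter_string (which appear in the queries later in this post), and the choice of EMITTED_IN for the tweet-to-location relationship (as used in those queries) are assumptions; the actual import script is not shown here. toInteger and toFloat are the conversion functions of recent Neo4j versions.

```python
def build_import_statement(csv_url="file:///tweets.csv"):
    """Return a Cypher LOAD CSV statement for the User/Tweet/Loc part of the model."""
    # Doubled braces are literal braces inside the f-string.
    return f"""
LOAD CSV WITH HEADERS FROM '{csv_url}' AS row
MERGE (u:User {{user_id: toInteger(row.user_ID)}})
MERGE (t:Tweet {{tweet_id: toInteger(row.tweet_ID)}})
  SET t.datetime = row.datetime, t.twitter_string = row.twitter_string
MERGE (l:Loc {{longitude: toFloat(row.longitude), latitude: toFloat(row.latitude)}})
MERGE (u)-[:WRITES]->(t)
MERGE (t)-[:EMITTED_IN]->(l)
MERGE (u)-[:VISITED]->(l)
""".strip()
```

MERGE is used instead of CREATE so that a user or location appearing in many rows ends up as a single node. The IS_WITHIN links to Airport nodes would come from the second database and are left out of this sketch.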

[Figure: data model]

Analysis and Queries on the Database

Here we present some queries that show the potential of Neo4j and of the py2neo Python package; we can easily obtain information about automatic users (bots) and perform keyword analysis. As a subset, we refer to the tweets emitted in the last three years within Zurich (ZRH) airport.

Filtering Automatic (Bot) Users

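Within Neo4j we can count the total number of tweets emitted by each user, and how many users are tweeting the same text. Since Cypher groups by every non-aggregated column in RETURN, the two counts need separate queries:

MATCH (u:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(:Loc)-[:IS_WITHIN]->(:Airport)
RETURN u AS user, count(t) AS nTweets
ORDER BY nTweets DESC;

MATCH (u:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(:Loc)-[:IS_WITHIN]->(:Airport)
RETURN t.twitter_string AS tweet, count(DISTINCT u) AS nUsers
ORDER BY nUsers DESC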

We then inspect the first user in the ranking (the first suspected bot) with:

MATCH (u:User {user_id:520225342})-[:WRITES]->(t:Tweet)
RETURN t.twitter_string AS string

A quick look at the data allows us to identify the user as a bot; in this case it is an automatic generator of weather information within the airport area. To speed up the detection of such behaviour, we may evaluate some similarity measures on the text. The Hamming distance offers a simple approach: it counts the positions at which two strings of equal length differ, i.e. how many characters must be substituted to transform one string into the other. The metric can be built outside Neo4j in Python thanks to py2neo, which can query the database, fetch results and run analyses directly from within a script. Results show that the first 30 characters of each string are almost the same, according to the following distribution:

Distribution of the Hamming distance over the first 30 characters of the potential bot’s tweets.
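
The metric described above can be sketched in a few lines of Python. The sketch assumes that the distribution is computed over all pairs of tweets from the same user and that tweets shorter than 30 characters are padded with spaces; neither detail is stated in the original analysis, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def prefix_distance_distribution(texts, prefix_len=30):
    """Histogram of pairwise Hamming distances over fixed-length tweet prefixes."""
    # Truncate to prefix_len and pad short tweets so all prefixes have equal length.
    prefixes = [t[:prefix_len].ljust(prefix_len) for t in texts]
    return Counter(hamming_distance(a, b) for a, b in combinations(prefixes, 2))
```

For example, hamming_distance("karolin", "kathrin") returns 3. A user whose distribution is concentrated on small distances is repeating a template, which is the pattern expected from an automatic account.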

 

Keywords Analysis

It is useful to extract from the database structures and patterns that correspond to common words each user links to their travel experience. Here we show a simple approach: finding the word “delay” in tweets emitted by users OUTSIDE the airport area:

MATCH p_OUT=(:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(l:Loc)
WHERE NOT (l)-[:IS_WITHIN]->(:Airport) AND t.twitter_string CONTAINS 'delay'
RETURN p_OUT

[Figure: tweets containing “delay” emitted outside the airport area]

and, on the other hand, WITHIN Zurich airport:

MATCH p_IN=(:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(:Loc)-[:IS_WITHIN]->(:Airport)
WHERE t.twitter_string CONTAINS 'delay'
RETURN p_IN

[Figure: tweets containing “delay” emitted within Zurich airport]

It is therefore extremely easy to select subsets of users according to both the content of the message and the visited locations. With two simple queries we split the data into one subgraph of users on which to apply a “sentiment” analysis (p_IN), and another useful for investigating users’ movements in space and time (p_OUT).
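
The same two selections could be driven from a Python script through py2neo, as in the sketch below. The connection URI and credentials are placeholders, the helper names are hypothetical, and the $keyword parameter syntax assumes a reasonably recent Neo4j version; the queries return user ids and tweet texts rather than whole paths to keep the results easy to post-process.

```python
KEYWORD_QUERY_IN = """
MATCH (u:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(:Loc)-[:IS_WITHIN]->(:Airport)
WHERE t.twitter_string CONTAINS $keyword
RETURN u.user_id AS user, t.twitter_string AS tweet
"""

KEYWORD_QUERY_OUT = """
MATCH (u:User)-[:WRITES]->(t:Tweet)-[:EMITTED_IN]->(l:Loc)
WHERE NOT (l)-[:IS_WITHIN]->(:Airport)
  AND t.twitter_string CONTAINS $keyword
RETURN u.user_id AS user, t.twitter_string AS tweet
"""


def connect(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a py2neo connection; URI and credentials are placeholders."""
    from py2neo import Graph  # imported lazily so the helper below works without it
    return Graph(uri, auth=(user, password))


def keyword_tweets(graph, keyword, inside_airport):
    """Return rows (user, tweet) containing the keyword, inside or outside airports."""
    query = KEYWORD_QUERY_IN if inside_airport else KEYWORD_QUERY_OUT
    # py2neo passes keyword arguments to the query as Cypher parameters.
    return graph.run(query, keyword=keyword).data()
```

A call such as keyword_tweets(connect(), "delay", inside_airport=True) would return a list of dictionaries ready for a sentiment analysis step. Note that CONTAINS is case-sensitive, so “Delay” is matched only if both sides are normalised first, for instance with toLower.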

Outlook

We presented some of the advantages that a graph database like Neo4j may offer in the field of transportation engineering. With very simple queries we quickly extracted subsets of data to be analysed later within Neo4j or in external scripts and environments. This makes Neo4j a strong partner for transportation engineers as well, whenever data have to be stored and queried in a fast and easy way. Further analyses are now under way on tracing users’ consecutive locations to infer a likely place of residence, and from there towards the development of a potential Origin/Destination matrix.

More information about modelling, import and data analysis is available on my GitHub page and on my website.

Acknowledgements

Twitter data have been retrieved for research purposes by IFISC. Thanks to LARUS for the introduction to graph database technologies and for the support in the data management.
