Elena Daehnhardt, Mining Microblogs for Culture-awareness in Web Adaptation
12 Mar 2019
Prior studies in sociology and human-computer interaction indicate that people from different countries and cultural backgrounds tend to have distinct preferences in real-life communication and in how they use web and social media applications. To adapt web applications to personal cultural needs, we could ask users to share their locations and then serve them the relevant design and content. There are also automatic means of approximating user location from the IP address and web client information, but this functionality is often inaccurate (consider travellers or expatriates) or unavailable. Social media, in contrast, provides a wealth of information on user traits, which can be used to adapt applications to personal needs.
I have suggested inferring user cultural origins from microblogging patterns, that is, from how social media features are used. For instance, people microblogging from the Netherlands and Germany are more likely to share links, while people from Japan reply the most. This is only one way to infer user origins, however; there are more cues shared on social media that can reveal them. I have experimented with several methods of finding user origins based on Twitter data. Please read on to find out how.
Why is this important? Location data is not always at hand. If we look at Twitter data, only a small percentage of users share their geographic locations. In our experiments, only two percent of users revealed theirs; most users had open profiles without any geographic location attached.
Moreover, user location often does not match user origins. Many people travel, live abroad, and often do not fit their cultural stereotype at all. This work is against any kind of stereotyping and profiling. Instead, it demonstrates the application of machine learning tools for inferring cultural origins, which might differ from the expected result or from the location reported by the user's software agent. Even so, the inferred origins might be a close enough approximation for providing a state-of-the-art user experience. It is important to mention that individual users should be aware of how easily personal characteristics such as location can be mined from their social data.
My PhD thesis findings reveal statistically significant differences in Twitter feature usage with respect to users' geographic locations. These differences in microblogger behaviour, together with the user language declared in profiles, enabled us to infer user country origins with an accuracy of more than 90%. Other origin-predictive solutions proposed in the thesis require no additional data sources or human involvement for training the models, enabling high accuracy of user country inference from information extracted from a user's follower network or from data derived from Twitter profiles.
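To make the idea concrete, here is a toy sketch of such a classifier, using the Random Forest algorithm mentioned later in this post. The feature names, ratios and country labels below are purely illustrative, not the thesis dataset:

```python
# Illustrative sketch only: infer a user's country from aggregate
# feature-usage ratios, in the spirit of the thesis experiments.
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors: [share_of_links, share_of_replies, share_of_hashtags]
X = [
    [0.60, 0.10, 0.20],  # link-heavy usage (e.g. the NL pattern)
    [0.55, 0.15, 0.25],  # link-heavy usage (e.g. the DE pattern)
    [0.10, 0.70, 0.15],  # reply-heavy usage (e.g. the JP pattern)
    [0.15, 0.65, 0.10],
]
y = ["NL", "DE", "JP", "JP"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Predict the origin of an unseen, reply-heavy user
print(clf.predict([[0.12, 0.68, 0.12]])[0])
```

The real models, of course, are trained on far richer behavioural features and many more users; this only shows the shape of the learning problem.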
Using these origin-predictive models, we analysed communication and privacy preferences and built a culture-aware recommender system. Our analysis of friend responses shows that Twitter users tend to communicate mostly within their own cultural regions. Usage of privacy settings showed that privacy perceptions differ across cultures. Finally, we created and evaluated movie recommendation strategies that consider user cultural groups, and addressed the cold-start scenario for a new user. We believe the findings discussed give insights for sociological and web research, in particular on cultural differences in online communication.
I used several computer tools to code my experiments and run statistical hypothesis tests. Writing the scripts and the prototype required some software engineering knowledge: running long jobs while collecting big data, data wrangling with Python Pandas (social media data is generally "dirty" and requires cleaning), MySQL and Redis for data storage, Bash scripting for routine automation such as dataset archiving, and Celery for running asynchronous tasks, amongst other tools. Even though my initial intent was to use Java, I quickly switched to Python and never looked back. Python has everything needed for running statistical tests and for rapidly applying machine learning algorithms, such as those provided by scikit-learn, quite recent developments of the scientific community such as Factorisation Machines by Rendle, and the very recent TensorFlow.
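The Pandas cleaning step usually looks something like the sketch below. The column names and values are made up for illustration, not the real crawl schema, but the operations (deduplicating re-crawled rows, dropping empty records, normalising stray whitespace) are typical:

```python
# A minimal data-wrangling sketch for "dirty" social media data.
# Columns here are illustrative, not the actual Twitter crawl schema.
import pandas as pd

raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "lang": ["en", "en", None, " nl ", "nl"],
    "text": ["hi", "hi", "hallo", "hoi", None],
})

clean = (
    raw.drop_duplicates()                            # repeated rows from re-crawls
       .dropna(subset=["text"])                      # records without tweet text
       .assign(lang=lambda d: d["lang"].str.strip()) # stray whitespace in codes
)
print(len(clean))  # 5 raw rows reduce to 3 clean ones
```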
Needless to say, this kind of prototyping requires a great deal of agility and an outlook on current developments in industry and research. Unit testing and Git for version control were very much needed. The Jupyter Notebook is another great tool that helped me run my code tests, update my performance graphs, include tables, LaTeX formulas and descriptions, and present it all in an interactive form, both locally and on GitHub.
Not every test, however, was coded from scratch. The Microsoft Azure platform provided a great set of machine learning tools, which let me test initial ideas very quickly. Their generous gift of server resources was very helpful at the critical moment when the project could have slipped into scope creep. In my experience, rapid tests and prototyping are excellent means of checking hypotheses and running pilot tests. Especially in research, we have a set of hypotheses to be tested, and most of them are hard to judge beforehand. For instance, would factoring in user cultural origins be beneficial for movie recommendations? Movie recommendation tests in Microsoft Azure showed me that not all recommendation algorithms could benefit from the inferred user origins, and the recommendation performance became a cornerstone of my thesis chapter "Culture-aware Social Recommenders".
Overall, I learned about social media, machine learning and programming patterns, worked with different data structures, updated my knowledge of statistics, and published papers and gave talks at international conferences. In short, the main technical skills acquired or updated during the five years of this project were:
The Python programming language for coding the experiments, collecting data, and analysing and visualising it. Python is not only fully equipped with Artificial Intelligence tools, it is also very easy to learn.
The Microsoft Azure Machine Learning platform, scikit-learn, Factorisation Machines, Random Forests, regression models and other tools.
Python Pandas and stats packages were very useful for brushing up my datasets and for their statistical analysis. I could not have done without t-tests, Welch's tests and ANOVA.
Agile software development and maintenance were cornerstones of the project. I think that many researchers in academia or industry benefit from tools such as Git for version control and Docker for on-the-fly portable development. To complement my agile approach with automatic builds and scalability, I am now learning Jenkins. I am going to share my experience in one of the following posts.
Running cron jobs, maintaining and cleaning my databases in MySQL and Redis storage, and reading related papers took most of my free time. Additionally, I learned how to monitor system resources such as memory consumption and CPU load.
Since the recommender experiments required periodically retraining my movie rating prediction model, Celery helped me schedule and run the training scripts.
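As a small aside on the statistics in that list: Welch's test compares two group means without assuming equal variances, which suits feature-usage ratios that vary differently per country. The statistic itself fits in a few lines of plain Python (in practice a stats package computes it together with the p-value); the sample values below are toy numbers, not thesis data:

```python
# Welch's t-statistic for two samples with unequal variances -- a sketch
# of the kind of test used to compare feature usage between user groups.
from statistics import mean, variance

def welch_t(a, b):
    """t = (mean(a) - mean(b)) / sqrt(var(a)/n_a + var(b)/n_b)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

links_group_1 = [0.61, 0.58, 0.64, 0.60]  # toy per-user link-sharing ratios
links_group_2 = [0.12, 0.15, 0.10, 0.13]
print(round(welch_t(links_group_1, links_group_2), 2))
```

A large absolute t-value, as here, indicates that the difference in means is unlikely to be noise; the p-value then comes from the t-distribution with Welch's adjusted degrees of freedom.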
I would say that basic human skills such as talking or writing are technical, too. However, when we present research results at conferences or workshops, we follow certain practices: we need to follow a presentation structure and be ready for any kind of feedback. Sometimes critical feedback is the most useful; it is important, however, not to take things personally and to learn from them :)
I have found that the unprecedented volume of available tools can make research and development easier, yet it often adds overhead when thinking about the system architecture. How can we select tools that play well together? In my opinion, community support and well-documented software are essential when deciding on application tools.
With Twitter data and statistical and machine learning tools, my research advances our understanding of microblogging with respect to cultural differences and demonstrates possible solutions for inferring and exploiting cultural origins to build adaptive web applications. The big data available on the Social Web makes it possible to better understand user needs and to create state-of-the-art applications tailored to users' cultural or personal needs. Many open source and commercial tools assist in and drive research in Artificial Intelligence applications. It is important, however, to be aware that human privacy and security are becoming even more fragile with the development of automated solutions. Personal data management should be in the hands of its owners.