Author profiling track

Author profiling is the task of extracting as much information as possible about authors in order to build a descriptive profile of them. Previous author profiling campaigns, e.g., those from the PAN labs, have focused on profiling authors in terms of gender, age and even personality traits. This track goes several steps further by focusing on profiling dimensions that have not received enough attention from the community. Specifically, it consists of determining the occupation and place of residence of users from their tweets. This is a much more challenging problem that will require adapting existing methodologies or proposing new ones for the analysis of tweets. Additionally, the track focuses on tweets written by Mexican users, which poses additional challenges related to handling a variety of Spanish with many cultural particularities.

The data set for this track was collected between June and November 2016 according to the following methodology. First, two human taggers selected a set of Twitter accounts representative of different regions of Mexico, for example, accounts of politicians, famous places, universities and city councils. Then, they searched among the followers of these accounts for users whose gender, occupation and place of residence were available, as disclosed by the users themselves in one of their social networks. The categories for each of the two profiling dimensions, together with the distribution of samples available in the corpus, are described in Tables 1 and 2.

Table 1. Distribution of samples in the author profiling data set for the occupation dimension.

Table 2. Distribution of samples in the author profiling data set for the place of residence dimension.

For this track, we will split the data into training (70%) and testing (30%) partitions. Participants will use the former to develop their methods, and the latter will be used to determine the winners of the challenge. Participants will be ranked by the macro-averaged F1 measure.
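To make the ranking measure concrete, the following is a minimal sketch of macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight, so rare classes count as much as frequent ones. This is not the organizers' evaluation script. The function name `macro_f1`, the choice to average over every class that appears in either the gold or the predicted labels, and the occupation labels in the usage example are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Assumption: the set of classes is the union of the labels seen in
    y_true and y_pred; an official scorer may fix the class list instead.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
    # because every label in the union occurs at least once.
    per_class = [
        2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels
    ]
    return sum(per_class) / len(per_class)


# Hypothetical labels, not the track's actual occupation categories.
gold = ["student", "student", "arts", "health"]
pred = ["student", "arts", "arts", "health"]
score = macro_f1(gold, pred)
```

In this example the per-class scores are 2/3 for `student`, 2/3 for `arts` and 1 for `health`, so `score` is 7/9. In practice, scikit-learn's `f1_score` with `average="macro"` can be used for the same purpose.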

The training data file is password-protected; to obtain the password, you first need to register as a participant.

Data and evaluation