Tweet-ids allow the collection of tweets from the Twitter API that are older than 9 days

This website (Footnote 2) was used as a means to get tweet-ids (Footnote 3); it provides researchers with metadata of a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). The tweet-ids allow the collection of tweets from the Twitter API that are older than 9 days (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON (Footnote 4) files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, Nusers = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
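As an illustration only, the first and third user-related criteria can be sketched as below. This is not the authors' R code: the record layout (`user`, `period`) and the `max_tweets` cut-off are hypothetical, and the cut-off is assumed to have been derived beforehand from the 99.5th percentile of user activity. The early-access criterion is omitted because it needs metadata that is not described here.

```python
from collections import Counter


def filter_users(tweets, max_tweets):
    """Keep tweets from users who are not hyperactive and who
    appear in both the pre-CLC and post-CLC periods.

    tweets: list of dicts with hypothetical keys 'user' and
            'period' (either 'pre' or 'post').
    max_tweets: hypothetical activity cut-off per user.
    """
    # Number of tweets per user across both periods.
    activity = Counter(t["user"] for t in tweets)

    # Set of periods in which each user posted.
    periods = {}
    for t in tweets:
        periods.setdefault(t["user"], set()).add(t["period"])

    keep = {
        user
        for user, n in activity.items()
        if n <= max_tweets and periods[user] == {"pre", "post"}
    }
    return [t for t in tweets if t["user"] in keep]


sample = [
    {"user": "a", "period": "pre"},
    {"user": "a", "period": "post"},
    {"user": "b", "period": "pre"},   # only present pre-CLC
    {"user": "c", "period": "pre"},
    {"user": "c", "period": "pre"},
    {"user": "c", "period": "pre"},
    {"user": "c", "period": "post"},  # 4 tweets: above the cut-off
]
kept = filter_users(sample, max_tweets=3)
```

With this toy input only the two tweets from user `a` remain, which mirrors the within-group design: every retained user contributes to both datasets.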

The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
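A minimal sketch of these text-cleaning steps follows, under stated assumptions. It is not the original R pipeline: the regular expressions and function names are illustrative, non-ASCII characters are simply dropped rather than transliterated, and any `http(s)` link is treated as a URL because separating media URLs from other URLs would require tweet metadata not shown here.

```python
import re

# Illustrative patterns; the original study's exact rules are not given.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text: str) -> bool:
    """True if the tweet text contains an http(s) link."""
    return URL_RE.search(text) is not None


def clean_tweet(text: str) -> str:
    """Drop non-ASCII characters, screen-name references, and line breaks."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = MENTION_RE.sub(" ", text)
    # split/join removes line breaks and collapses repeated whitespace.
    return " ".join(text.split())


tweets = [
    "@jan Goedemorgen!\nMooi weer vandaag",
    "Lees dit: https://example.org/a",
]
# Tweets containing a URL are excluded; the rest are cleaned.
kept = [clean_tweet(t) for t in tweets if not has_url(t)]
```

Excluding URL-bearing tweets before cleaning, rather than merely stripping the link, follows the rationale above: the link position changes how many characters the user actually had available.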

Token and bigram analysis

The R package (Footnote 5) ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation […]). Additionally, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
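Eq. (1) and (2) are not reproduced in this text, so the following is only a sketch of how such a T-score could be computed. It assumes the common Church et al.-style approximation, in which the variance of a relative frequency is estimated as f / N², giving T = (P(token post) − P(token pre)) / sqrt(f(token post) / N_post² + f(token pre) / N_pre²). The exact form used in the study (for example, any smoothing) may differ; the function name is hypothetical. The sign convention matches the description above: positive scores mean a token is relatively more frequent post-CLC.

```python
from collections import Counter
from math import sqrt


def t_scores(pre_tokens, post_tokens):
    """Per-token T-score comparing relative frequencies pre vs. post.

    Assumed approximation (not confirmed by the source):
    variance of a relative frequency is estimated as f / N**2.
    """
    if not pre_tokens or not post_tokens:
        raise ValueError("both token lists must be non-empty")

    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in f_pre.keys() | f_post.keys():
        p_pre = f_pre[token] / n_pre      # relative frequency pre-CLC
        p_post = f_post[token] / n_post   # relative frequency post-CLC
        variance = f_pre[token] / n_pre**2 + f_post[token] / n_post**2
        scores[token] = (p_post - p_pre) / sqrt(variance)
    return scores


pre = ["u", "u", "je", "de"]
post = ["je", "je", "je", "de"]
scores = t_scores(pre, post)
```

In this toy example `je` receives a positive score (more frequent post-CLC), `u` a negative score (more frequent pre-CLC), and `de` a score of zero because its relative frequency is unchanged.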

Part-of-speech (POS) analysis

The R package (Footnote 6) ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and ). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, similar analyses were performed for the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent over both datasets. Therefore, we assume there are no systematic confounds.
