

Because I've not really done much of it before, I thought I'd have a go at this for you.

There's a Kaggle Dataset containing a million headlines from ABC News, available as a .csv file. The code below augments the data with dates and randomly sampled author names from the babynames package, then converts the tibble to JSON with toJSON and exports it for analysis. That gives us a .json file that's 98Mb in size with 1.1 million headlines. The read_json function will take a very long time to import a file that big; fortunately, someone else on Community has already asked about importing very large JSON files here.
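A minimal sketch of that preparation and export step: the two mutate() calls are from the original, while the read_csv() call, the file names, and the writeLines() export line are assumptions.

```r
library(tidyverse)
library(lubridate)
library(babynames)
library(jsonlite)

# Read the Kaggle export (file name assumed), parse the dates,
# and add randomly sampled author names from babynames
abc_news <- read_csv("abcnews-date-text.csv") %>%
  mutate(publish_date = ymd(publish_date)) %>%
  mutate(authors = sample(babynames$name, nrow(.), replace = TRUE))

# Convert the tibble to JSON with toJSON and export it for analysis (~98Mb)
writeLines(toJSON(abc_news), "abc_news.json")
```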

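For the import itself, here is a hedged sketch of one faster alternative to read_json(); whether this matches the approach suggested in that Community thread is an assumption.

```r
library(jsonlite)
library(tibble)

# fromJSON() simplifies the JSON array straight into a data frame, which is
# much faster than read_json() building nested lists for 1.1 million records
abc_news <- as_tibble(fromJSON("abc_news.json"))

# Dates come back as strings after the JSON round trip, so reparse them
abc_news$publish_date <- as.Date(abc_news$publish_date)
```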
We're going to use the udpipe package to extract and tally NOUN-VERB pairs in our dataset to identify potential categories. We'll only consider headlines from 2017, as I don't want to lock up my computer for longer than a coffee break, and so that we can join this dataset with others later on I'm also adding a unique ID to each headline in abc_news_2017.

To do this we need to download the latest version of our model; note that the filename changes after each update, so make sure not to hard-code it. Once the headlines are annotated we can extract our NOUN-VERB phrases by running keywords_phrases() over headlines_annotated$phrase_tag (these steps are sketched below).

Let's look at the most common noun-verb phrases of at least two words in the ABC News headlines; they definitely look suitable as categories.

Let's split our categories according to the number of words they contain, into two_word_topics and three_word_topics. The tidytext package now becomes really useful for detecting those categories in the ngrams of our headlines: we break each headline into two- and three-word ngrams with unnest_tokens(), keep only the ngrams that appear in two_word_topics$keyword or three_word_topics$keyword, and count how many topics each headline matched (see the final sketch below).
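The annotation and keyword-extraction code is only hinted at in fragments, so here is a minimal sketch of those steps. The keywords_phrases() pattern, the plotting code, and names such as ud_model and headline_id are assumptions; the overall flow follows the udpipe documentation rather than the original post.

```r
library(udpipe)
library(tidyverse)
library(lubridate)

# 2017 headlines only, each with a unique ID we can join on later
abc_news_2017 <- abc_news %>%
  filter(year(publish_date) == 2017) %>%
  mutate(headline_id = row_number())

# Download the latest English model; the filename changes between releases,
# so take it from the object returned by udpipe_download_model()
ud_model_info <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = ud_model_info$file_model)

# Annotate the headlines (the coffee-break part)
headlines_annotated <- udpipe_annotate(
  ud_model,
  x = abc_news_2017$headline_text,
  doc_id = as.character(abc_news_2017$headline_id)
) %>%
  as.data.frame()

# Recode the universal POS tags so phrase patterns can be matched,
# then pull out NOUN-VERB phrases and tally them
headlines_annotated$phrase_tag <- as_phrasemachine(headlines_annotated$upos, type = "upos")

noun_verb_stats <- keywords_phrases(
  x = headlines_annotated$phrase_tag,
  term = tolower(headlines_annotated$token),
  pattern = "N+V+",   # assumption: one or more nouns followed by one or more verbs
  is_regex = TRUE,
  detailed = FALSE
)

# Most common phrases of at least two words
noun_verb_stats %>%
  filter(ngram >= 2) %>%
  top_n(20, freq) %>%
  mutate(keyword = fct_reorder(keyword, freq)) %>%
  ggplot(aes(keyword, freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most common noun-verb phrases of at least two words in ABC News Headlines")
```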

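Finally, a sketch of the matching step; the unnest_tokens(), filter(), and map_int() calls echo the original fragments, while the group_by()/summarise() aggregation around them is an assumption.

```r
library(tidytext)
library(tidyverse)

# Split the candidate categories by how many words they contain
two_word_topics   <- noun_verb_stats %>% filter(ngram == 2)
three_word_topics <- noun_verb_stats %>% filter(ngram == 3)

# Tokenise headlines into bigrams, keep only those matching a two-word topic,
# then collect and count the matched topics per headline
tidy_two_word_topics <- abc_news_2017 %>%
  unnest_tokens(word, headline_text, token = "ngrams", n = 2) %>%
  filter(word %in% two_word_topics$keyword) %>%
  group_by(headline_id) %>%
  summarise(two_word_topics = list(word)) %>%
  mutate(n_two_word_topics = map_int(two_word_topics, ~length(.x)))

# The same again for three-word topics
tidy_three_word_topics <- abc_news_2017 %>%
  unnest_tokens(word, headline_text, token = "ngrams", n = 3) %>%
  filter(word %in% three_word_topics$keyword) %>%
  group_by(headline_id) %>%
  summarise(three_word_topics = list(word)) %>%
  mutate(n_three_word_topics = map_int(three_word_topics, ~length(.x)))
```

From here the tagged tibbles can be joined back to abc_news_2017 by headline_id to attach the detected categories to each headline.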