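Topic Modeling in R - PathFinder

Topic modeling in R here means using the LDA algorithm to classify documents into topics. It helps in understanding the meaning and context of documents.

1. Understanding topic modeling

Understanding the LDA model

LDA (Latent Dirichlet Allocation) is the most widely used topic modeling algorithm.

A topic is a mixture of words

Each topic contains many words, each with a different probability.
The same word can belong to several topics, with a different probability in each.

A document is a mixture of topics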
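
The two mixture ideas can be sketched with a few made-up numbers. Everything below (the words, the two topics, and the probabilities) is an illustrative assumption, not output from the comment data. The sketch shows that each topic is a probability distribution over words, and that a document's word probabilities follow from weighting the topics by the document's topic proportions.

```r
# Toy example: 2 topics over a 3-word vocabulary (all values are made up)
beta <- rbind(topic1 = c(award = 0.6, movie = 0.3, politics = 0.1),
              topic2 = c(award = 0.1, movie = 0.2, politics = 0.7))

# Each topic is a probability distribution over words, so each row sums to 1
rowSums(beta)

# One document as a mixture of the two topics (its topic proportions)
gamma_doc <- c(topic1 = 0.8, topic2 = 0.2)

# Probability of each word in this document:
# sum over topics of (topic proportion) x (word probability within the topic)
word_prob <- drop(gamma_doc %*% beta)
word_prob
```

Because the document leans toward topic1, "award" ends up as its most probable word even though "politics" dominates topic2.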

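2. Building an LDA model for topic modeling in R

1) Preprocessing

Basic preprocessing

Remove duplicate documents
- Use distinct() from the dplyr package to remove duplicate comments
- Pass .keep_all = T to keep all other variables
Remove short documents

# Load comments on news articles about the film Parasite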
library(readr)
library(dplyr)

raw_news_comment <- read_csv("news_comment_parasite.csv") %>%
  mutate(id = row_number())

library(stringr)
library(textclean)

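# Basic preprocessing
news_comment <- raw_news_comment %>%
  # Replace everything except Hangul syllables with a space, then trim extra spaces
  mutate(reply = str_replace_all(reply, "[^가-힣]", " "),
         reply = str_squish(reply)) %>%

  # Remove duplicate comments
  distinct(reply, .keep_all = T) %>%

  # Remove short documents - keep comments with 3 or more words
  filter(str_count(reply, boundary("word")) >= 3)

Extracting nouns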

library(tidytext)
library(KoNLP)

# Extract nouns
comment <- news_comment %>%
  unnest_tokens(input = reply,
                output = word,
                token = extractNoun,
                drop = F) %>%
  filter(str_count(word) > 1) %>%   # keep words of 2 or more characters

  # Remove duplicate words within each comment
  group_by(id) %>%
  distinct(word, .keep_all = T) %>%
  ungroup() %>%
  select(id, word)

comment

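Removing high-frequency words

Words that appear very frequently tend to show up in most topics, which makes it hard to tell the topics apart.

# Keep only words that occur 200 times or fewer across all comments
count_word <- comment %>%
  add_count(word) %>%
  filter(n <= 200) %>%
  select(-n)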
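
To see what the add_count() and filter() steps do, here is a minimal sketch on a hypothetical six-row table. The data and the threshold of 2 are assumptions chosen so the effect is visible; the real code uses 200.

```r
library(dplyr)

# Hypothetical toy data: one row per (document id, word)
toy <- tibble(id   = c(1, 1, 2, 2, 3, 3),
              word = c("movie", "award", "movie", "director", "movie", "award"))

# add_count() attaches each word's total frequency as column n
toy_filtered <- toy %>%
  add_count(word) %>%
  filter(n <= 2) %>%   # toy threshold; "movie" occurs 3 times and is dropped
  select(-n)

toy_filtered
```

Unlike count(), add_count() keeps one row per original observation, so the document ids survive the filtering.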

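Removing stopwords and handling synonyms

4.1 Checking for stopwords and synonyms

Words such as “들이”, “하다”, and “하게” carry no identifiable meaning on their own and should be removed. Words excluded from the analysis are called stopwords.

# Check for stopwords and synonyms
count_word %>%
  count(word, sort = T) %>%
  print(n = 200)

4.2 Creating a stopword list

# Create a stopword list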
stopword <- c("들이", "하다", "하게", "하면", "해서", "이번", "하네",
              "해요", "이것", "니들", "하기", "하지", "한거", "해주",
              "그것", "어디", "여기", "까지", "이거", "하신", "만큼")

4.3 Removing stopwords and replacing synonyms

Synonyms are replaced with recode() from the dplyr package.
recode() replaces specific values with other values.
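
A minimal sketch of how recode() behaves on a character vector. The English words are hypothetical stand-ins for the Korean synonyms handled in this post. Values named on the left are replaced, and anything not listed passes through unchanged.

```r
library(dplyr)

# Hypothetical word vector for illustration
words <- c("congratz", "congrats", "usa", "movie")

# Old value on the left, new value on the right
recoded <- recode(words,
                  "congratz" = "congrats",
                  "usa"      = "USA")
recoded
```

Recent dplyr releases mark recode() as superseded, but it still works as shown.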

# Remove stopwords and replace synonyms
count_word <- count_word %>%
  filter(!word %in% stopword) %>%
  mutate(word = recode(word,
                       "자랑스럽습니" = "자랑",
                       "자랑스럽" = "자랑",
                       "자한" = "자유한국당",
                       "문재" = "문재인",
                       "한국의" = "한국",
                       "그네" = "박근혜",
                       "추카" = "축하",
                       "정경" = "정경심",
                       "방탄" = "방탄소년단"))

Creating a stopword list as a tibble

# Create the stopword list as a tibble
stopword <- tibble(word = c("들이", "하다", "하게", "하면", "해서", "이번", "하네",
                            "해요", "이것", "니들", "하기", "하지", "한거", "해주",
                            "그것", "어디", "여기", "까지", "이거", "하신", "만큼"))

# Save the stopword list
library(readr)
write_csv(stopword, "stopword.csv")

# Load the stopword list
stopword <- read_csv("stopword.csv")

# Remove stopwords
count_word <- count_word %>%
  filter(!word %in% stopword$word)

Removing stopwords with anti_join()

count_word <- count_word %>%
  anti_join(stopword, by = "word")
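
The filter() and anti_join() approaches should keep the same words. The sketch below checks this on hypothetical toy tibbles; the names toy_words and toy_stop are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy_words <- tibble(word = c("movie", "it", "award", "this"))
toy_stop  <- tibble(word = c("it", "this"))

# Approach 1: filter against the stopword vector
by_filter <- toy_words %>%
  filter(!word %in% toy_stop$word)

# Approach 2: anti_join() keeps rows of toy_words with no match in toy_stop
by_join <- toy_words %>%
  anti_join(toy_stop, by = "word")

# Both approaches keep the same words
identical(by_filter$word, by_join$word)
```

anti_join() is convenient when the stopword list is already stored as a table, for example after reading it back from a CSV file.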

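2) Building the LDA model

Creating a Document-Term Matrix

An LDA model is built from a DTM (Document-Term Matrix).
In a DTM, each row is a document, each column is a word, and each cell holds the word's frequency in that document.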
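
To make the row/column layout concrete, here is a small sketch that builds a dense document-term table from hypothetical counts using base R's xtabs(). This only illustrates the shape; it is not the sparse DocumentTermMatrix object that the LDA code needs.

```r
# Hypothetical per-document word counts (same shape as the output of count(id, word))
toy_counts <- data.frame(id   = c(1, 1, 2, 3, 3),
                         word = c("award", "movie", "movie", "award", "director"),
                         n    = c(2, 1, 1, 1, 3))

# Rows = documents, columns = words, cells = frequencies (0 when the word is absent)
toy_dtm <- xtabs(n ~ id + word, data = toy_counts)
toy_dtm
```

Most cells in a real DTM are 0, which is why packages store it in a sparse format.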

1.1 Counting words per document

# Count words per document
count_word_doc <- count_word %>%
  count(id, word, sort = T)

count_word_doc

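1.2 Creating the DTM – cast_dtm()

cast_dtm() from the tidytext package turns per-document word counts into a DTM.
cast_dtm() requires the tm package to be installed.
- document : the variable that identifies documents
- term : the word
- value : the word frequency

install.packages("tm")

# Create the DTM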
dtm_comment <- count_word_doc %>%
  cast_dtm(document = id, term = word, value = n)

dtm_comment

# Inspect the DTM contents
as.matrix(dtm_comment)[1:8, 1:8]

Building the LDA model – LDA()

install.packages("topicmodels")
library(topicmodels)

# Build the topic model
lda_model <- LDA(dtm_comment,
                 k = 8,   # number of topics
                 method = "Gibbs",
                 control = list(seed = 1234))
lda_model

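# Inspect the model contents
glimpse(lda_model)

lda_model
- beta : the probability of each word within each topic (stored on a log scale in the model object).
- gamma : the probability that each document belongs to each topic.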

3. Examining the top words in each topic

beta : the probability of a word within a topic

1) Extracting per-topic word probabilities (beta) – tidy()

Extracting beta

term_topic <- tidy(lda_model, matrix = "beta")
term_topic

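Inspecting beta

# Number of words per topic
term_topic %>%
  count(topic)

# Sum of beta for topic 1 (word probabilities within a topic add up to 1)
term_topic %>%
  filter(topic == 1) %>%
  summarise(sum_beta = sum(beta))

Checking one word's probability in each topic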

term_topic %>%
  filter(term == "작품상")

2) Examining the top words in each topic

Words with the highest beta in a given topic

term_topic %>%
  filter(topic == 6) %>%
  arrange(-beta)

Top words for every topic – terms()

terms(lda_model, 20) %>%
  data.frame()

3) Visualizing the top words in each topic

Extracting the words with the highest beta in each topic

# Extract the top 10 words by beta in each topic
top_term_topic <- term_topic %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)

Drawing a bar chart

install.packages("scales")
library(scales)
library(ggplot2)

ggplot(top_term_topic,
       aes(x = reorder_within(term, beta, topic),
           y = beta,
           fill = factor(topic))) +
  geom_col(show.legend = F) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(n.breaks = 4,
                     labels = number_format(accuracy = .01)) + 
  labs(x = NULL) +
  theme(text = element_text(family = "nanumgothic"))

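4. Classifying documents by topic

gamma is the probability that a document belongs to each topic.
gamma can be used to classify documents by topic.

1) Extracting per-document topic probabilities (gamma) – tidy()

Extracting gamma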

doc_topic <- tidy(lda_model, matrix = "gamma")
doc_topic

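Inspecting gamma – the values are probabilities, so they add up to 1 within each document

# Number of documents per topic
doc_topic %>%
  count(topic)

# Sum of gamma for document 1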
doc_topic %>%
  filter(document == 1) %>%
  summarise(sum_gamma = sum(gamma))

2) Classifying each document into its most probable topic

Extracting the most probable topic for each document

# Extract the most probable topic for each document
doc_class <- doc_topic %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)

doc_class

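Attaching the most probable topic number to the original text

# Convert to integer so the join key matches the type of id
doc_class$document <- as.integer(doc_class$document)

# Attach topic numbers to the original text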
news_comment_topic <- raw_news_comment %>%
  left_join(doc_class, by = c("id" = "document"))

# -------------------------------------------------------------------------
# Check the join
news_comment_topic %>%
  select(id, topic)

Number of documents per topic

news_comment_topic %>%
  count(topic)

# Drop rows containing NA, e.g. comments removed during preprocessing that have no topic
news_comment_topic <- news_comment_topic %>%
  na.omit()

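Classifying each document into a single topic

# Find documents tied for the highest gamma across 2 or more topics
doc_topic %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  count(document) %>%
  filter(n >= 2)

# Classify each document into a single topic
# slice_sample() : randomly picks one row among those tied for the highest gamma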
set.seed(1234)
doc_class_unique <- doc_topic %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  slice_sample(n = 1)

doc_class_unique

# Count rows per document (each document should now appear once)
doc_class_unique %>%
  count(document, sort = T)
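
The tie behavior of slice_max() is easy to miss, so here is a minimal sketch on hypothetical gamma values where document "a" is tied between topics 1 and 2. It also shows with_ties = FALSE as a non-random alternative to the slice_sample() step; which approach suits the analysis is a judgment call.

```r
library(dplyr)

# Hypothetical gamma values: document "a" has a tie between topics 1 and 2
toy_gamma <- tibble(document = c("a", "a", "a", "b", "b", "b"),
                    topic    = c(1, 2, 3, 1, 2, 3),
                    gamma    = c(0.4, 0.4, 0.2, 0.1, 0.7, 0.2))

# slice_max() keeps every tied row by default, so "a" appears twice
with_ties <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per document
one_each <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(one_each)
```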

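3) Visualizing document counts and top words by topic

Building a list of top words per topic

# Extract the 6 most probable words per topic; with_ties = F returns exactly 6 rows even when probabilities are tied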
top_terms <- term_topic %>%
  group_by(topic) %>%
  slice_max(beta, n = 6, with_ties = F) %>%
  summarise(term = paste(term, collapse = ", "))

top_terms

Counting documents per topic

count_topic <- news_comment_topic %>%
  count(topic)

count_topic

Joining top words to the document counts

count_topic_word <- count_topic %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic_name = paste("Topic", topic))

count_topic_word

Bar chart of document counts and top words by topic

ggplot(count_topic_word,
       aes(x = reorder(topic_name, n),
           y = n,
           fill = topic_name)) +
  geom_col(show.legend = F) +
  coord_flip() +

  geom_text(aes(label = n) ,                # show document count
            hjust = -0.2) +                 # place outside the bar

  geom_text(aes(label = term),              # show top words
            hjust = 1.03,                   # place inside the bar
            col = "white",                  # text color
            fontface = "bold",              # bold
            family = "nanumgothic") +       # font

  scale_y_continuous(expand = c(0, 0),      # reduce the gap between y-axis and bars
                     limits = c(0, 820)) +  # y-axis range
  labs(x = NULL)

5. Naming the topics

1) Reading the key documents in each topic and naming the topics

Cleaning the original text for readability and sorting by descending gamma

comment_topic <- news_comment_topic %>%
  mutate(reply = str_squish(replace_html(reply))) %>%
  arrange(-gamma)

comment_topic %>%
  select(gamma, reply)

Reading documents that contain top words

# Inspect topic 1
comment_topic %>%
  filter(topic == 1 & str_detect(reply, "작품")) %>%
  head(50) %>%
  pull(reply)

comment_topic %>%
  filter(topic == 1 & str_detect(reply, "진심")) %>%
  head(50) %>%
  pull(reply)

comment_topic %>%
  filter(topic == 1 & str_detect(reply, "정치")) %>%
  head(50) %>%
  pull(reply)

Creating a list of topic names

# Create a list of topic names
name_topic <- tibble(topic = 1:8,
                     name = c("1. 작품상 수상 축하, 정치적 댓글 비판",
                              "2. 수상 축하, 시상식 감상",
                              "3. 조국 가족, 정치적 해석",
                              "4. 새 역사 쓴 세계적인 영화",
                              "5. 자랑스럽고 감사한 마음",
                              "6. 놀라운 4관왕 수상",
                              "7. 문화계 블랙리스트, 보수 정당 비판",
                              "8. 한국의 세계적 위상"))

2) Visualizing top words with topic names

# Join topic names
top_term_topic_name <- top_term_topic %>%
  left_join(name_topic, by = "topic")

top_term_topic_name

# -------------------------------------------------------------------------
# Draw the bar chart
ggplot(top_term_topic_name,
       aes(x = reorder_within(term, beta, name),
           y = beta,
           fill = factor(topic))) +
  geom_col(show.legend = F) +
  facet_wrap(~ name, scales = "free", ncol = 2) +
  coord_flip() +
  scale_x_reordered() +

  labs(title = "영화 기생충 아카데미상 수상 기사 댓글 토픽",
       subtitle = "토픽별 주요 단어 Top 10",
       x = NULL, y = NULL) +

  theme_minimal() +
  theme(text = element_text(family = "nanumgothic"),
        title = element_text(size = 12),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

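6. Choosing the optimal number of topics

Ways to choose the number of topics

Choose the number of topics by reading the model and judging how interpretable it is
Choose the number of topics by comparing performance metrics across several models
1. Hyperparameter tuning : the task of comparing performance metrics across models to find the optimal value
Use both approaches together

Choosing the number of topics by hyperparameter tuning

Build several LDA models with different numbers of topics – FindTopicsNumber()
1. dtm : Document Term Matrix. Here, pass dtm_comment
2. topics : range of topic counts to compare. Here, 2:20
3. return_models : whether to keep the models. Default is FALSE; set to TRUE to keep them.
4. control = list(seed = 1234) : fixes the random seed so results are reproducible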

install.packages("ldatuning")
library(ldatuning)

models <- FindTopicsNumber(dtm = dtm_comment,
                           topics = 2:20,     # 19 models
                           return_models = T,
                           control = list(seed = 1234))

# Griffiths2004 performance metric for each model
models %>%
  select(topics, Griffiths2004)

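Choosing the optimal number of topics
1. Passing models to FindTopicsNumber_plot() draws a line chart of the performance metrics.
2. The x-axis is the number of topics; the y-axis is the metric rescaled to the range 0 to 1 by min-max normalization.
3. For Griffiths2004, the better the model, the larger the y value.
4. Among the candidates, the best model scores 1 and the worst scores 0.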
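
The min-max rescaling described above can be sketched in base R. The score values below are hypothetical, not real results from FindTopicsNumber(), and the sketch assumes a higher-is-better metric such as Griffiths2004.

```r
# Hypothetical metric values for five candidate models (not real results)
score <- c(-5200, -5050, -4980, -5010, -5100)

# Min-max normalization: the lowest value maps to 0 and the highest to 1
normalized <- (score - min(score)) / (max(score) - min(score))
normalized

# Position of the best-scoring candidate
which.max(normalized)
```

Rescaling changes only the axis, not the ranking, so the candidate with the highest raw score is also the one that reaches 1.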

FindTopicsNumber_plot(models)

Extracting a model

# Extract the model with 8 topics
optimal_model <- models %>%
  filter(topics == 8) %>%
  pull(LDA_model) %>%              # pull the model column
  .[[1]]                           # take the model out of the list

# optimal_model
tidy(optimal_model, matrix = "beta")

# lda_model
tidy(lda_model, matrix = "beta")

For sentiment analysis in R, see the linked post.
