1
00:00:01,120 --> 00:00:08,639
Hi, my name is Julia Silge.
I'm a data scientist and software engineer at RStudio.

2
00:00:06,080 --> 00:00:13,519
and I'd like to thank the organizers of the R user group in Korea

3
00:00:11,200 --> 00:00:18,080
so much for having me
to speak to you today.

4
00:00:15,519 --> 00:00:24,000
I am so happy to be speaking specifically today about creating

5
00:00:21,119 --> 00:00:29,359
features for machine learning from text data for a couple of reasons

6
00:00:27,199 --> 00:00:34,480
Having a better understanding of what we do to take text data

7
00:00:31,679 --> 00:00:36,880
and then to make it appropriate

8
00:00:34,480 --> 00:00:42,879
as an input for machine learning algorithms has many benefits

9
00:00:39,680 --> 00:00:45,039
both if you are directly getting ready

10
00:00:42,879 --> 00:00:50,719
to train a model or if you're at the beginning of some text analysis project

11
00:00:48,399 --> 00:00:55,440
or if you are trying to understand the behavior of a model that

12
00:00:52,800 --> 00:01:00,480
you're interacting with in some way, which is something that we do in our work as data scientists

13
00:00:57,920 --> 00:01:06,640
or in our daily lives more and more

14
00:01:03,359 --> 00:01:09,360
so when we build models for text either supervised or unsupervised

15
00:01:06,640 --> 00:01:15,759
we start with something

16
00:01:11,600 --> 00:01:17,600
like this. This is some example text data

17
00:01:15,759 --> 00:01:21,040
that I'll use a couple of times during

18
00:01:17,600 --> 00:01:24,560
this talk that describes some animals

19
00:01:21,040 --> 00:01:26,720
I'm using some text data so

20
00:01:24,560 --> 00:01:30,960
you know to me as an english

21
00:01:26,720 --> 00:01:34,400
speaker looks familiar like

22
00:01:30,960 --> 00:01:36,720
I am as someone who uses a human

23
00:01:34,400 --> 00:01:39,520
language so I look at this and I can

24
00:01:36,720 --> 00:01:43,920
read it I could speak it aloud and I understand

25
00:01:41,280 --> 00:01:45,759
I can interpret what it means

26
00:01:43,920 --> 00:01:50,000
so this kind of data this sort of

27
00:01:45,759 --> 00:01:51,200
natural language data is being generated

28
00:01:50,000 --> 00:01:56,880
all the time in all kinds of languages

29
00:01:53,600 --> 00:01:59,840
in all kinds of contexts so

30
00:01:56,880 --> 00:02:05,040
whether you work in healthcare in tech in finance

31
00:02:02,640 --> 00:02:08,479
basically any kind of organization this

32
00:02:05,040 --> 00:02:11,920
sort of text data is being generated

33
00:02:09,360 --> 00:02:17,599
by customers by clients by internal stakeholders

34
00:02:15,200 --> 00:02:24,319
inside of a business by people taking surveys

35
00:02:20,000 --> 00:02:26,720
via social media via business processes

36
00:02:24,319 --> 00:02:29,200
and in all this natural language

37
00:02:26,720 --> 00:02:33,519
there's information latent in

38
00:02:30,879 --> 00:02:35,360
that text data that can be used to make better decisions

39
00:02:33,519 --> 00:02:44,319
However, computers are not great at looking at this and doing

40
00:02:41,280 --> 00:02:47,440
math on language as it's represented like this

41
00:02:45,440 --> 00:02:50,560
and instead language has to be

42
00:02:47,440 --> 00:02:53,680
dramatically transformed to some kind of

43
00:02:50,560 --> 00:02:55,440
machine readable numeric representation

44
00:02:53,680 --> 00:02:57,280
that looks more like this what I'm

45
00:02:55,440 --> 00:02:59,760
showing here on the screen to be ready

46
00:02:57,280 --> 00:03:01,840
for almost any kind of model

47
00:02:59,760 --> 00:03:05,040
so I spent a fair amount of time working

48
00:03:01,840 --> 00:03:07,680
on software for people to be able to do

49
00:03:05,040 --> 00:03:15,120
exploratory data analysis, visualization, summarization

50
00:03:11,760 --> 00:03:17,920
tasks like that with text data in a tidy

51
00:03:15,120 --> 00:03:22,159
format where we have one observation per row

52
00:03:19,040 --> 00:03:27,280
and I love using tidy data principles for text analysis

53
00:03:24,480 --> 00:03:32,640
especially during those exploratory phases of an analysis

54
00:03:29,680 --> 00:03:34,879
when it comes time to build a model

55
00:03:32,640 --> 00:03:40,480
often what the underlying mathematical implementation really needs

56
00:03:38,000 --> 00:03:44,959
is typically something like this

57
00:03:43,040 --> 00:03:52,319
this particular representation is called the document-term matrix

58
00:03:49,760 --> 00:03:53,840
so the exact representation may differ from

59
00:03:52,319 --> 00:03:55,760
what I've shown here

60
00:03:53,840 --> 00:03:58,400
what I have here is we're weighting

61
00:03:55,760 --> 00:04:04,720
things by counts so each row in this matrix is a document

62
00:04:02,640 --> 00:04:07,439
each column is a word,

63
00:04:04,720 --> 00:04:10,080
a token, and the numbers represent

64
00:04:07,439 --> 00:04:12,319
counts how many times does each document

65
00:04:10,080 --> 00:04:13,519
use each word you could weight it in a different way

66
00:04:12,319 --> 00:04:19,359
using say TF-IDF instead of counts

67
00:04:16,720 --> 00:04:20,799
or you might keep sequence information

68
00:04:19,359 --> 00:04:25,199
if you're interested in building a deep learning model but basically

69
00:04:23,120 --> 00:04:27,919
for all kinds of text modeling

70
00:04:25,199 --> 00:04:30,639
from simpler models like Naive Bayes

71
00:04:27,919 --> 00:04:33,199
models which work well for text

72
00:04:30,639 --> 00:04:35,680
to word embeddings to really the most

73
00:04:33,199 --> 00:04:38,320
state-of-the-art kind of work that's

74
00:04:35,680 --> 00:04:41,120
happening today like transformers for text data

75
00:04:39,520 --> 00:04:44,720
we have to heavily

76
00:04:41,120 --> 00:04:47,280
feature engineer and process language to

77
00:04:44,720 --> 00:04:49,360
get it to some kind of representation

78
00:04:47,280 --> 00:04:53,680
that's suitable for machine learning algorithms
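
[Editor's sketch] The document-term matrix described above can be illustrated with a few lines of plain Python. This is not the Tidymodels code from the talk. The two toy documents, the whitespace tokenizer, and the helper names `tokenize` and `document_term_matrix` are assumptions made only for illustration.

```python
from collections import Counter

# Toy documents, invented for illustration only.
docs = [
    "the collared peccary is a species of mammal",
    "the peccary is found in north central and south america",
]


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real tokenizers do more)."""
    return text.lower().split()


# One column per distinct token, in a fixed (sorted) order.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})


def document_term_matrix(documents, vocabulary):
    """Return one row per document and one count per vocabulary term."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Counter returns 0 for terms a document never uses.
        rows.append([counts[term] for term in vocabulary])
    return rows


dtm = document_term_matrix(docs, vocab)
```

Each row of `dtm` is a document and each column is a token, weighted by raw counts. A different weighting such as TF-IDF would change the cell values but not the shape.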

79
00:04:50,960 --> 00:04:56,160
so I work on an open source framework in R

80
00:04:54,560 --> 00:05:01,120
for modeling and machine learning that's called Tidymodels and the examples that

81
00:04:58,880 --> 00:05:04,800
I'll be showing today use Tidymodels code

82
00:05:02,320 --> 00:05:10,639
some of the specific goals of the Tidymodels project are to provide

83
00:05:07,360 --> 00:05:13,120
a consistent flexible framework for real

84
00:05:10,639 --> 00:05:19,199
world modeling practitioners, people who

85
00:05:16,240 --> 00:05:21,360
are dealing with real world data

86
00:05:19,199 --> 00:05:22,800
those who are just starting out to

87
00:05:21,360 --> 00:05:27,120
those who are very experienced in modeling and

88
00:05:24,479 --> 00:05:32,240
the goal is to harmonize the heterogeneous

89
00:05:28,880 --> 00:05:37,360
interfaces that exist within R and to encourage good statistical practice

90
00:05:35,360 --> 00:05:41,360
I'm glad to get to show you some of what I work on

91
00:05:38,880 --> 00:05:43,840
and build and how we apply it to text

92
00:05:41,360 --> 00:05:47,120
modeling but a lot of what I will talk

93
00:05:43,840 --> 00:05:50,880
about today isn't very specific to Tidymodels

94
00:05:48,160 --> 00:05:52,800
or even to R. I know this is an R user

95
00:05:50,880 --> 00:05:57,440
group but what we're going to talk about and focus on

96
00:05:54,720 --> 00:06:00,479
is a little more conceptual and basic

97
00:05:57,440 --> 00:06:06,800
how do we transform text into predictors for machine learning

98
00:06:04,560 --> 00:06:11,280
I am excited though to talk about Tidymodels and Tidymodels if you have not

99
00:06:08,800 --> 00:06:13,440
used it before is a meta package

100
00:06:11,280 --> 00:06:16,319
in a similar way that the Tidyverse is

101
00:06:13,440 --> 00:06:19,720
a meta package so if you've ever typed

102
00:06:16,319 --> 00:06:23,199
library(tidyverse) and then you've used

103
00:06:19,720 --> 00:06:25,680
ggplot2 for visualization

104
00:06:23,199 --> 00:06:29,680
dplyr for data manipulation

105
00:06:25,680 --> 00:06:33,039
Tidymodels works in a similar way

106
00:06:30,560 --> 00:06:36,000
there are different packages inside of it

107
00:06:33,039 --> 00:06:39,840
that are used for different purposes

108
00:06:37,280 --> 00:06:43,440
so the pre-processing or the feature

109
00:06:39,840 --> 00:06:48,000
engineering is part of a broader model process

110
00:06:45,280 --> 00:06:51,120
you know, that process starts really

111
00:06:48,000 --> 00:06:54,080
with exploratory data analysis

112
00:06:52,000 --> 00:06:56,240
that helps us decide what kind of model

113
00:06:54,080 --> 00:06:59,280
we will build and then it comes to

114
00:06:56,240 --> 00:07:03,120
completion I think I would argue with

115
00:06:59,280 --> 00:07:06,479
model evaluation when you

116
00:07:03,120 --> 00:07:09,280
measure how well your model performed

117
00:07:06,479 --> 00:07:11,919
Tidymodels as a piece of software is

118
00:07:09,280 --> 00:07:14,479
made up of R packages each of which

119
00:07:11,919 --> 00:07:20,800
has a specific focus like rsample

120
00:07:17,199 --> 00:07:23,440
is for re-sampling data to be able to

121
00:07:20,800 --> 00:07:25,919
create bootstrap resamples

122
00:07:23,440 --> 00:07:27,680
cross-validation resamples all different

123
00:07:25,919 --> 00:07:31,919
kinds of resamples you might want to use to

124
00:07:29,520 --> 00:07:34,800
train and evaluate models

125
00:07:31,919 --> 00:07:38,560
the tune package is for hyper parameter

126
00:07:34,800 --> 00:07:40,800
tuning as you might guess from the name

127
00:07:38,560 --> 00:07:43,520
one of these packages is for feature

128
00:07:40,800 --> 00:07:45,680
engineering, for data preprocessing and

129
00:07:43,520 --> 00:07:47,120
feature engineering and it is the one

130
00:07:45,680 --> 00:07:52,720
that is called recipes

131
00:07:49,360 --> 00:07:55,280
so in Tidymodels we capture this idea

132
00:07:52,720 --> 00:07:57,199
of data pre-processing and feature

133
00:07:55,280 --> 00:08:01,199
engineering in the concept of a

134
00:07:57,199 --> 00:08:05,360
pre-processing recipe that has steps so you choose

135
00:08:03,360 --> 00:08:10,319
ingredients or variables

136
00:08:07,599 --> 00:08:13,680
that you're going to use then you define the steps

137
00:08:11,599 --> 00:08:16,080
that go into your recipe

138
00:08:13,680 --> 00:08:18,319
then you prepare them using training

139
00:08:16,080 --> 00:08:21,520
data and then you can apply that to any

140
00:08:18,319 --> 00:08:25,759
data set like testing data or new data at prediction time
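
[Editor's sketch] The prepare-then-apply idea can be shown without any modeling library. The class below is a hypothetical stand-in, not the recipes package API. `CountFeatures`, `prepare`, `apply`, `max_tokens`, and the short media descriptions are all invented for illustration. The point it tries to show is that the vocabulary is learned from training data only and then reused unchanged on new data.

```python
from collections import Counter


class CountFeatures:
    """Hypothetical preprocessing object: learn on training data, apply anywhere."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.vocabulary = None  # filled in by prepare()

    def prepare(self, training_texts):
        """Estimate the vocabulary from the training data only."""
        counts = Counter(
            tok for text in training_texts for tok in text.lower().split()
        )
        # Most frequent first; ties broken alphabetically so results are stable.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        self.vocabulary = [tok for tok, _ in ranked[: self.max_tokens]]
        return self

    def apply(self, texts):
        """Turn any texts into count features using the learned vocabulary."""
        if self.vocabulary is None:
            raise ValueError("call prepare() on training data first")
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            rows.append([counts[term] for term in self.vocabulary])
        return rows


training = ["oil paint on canvas", "graphite on paper", "watercolour on paper"]
testing = ["glitter on canvas"]

features = CountFeatures(max_tokens=3).prepare(training)
train_x = features.apply(training)
test_x = features.apply(testing)
```

The token "glitter" never appears in the training data, so it gets no column in `test_x`. Nothing about the testing data leaks into the learned vocabulary.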

141
00:08:23,039 --> 00:08:28,319
so the variables or ingredients that we

142
00:08:25,759 --> 00:08:30,080
use in modeling come in all kinds of

143
00:08:28,319 --> 00:08:34,560
shapes and sizes including text data

144
00:08:32,000 --> 00:08:37,919
so some of the techniques and approaches

145
00:08:34,560 --> 00:08:39,200
that we use for pre-processing text data

146
00:08:37,919 --> 00:08:44,480
are the same as for any other kind of data that

147
00:08:41,519 --> 00:08:46,640
you might use like non-text data

148
00:08:44,480 --> 00:08:48,480
numeric data, categorical data;

149
00:08:46,640 --> 00:08:51,040
some of it is the same

150
00:08:48,480 --> 00:08:53,040
but some of what you need to know to be

151
00:08:51,040 --> 00:08:58,080
able to do a good job

152
00:08:55,120 --> 00:09:00,800
in this process for text is different

153
00:08:58,080 --> 00:09:03,680
and is specific to the nature of what

154
00:09:00,800 --> 00:09:05,279
language data is like

155
00:09:03,680 --> 00:09:08,480
and so I've written a book with my

156
00:09:05,279 --> 00:09:11,360
co-author Emil Hvitfeldt on supervised

157
00:09:08,480 --> 00:09:14,720
machine learning for text analysis in R

158
00:09:11,360 --> 00:09:18,160
and fully the first third of the book

159
00:09:14,720 --> 00:09:20,800
focuses on how we transform

160
00:09:18,160 --> 00:09:23,600
the natural language that we have in

161
00:09:20,800 --> 00:09:25,920
text data into features for modeling

162
00:09:23,600 --> 00:09:28,800
the middle section is about how we use

163
00:09:25,920 --> 00:09:30,800
these features in

164
00:09:28,800 --> 00:09:33,360
simpler or more traditional machine

165
00:09:30,800 --> 00:09:35,839
learning models like regularized

166
00:09:33,360 --> 00:09:40,240
regression or support vector machines and

167
00:09:37,360 --> 00:09:42,800
then the last third of the book

168
00:09:40,240 --> 00:09:46,000
talks about how we use deep learning

169
00:09:42,800 --> 00:09:48,640
models with text data so deep learning

170
00:09:46,000 --> 00:09:51,279
models still require these kinds of

171
00:09:48,640 --> 00:09:54,480
transformations from natural language

172
00:09:51,279 --> 00:10:01,600
into features as input for these kinds of models but

173
00:09:58,880 --> 00:10:05,920
deep learning models are often able to

174
00:10:03,040 --> 00:10:07,920
inherently learn structure of features

175
00:10:05,920 --> 00:10:10,240
from text in ways that those

176
00:10:07,920 --> 00:10:12,800
more traditional or simpler machine

177
00:10:10,240 --> 00:10:15,279
learning models are not

178
00:10:12,800 --> 00:10:17,440
so this book is now complete and

179
00:10:15,279 --> 00:10:20,560
available as of this month, as of November

180
00:10:18,480 --> 00:10:23,600
folks are getting their first paper

181
00:10:20,560 --> 00:10:26,320
copies and also this book is available

182
00:10:23,600 --> 00:10:30,720
in its entirety at smltar.com

183
00:10:28,560 --> 00:10:33,519
so if you're new to dealing with text

184
00:10:30,720 --> 00:10:35,760
data understanding these

185
00:10:33,519 --> 00:10:38,720
fundamental pre-processing approaches

186
00:10:35,760 --> 00:10:41,760
for text will set you up for being able

187
00:10:38,720 --> 00:10:43,839
to train effective models

188
00:10:41,760 --> 00:10:45,200
if you're really experienced with text

189
00:10:43,839 --> 00:10:47,680
data if you've dealt with it a lot

190
00:10:45,200 --> 00:10:51,360
already you've probably noticed like we have

191
00:10:48,959 --> 00:10:58,000
that the existing you know resources or literature whether

192
00:10:54,480 --> 00:11:01,680
that's books or tutorials or blog posts

193
00:10:58,000 --> 00:11:05,279
is quite sparse when it comes to

194
00:11:02,880 --> 00:11:07,600
detailed thoughtful explorations of how

195
00:11:05,279 --> 00:11:11,120
these pre-processing steps work

196
00:11:07,600 --> 00:11:14,000
and how choices made in these feature

197
00:11:11,120 --> 00:11:17,519
engineering steps impact our model output

198
00:11:15,600 --> 00:11:20,000
so let's walk through

199
00:11:17,519 --> 00:11:23,360
some of these basic

200
00:11:21,360 --> 00:11:25,440
feature engineering approaches and how

201
00:11:23,360 --> 00:11:28,800
they work and what they do let's start out with

202
00:11:27,040 --> 00:11:31,920
tokenization

203
00:11:28,800 --> 00:11:33,839
so typically one of the first steps in

204
00:11:31,920 --> 00:11:36,800
the transformation from natural

205
00:11:33,839 --> 00:11:40,480
language to machine learning feature for

206
00:11:38,320 --> 00:11:44,480
really any kind of text analysis

207
00:11:40,480 --> 00:11:47,040
including exploratory data analysis

208
00:11:44,480 --> 00:11:52,320
or building a model, anything, is tokenization

209
00:11:49,040 --> 00:11:55,360
in tokenization we take an input some

210
00:11:52,320 --> 00:11:57,120
string some character vector and some

211
00:11:55,360 --> 00:11:59,440
kind of token type

212
00:11:57,120 --> 00:12:02,079
some meaningful unit of text

213
00:11:59,440 --> 00:12:05,600
we're interested in, like a word

214
00:12:02,079 --> 00:12:07,680
and we split the input into pieces, into

215
00:12:05,600 --> 00:12:09,519
tokens that correspond to the type

216
00:12:07,680 --> 00:12:12,480
we're interested in

217
00:12:09,519 --> 00:12:15,120
so most commonly the meaningful unit or

218
00:12:12,480 --> 00:12:17,839
type of token that we want to split text

219
00:12:15,120 --> 00:12:19,040
into units of is a word

220
00:12:17,839 --> 00:12:21,440
so this might seem

221
00:12:19,040 --> 00:12:24,480
straightforward or obvious but it turns

222
00:12:21,440 --> 00:12:28,160
out it's difficult to clearly define

223
00:12:24,480 --> 00:12:30,880
what a word is for many or even most languages

224
00:12:29,519 --> 00:12:35,839
so many languages do not use white space

225
00:12:34,000 --> 00:12:38,240
between words at all

226
00:12:35,839 --> 00:12:40,959
which presents a challenge

227
00:12:38,240 --> 00:12:44,480
for tokenization even languages that

228
00:12:40,959 --> 00:12:48,160
do use white space like English and Korean

229
00:12:45,519 --> 00:12:53,120
often have particular examples that are ambiguous

230
00:12:49,839 --> 00:12:55,040
like contractions in english like didn't

231
00:12:53,120 --> 00:12:56,800
which should be

232
00:12:55,040 --> 00:12:59,200
you know maybe more accurately

233
00:12:56,800 --> 00:13:01,519
considered two words the way

234
00:12:59,200 --> 00:13:04,079
particles are used in Korean

235
00:13:01,519 --> 00:13:06,240
and how pronouns and negation words are

236
00:13:04,079 --> 00:13:08,000
written in romance languages like

237
00:13:06,240 --> 00:13:09,920
italian and french where they're stuck

238
00:13:08,000 --> 00:13:12,399
together and really maybe they should be

239
00:13:09,920 --> 00:13:13,839
considered two words

240
00:13:12,399 --> 00:13:15,519
once you have figured out what you're

241
00:13:13,839 --> 00:13:17,839
going to do and you make some choices

242
00:13:15,519 --> 00:13:20,480
and you tokenize your text then it's on

243
00:13:17,839 --> 00:13:23,839
its way to being able to be used

244
00:13:20,480 --> 00:13:26,880
in exploratory data analysis or

245
00:13:23,839 --> 00:13:28,399
unsupervised algorithms or as features

246
00:13:26,880 --> 00:13:30,320
for predictive modeling which is what

247
00:13:28,399 --> 00:13:32,720
we're talking about here and what these

248
00:13:30,320 --> 00:13:35,200
results show here so these results are

249
00:13:32,720 --> 00:13:38,000
from a regression model trained on

250
00:13:35,200 --> 00:13:40,639
descriptions of media

251
00:13:38,000 --> 00:13:45,199
from artwork in the Tate collection in the UK so

252
00:13:43,279 --> 00:13:47,680
what we are

253
00:13:45,199 --> 00:13:52,320
predicting is when, in what year,

254
00:13:48,880 --> 00:13:56,320
was a piece of art created based on the

255
00:13:53,120 --> 00:13:58,560
medium that the artwork was created

256
00:13:56,320 --> 00:14:00,079
with and the medium is described with a

257
00:13:58,560 --> 00:14:02,079
little bit of text

258
00:14:00,079 --> 00:14:07,440
so we see here that artwork created using graphite

259
00:14:04,079 --> 00:14:09,279
watercolor and engraving was more likely

260
00:14:07,440 --> 00:14:10,959
to be created earlier

261
00:14:09,279 --> 00:14:13,360
that is more likely to come

262
00:14:10,959 --> 00:14:17,839
from older art and artwork that is created using

263
00:14:15,360 --> 00:14:22,880
photography, screen print,

264
00:14:20,639 --> 00:14:27,760
screen printing, and dung

265
00:14:24,720 --> 00:14:29,920
and glitter are more likely to be

266
00:14:27,760 --> 00:14:32,560
created later; this is more likely

267
00:14:29,920 --> 00:14:37,519
to come from contemporary art modern art

268
00:14:33,440 --> 00:14:41,519
so the way that we tokenize this text

269
00:14:39,440 --> 00:14:43,839
you know we started with natural human

270
00:14:41,519 --> 00:14:47,519
generated texts of people writing out

271
00:14:43,839 --> 00:14:49,760
the descriptions of the media

272
00:14:47,519 --> 00:14:53,120
that these pieces of art were created with

273
00:14:50,959 --> 00:14:55,440
and the way we tokenized that natural

274
00:14:53,120 --> 00:14:57,760
human generated text that we started

275
00:14:55,440 --> 00:14:59,839
with has a big impact on what we learned

276
00:14:57,760 --> 00:15:01,839
from it if we tokenized in a different way

277
00:15:00,720 --> 00:15:04,880
we would have gotten different

278
00:15:01,839 --> 00:15:07,760
results in terms of performance like how

279
00:15:05,839 --> 00:15:10,320
accurately we were able to predict

280
00:15:07,760 --> 00:15:12,240
the year and also in terms of

281
00:15:10,320 --> 00:15:15,279
how we interpret the model like what is

282
00:15:12,240 --> 00:15:17,279
it that we're able to learn from it

283
00:15:15,279 --> 00:15:19,760
so this is one kind of tokenization to

284
00:15:17,279 --> 00:15:21,760
the single word but there is also

285
00:15:19,760 --> 00:15:24,000
another way to tokenize: instead

286
00:15:21,760 --> 00:15:27,440
of breaking up into single words or

287
00:15:24,000 --> 00:15:31,279
unigrams we can tokenize to n-grams

288
00:15:27,440 --> 00:15:35,199
so an n-gram is a contiguous sequence of

289
00:15:31,279 --> 00:15:37,920
n items from a given sequence of text

290
00:15:35,199 --> 00:15:41,680
so this shows that same little

291
00:15:37,920 --> 00:15:44,880
bit of text describing this animal

292
00:15:41,680 --> 00:15:47,440
divided up into bi-grams or n-grams of

293
00:15:44,880 --> 00:15:49,360
two tokens so notice how the words in

294
00:15:47,440 --> 00:15:54,959
the bi-grams overlap so the word collared

295
00:15:51,519 --> 00:15:58,480
appears in both of the first bigrams the collared

296
00:15:56,000 --> 00:16:00,800
collared peccary peccary also

297
00:15:58,480 --> 00:16:03,920
referred to so n-gram

298
00:16:00,800 --> 00:16:07,040
tokenization slides along the text to

299
00:16:03,920 --> 00:16:11,120
create overlapping sets of tokens
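
[Editor's sketch] The sliding-window behavior of n-gram tokenization can be written in a few lines of Python. The helper name `ngrams` and the whitespace split are assumptions for illustration. The example phrase is the one read out in the talk.

```python
def ngrams(tokens, n):
    """Slide a window of n tokens along the sequence, one step at a time."""
    # If n is larger than the number of tokens, the range is empty.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the collared peccary also referred to".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `bigrams` starts with "the collared" and "collared peccary", so "collared" appears in both of the first two bigrams, which is the overlap pointed out on the slide.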

300
00:16:07,040 --> 00:16:14,480
this shows tri-grams for the same thing

301
00:16:11,120 --> 00:16:17,680
so using uni-grams one word is

302
00:16:14,480 --> 00:16:20,560
faster and more efficient but we don't

303
00:16:17,680 --> 00:16:23,360
capture information about word order

304
00:16:20,560 --> 00:16:25,839
using a higher value

305
00:16:23,360 --> 00:16:30,320
two or three or even more keeps

306
00:16:28,160 --> 00:16:32,240
more complex information about word

307
00:16:30,320 --> 00:16:36,160
order and concepts

308
00:16:33,120 --> 00:16:41,440
that are described in multi-word phrases

309
00:16:37,680 --> 00:16:43,839
but the vector space of tokens

310
00:16:41,440 --> 00:16:47,199
increases dramatically

311
00:16:43,839 --> 00:16:50,480
that corresponds to a reduction in token

312
00:16:47,199 --> 00:16:52,880
counts: we don't count each token very

313
00:16:50,480 --> 00:16:55,120
many times and that means depending on

314
00:16:52,880 --> 00:16:56,480
your particular data set

315
00:16:55,120 --> 00:17:00,399
you might not be able to get good results

316
00:16:58,160 --> 00:17:03,519
so combining different degrees of

317
00:17:00,399 --> 00:17:06,400
n-grams can allow you to extract

318
00:17:03,519 --> 00:17:09,679
different levels of detail from text so uni-grams

319
00:17:07,760 --> 00:17:11,439
can tell you which individual words have

320
00:17:09,679 --> 00:17:14,079
been used a lot of times

321
00:17:11,439 --> 00:17:18,319
some of those words might be overlooked

322
00:17:15,760 --> 00:17:19,919
in bi-gram or tri-gram counts if they

323
00:17:18,319 --> 00:17:23,360
don't co-appear with other words as often

324
00:17:23,520 --> 00:17:28,960
this plot compares model performance for

325
00:17:26,720 --> 00:17:32,080
a Lasso regression model predicting the

326
00:17:28,960 --> 00:17:34,640
year of supreme court opinions the

327
00:17:32,080 --> 00:17:37,840
United States supreme court opinions

328
00:17:34,640 --> 00:17:39,679
with three different degrees of n-grams

329
00:17:37,840 --> 00:17:42,880
what we're doing here is we are taking

330
00:17:39,679 --> 00:17:44,240
the text of the writings of the United

331
00:17:42,880 --> 00:17:49,760
States supreme court and we're predicting

332
00:17:47,120 --> 00:17:51,840
when was that text

333
00:17:49,760 --> 00:17:54,080
written so can we predict how old a

334
00:17:51,840 --> 00:17:58,799
piece of text is from the contents of the text

335
00:17:55,360 --> 00:18:01,840
so holding the number of tokens

336
00:17:58,799 --> 00:18:08,080
constant at a thousand using uni-grams alone

337
00:18:04,480 --> 00:18:10,160
performs best for this corpus of

338
00:18:08,080 --> 00:18:13,600
opinions from the United States supreme court

339
00:18:11,280 --> 00:18:16,240
this is not always the case depending on

340
00:18:13,600 --> 00:18:18,799
the kind of model you use the data set

341
00:18:16,240 --> 00:18:21,679
itself we might see the best performance

342
00:18:18,799 --> 00:18:22,640
combining uni-grams and bi-grams or maybe

343
00:18:21,679 --> 00:18:25,760
some other option

344
00:18:23,679 --> 00:18:28,720
in this case if we wanted to incorporate

345
00:18:25,760 --> 00:18:30,480
some of that more complex information

346
00:18:28,720 --> 00:18:34,000
that we have in the bi-grams and the

347
00:18:30,480 --> 00:18:36,799
tri-grams we probably would need to

348
00:18:34,000 --> 00:18:40,160
increase the number of

349
00:18:36,799 --> 00:18:43,039
tokens in the model quite a bit

350
00:18:40,160 --> 00:18:46,000
so keep in mind when you look at results

351
00:18:43,039 --> 00:18:48,960
like these that identifying n-grams is

352
00:18:46,000 --> 00:18:51,039
computationally expensive

353
00:18:48,960 --> 00:18:54,640
this is especially compared to the

354
00:18:51,760 --> 00:18:57,600
amount of improvement

355
00:18:54,640 --> 00:19:00,640
in model performance that we often see like if we

356
00:18:58,720 --> 00:19:03,600
see some, you know, modest

357
00:19:00,640 --> 00:19:05,760
improvement by adding in bigrams it's

358
00:19:03,600 --> 00:19:08,640
important to keep in mind how much

359
00:19:05,760 --> 00:19:11,039
improvement we see relative to

360
00:19:08,640 --> 00:19:13,200
how long it takes to

361
00:19:11,039 --> 00:19:15,840
identify bi-grams and then train that

362
00:19:13,200 --> 00:19:18,480
model so for example for this data set

363
00:19:15,840 --> 00:19:20,799
of supreme court opinions where we held

364
00:19:18,480 --> 00:19:24,480
the number of tokens constant so the

365
00:19:20,799 --> 00:19:27,120
model training had the same number of tokens in it

366
00:19:25,919 --> 00:19:35,200
using bi-grams plus uni-grams takes twice as long to train

367
00:19:33,280 --> 00:19:38,240
to do the feature engineering and the

368
00:19:35,200 --> 00:19:42,000
training than only uni-grams and adding

369
00:19:38,240 --> 00:19:44,320
in tri-grams as well takes almost five

370
00:19:42,000 --> 00:19:46,799
times as long as training on uni-grams

371
00:19:44,320 --> 00:19:50,640
alone so this is a computationally

372
00:19:46,799 --> 00:19:50,640
expensive thing to do

373
00:19:51,760 --> 00:19:56,559
going in the other direction

374
00:19:53,760 --> 00:19:58,960
we can tokenize to units smaller than

375
00:19:56,559 --> 00:20:00,880
words so like these are what are called

376
00:19:58,960 --> 00:20:05,440
character shingles so we take words

377
00:20:03,280 --> 00:20:08,480
the collared peccary

378
00:20:05,440 --> 00:20:10,880
and we can instead of looking at words

379
00:20:08,480 --> 00:20:13,120
we can go down and look at sub word

380
00:20:10,880 --> 00:20:16,000
information there's multiple different

381
00:20:13,120 --> 00:20:17,440
ways to break words up into sub words

382
00:20:16,000 --> 00:20:20,880
that are appropriate for machine learning

383
00:20:18,480 --> 00:20:23,440
and often these kinds of approaches

384
00:20:20,880 --> 00:20:25,200
or algorithms have the benefit of being

385
00:20:23,440 --> 00:20:31,360
able to encode unknown or new words

386
00:20:29,039 --> 00:20:33,760
at prediction time so when it when it's

387
00:20:31,360 --> 00:20:36,400
time to make a prediction on new data

388
00:20:33,760 --> 00:20:39,200
it's not uncommon for there to

389
00:20:36,400 --> 00:20:41,200
be new vocabulary words at that time and

390
00:20:39,200 --> 00:20:42,640
if we didn't see them in the training

391
00:20:41,200 --> 00:20:45,600
data you know what are we going to do

392
00:20:42,640 --> 00:20:49,360
about those new words when we train

393
00:20:45,600 --> 00:20:52,000
using subword information often we can

394
00:20:49,360 --> 00:20:54,960
handle those new words if we saw the subword

395
00:20:52,880 --> 00:20:58,400
in our training data set so

396
00:20:55,760 --> 00:21:01,440
using this kind of subword information

397
00:20:58,400 --> 00:21:04,720
is a way to incorporate morphological

398
00:21:01,440 --> 00:21:06,240
sequences into our models;

399
00:21:04,720 --> 00:21:09,280
this is something that applies to

400
00:21:08,080 --> 00:21:13,360
various languages, not just English
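
[Editor's sketch] One simple way to get subword tokens is to slide a fixed-width window over the characters of a word. The Python below assumes shingles of three characters. The helper names `character_shingles` and `encode` are hypothetical, and the talk does not say which shingle width was used. It also sketches why subwords help with vocabulary that was never seen in training.

```python
def character_shingles(word, n=3):
    """Return every overlapping run of n characters in the word."""
    word = word.lower()
    if len(word) < n:
        return [word]  # too short to slide a window over
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Subwords seen at training time (toy training vocabulary).
training_words = ["collared", "peccary"]
known = {s for w in training_words for s in character_shingles(w)}


def encode(word, known_subwords, n=3):
    """Keep only the subwords of a word that were seen during training."""
    return [s for s in character_shingles(word, n) if s in known_subwords]


# "collar" was never a training word, but all of its shingles were seen.
new_word_features = encode("collar", known)
# "zebra" shares no shingles with the training words.
unknown_features = encode("zebra", known)
```

A word-level vocabulary would have nothing to say about "collar" at prediction time, while the subword features still carry signal. A word like "zebra" remains unrepresented either way.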

401
00:21:11,919 --> 00:21:16,240
so these results are for a

402
00:21:13,360 --> 00:21:20,000
classification model with a data set of

403
00:21:16,240 --> 00:21:22,960
very short texts it's just the names of

404
00:21:20,000 --> 00:21:27,120
post offices in the United States so super short

405
00:21:24,240 --> 00:21:28,799
and the goal of the model was to predict

406
00:21:27,120 --> 00:21:35,280
whether the post office is located in Hawaii,

407
00:21:32,880 --> 00:21:37,280
in the middle of the Pacific Ocean, or

408
00:21:35,280 --> 00:21:41,360
located in the rest of the United States

409
00:21:38,159 --> 00:21:46,480
so I created features for the model that are

410
00:21:42,159 --> 00:21:49,200
subwords of these post office names

411
00:21:46,480 --> 00:21:53,120
and we end up learning that the names

412
00:21:49,200 --> 00:21:57,440
that start with h and p or contain that ale

413
00:21:54,240 --> 00:22:00,240
sub word are more likely to be in hawaii

414
00:21:57,440 --> 00:22:06,480
and the sub words a and d and ri and ing are

415
00:22:04,559 --> 00:22:09,360
more likely to come from the post offices

416
00:22:06,480 --> 00:22:11,440
that are outside of hawaii

417
00:22:09,360 --> 00:22:14,480
so this is an example of how we

418
00:22:11,440 --> 00:22:17,280
tokenized differently and we're able to

419
00:22:14,480 --> 00:22:19,600
learn something new we're able to learn something

420
00:22:18,559 --> 00:22:24,320
different so in Tidymodels we collect all these

421
00:22:21,840 --> 00:22:26,880
kinds of decisions about tokenization

422
00:22:24,320 --> 00:22:29,919
in code that looks like this

423
00:22:26,880 --> 00:22:33,760
so we start with a recipe that specifies what

424
00:22:30,799 --> 00:22:35,679
variables or ingredients that we'll use

425
00:22:33,760 --> 00:22:38,640
and then we define these preprocessing

426
00:22:35,679 --> 00:22:41,600
steps so even at this first

427
00:22:38,640 --> 00:22:44,159
and arguably you know simple and basic

428
00:22:41,600 --> 00:22:47,120
step the choices that we make affect our

429
00:22:44,159 --> 00:22:50,400
modeling results in a big way

430
00:22:47,120 --> 00:22:52,080
the next pre-processing step that

431
00:22:50,400 --> 00:22:56,240
I want to talk about is stop words

432
00:22:53,600 --> 00:23:00,320
so once we have split text

433
00:22:56,240 --> 00:23:03,360
into tokens we often find that not

434
00:23:00,320 --> 00:23:06,720
all words carry the same amount of

435
00:23:03,360 --> 00:23:09,520
information if maybe any information at

436
00:23:06,720 --> 00:23:11,840
all actually for a machine learning task

437
00:23:09,520 --> 00:23:13,919
so common words that carry little or

438
00:23:11,840 --> 00:23:16,080
perhaps no meaningful information are

439
00:23:13,919 --> 00:23:18,880
called stopwords

440
00:23:16,080 --> 00:23:20,400
so this is one of the stopword lists

441
00:23:18,880 --> 00:23:22,720
that's available for

442
00:23:20,400 --> 00:23:28,000
Korean so it's common advice and practice

443
00:23:24,960 --> 00:23:31,200
to say hey just

444
00:23:28,000 --> 00:23:35,039
remove these stopwords for a lot of

445
00:23:31,919 --> 00:23:36,640
natural language processing tasks

446
00:23:35,039 --> 00:23:39,600
what I'm showing here

447
00:23:36,640 --> 00:23:41,919
is the entirety of one of the shorter

448
00:23:39,600 --> 00:23:44,320
english stopword lists that's used

449
00:23:41,919 --> 00:23:52,080
really broadly so you know it's words like I

450
00:23:47,200 --> 00:23:54,400
me my pronouns conjunctions and of the

451
00:23:52,080 --> 00:23:57,679
and these are very common words

452
00:23:54,400 --> 00:24:00,400
that are not considered super important

453
00:23:57,679 --> 00:24:04,400
the decision though to just remove

454
00:24:00,400 --> 00:24:07,279
stopwords is often more involved and

455
00:24:04,400 --> 00:24:08,960
perhaps more fraught than what you'll

456
00:24:07,279 --> 00:24:12,000
find reflected in a lot

457
00:24:08,960 --> 00:24:15,840
of resources that are out there

458
00:24:12,000 --> 00:24:18,880
so almost all the time real world NLP

459
00:24:15,840 --> 00:24:22,400
practitioners use pre-made stopword lists

460
00:24:20,000 --> 00:24:26,400
so this plot visualizes

461
00:24:22,400 --> 00:24:28,720
set intersections for three common

462
00:24:26,400 --> 00:24:31,200
stopword lists in english

463
00:24:28,720 --> 00:24:32,960
in what is called an upset plot

464
00:24:31,200 --> 00:24:37,440
so the three lists are called the

465
00:24:32,960 --> 00:24:40,159
Snowball list the SMART list and the ISO list

466
00:24:37,440 --> 00:24:41,840
so you can see the lengths of the

467
00:24:40,159 --> 00:24:43,679
lists are represented by the length of

468
00:24:41,840 --> 00:24:46,400
the bars and then we see the

469
00:24:43,679 --> 00:24:48,799
intersections which words are in common

470
00:24:46,400 --> 00:24:51,760
on these lists by the vertical

471
00:24:48,799 --> 00:24:55,760
bars so the lengths of the lists are quite different

472
00:24:52,960 --> 00:24:58,320
and also notice they don't all contain

473
00:24:55,760 --> 00:25:00,720
the same sets of words

474
00:24:58,320 --> 00:25:03,120
the important thing to remember about

475
00:25:00,720 --> 00:25:06,880
stopword lexicons is that

476
00:25:04,240 --> 00:25:10,000
they are not created in some

477
00:25:06,880 --> 00:25:16,080
neutral perfect setting but instead

478
00:25:12,799 --> 00:25:20,480
they are context specific

479
00:25:16,960 --> 00:25:23,600
they can be biased both of these

480
00:25:20,480 --> 00:25:27,039
things are true because they are lists

481
00:25:23,600 --> 00:25:28,559
created from large data sets of language

482
00:25:27,039 --> 00:25:32,159
so they reflect the characteristics of the data used in

483
00:25:30,640 --> 00:25:37,440
their creation so these are

484
00:25:34,000 --> 00:25:40,559
the ten words that are in the english

485
00:25:37,440 --> 00:25:43,039
language smart lexicon but not in the

486
00:25:40,559 --> 00:25:45,600
English snowball lexicon

487
00:25:43,039 --> 00:25:47,520
so notice that they're all contractions

488
00:25:45,600 --> 00:25:50,000
but that's not because the snowball

489
00:25:47,520 --> 00:25:52,240
lexicon doesn't include contractions

490
00:25:50,000 --> 00:25:56,960
it has a lot of them also notice

491
00:25:55,440 --> 00:26:01,360
that she's is on this list and so that means that

492
00:25:58,960 --> 00:26:03,360
that list has he's but it does not have

493
00:26:01,360 --> 00:26:05,600
she's

494
00:26:03,360 --> 00:26:07,679
so this is an example of that

495
00:26:05,600 --> 00:26:11,440
bias I mentioned that occurs because

496
00:26:07,679 --> 00:26:13,520
these lists are created from large data

497
00:26:11,440 --> 00:26:19,760
sets of text lexicon creators look at the most

498
00:26:16,400 --> 00:26:22,480
frequent words in some big corpus of

499
00:26:19,760 --> 00:26:24,720
language they make a cutoff

500
00:26:22,480 --> 00:26:28,400
and then some decisions about what to include or

501
00:26:26,080 --> 00:26:29,919
exclude you know

502
00:26:28,400 --> 00:26:33,840
based on the list that they

503
00:26:29,919 --> 00:26:36,799
have and you end up here so because

504
00:26:33,840 --> 00:26:39,760
in many large data sets of language

505
00:26:36,799 --> 00:26:45,679
you have more representation of

506
00:26:42,720 --> 00:26:47,440
men you end up with a

507
00:26:45,679 --> 00:26:51,840
situation like this where a stopword

508
00:26:47,440 --> 00:26:54,799
list will have he's but not she's

509
00:26:51,840 --> 00:26:57,840
so many decisions when it comes to

510
00:26:54,799 --> 00:27:01,120
modeling or analysis with language

511
00:26:57,840 --> 00:27:03,919
we as practitioners have to decide

512
00:27:01,120 --> 00:27:07,679
what is appropriate for our particular domain

513
00:27:05,279 --> 00:27:10,720
it turns out this is even true when it

514
00:27:07,679 --> 00:27:12,960
comes to picking a stopword list

515
00:27:10,720 --> 00:27:14,720
so in Tidymodels we can implement a

516
00:27:12,960 --> 00:27:19,200
pre-processing step like removing stopwords

517
00:27:17,120 --> 00:27:21,760
by adding an additional step to our

518
00:27:19,200 --> 00:27:24,720
recipe so first we specified what

519
00:27:21,760 --> 00:27:28,000
variables we would use then we tokenized

520
00:27:24,720 --> 00:27:30,720
the text and now we are removing

521
00:27:28,000 --> 00:27:32,640
stopwords here using just the default

522
00:27:30,720 --> 00:27:35,360
step since we are not passing in any

523
00:27:32,640 --> 00:27:37,600
other arguments we could though

524
00:27:35,360 --> 00:27:39,919
use a non-default step or even a custom

525
00:27:37,600 --> 00:27:42,399
list if that was most appropriate to our domain
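
A minimal sketch of stop word removal with interchangeable lists, in plain Python rather than the recipe step shown on the slide. The function `remove_stopwords` and both word lists are made up for illustration; they are not the real Snowball, SMART, or ISO lists.

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word collection."""
    stop = set(stopwords)
    return [tok for tok in tokens if tok not in stop]


# Two tiny, hypothetical stop word lists of different lengths.
short_list = {"the", "of", "and", "a"}
longer_list = short_list | {"he's", "is", "in"}

tokens = "he's the keeper of the otters and she's in the office".split()

print(remove_stopwords(tokens, short_list))
# ["he's", 'keeper', 'otters', "she's", 'in', 'office']
print(remove_stopwords(tokens, longer_list))
# ['keeper', 'otters', "she's", 'office']
```

The second list removes he's but keeps she's, so the two pronouns end up treated differently as features purely because of which list was chosen.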

526
00:27:42,880 --> 00:27:48,960
this plot compares the model performance

527
00:27:46,799 --> 00:27:52,399
for predicting the year of

528
00:27:50,159 --> 00:27:55,039
that same data set of

529
00:27:52,399 --> 00:27:58,240
Supreme Court opinions with three

530
00:27:55,039 --> 00:28:02,080
different stopword lexicons of different lengths

531
00:27:59,279 --> 00:28:04,480
so the snowball lexicon contains the

532
00:28:02,080 --> 00:28:06,960
smallest number of words and in this

533
00:28:04,480 --> 00:28:10,320
case it results in the best performance

534
00:28:06,960 --> 00:28:12,799
so removing fewer stopwords results in

535
00:28:10,320 --> 00:28:16,399
the best performance here so this

536
00:28:12,799 --> 00:28:19,440
specific result is not generalizable to

537
00:28:16,399 --> 00:28:22,080
all data sets and contexts but the fact

538
00:28:19,440 --> 00:28:23,919
that removing different sets of

539
00:28:22,080 --> 00:28:26,799
stopwords can have noticeably different

540
00:28:23,919 --> 00:28:29,919
effects on your model that is quite

541
00:28:26,799 --> 00:28:33,200
transferable so the only way to know

542
00:28:29,919 --> 00:28:35,360
what is the best thing to do is to try

543
00:28:33,200 --> 00:28:37,600
several options and see so machine

544
00:28:35,360 --> 00:28:41,360
learning in general

545
00:28:39,279 --> 00:28:42,480
this is an empirical field right like we

546
00:28:41,360 --> 00:28:47,440
don't know we don't often have reasons a priori to

547
00:28:45,679 --> 00:28:49,919
know what will be the best thing to do

548
00:28:47,440 --> 00:28:52,240
and so typically we have to

549
00:28:49,919 --> 00:28:54,960
try different options to see what will

550
00:28:52,240 --> 00:29:01,039
be the best thing all right then the third

551
00:28:58,799 --> 00:29:04,000
pre-processing step that I want to talk about

552
00:29:02,080 --> 00:29:07,760
for text is stemming

553
00:29:05,279 --> 00:29:10,640
so when we deal with text often

554
00:29:07,760 --> 00:29:14,399
documents contain different versions of

555
00:29:10,640 --> 00:29:18,559
one base word often called a stem so what

556
00:29:15,440 --> 00:29:20,000
if say for an english example

557
00:29:18,559 --> 00:29:22,000
if we aren't interested in the difference

558
00:29:20,000 --> 00:29:26,480
between animals plural and animal

559
00:29:24,480 --> 00:29:31,279
singular and we want to treat them both together

560
00:29:27,919 --> 00:29:33,200
so that idea is at the heart of stemming

561
00:29:31,279 --> 00:29:37,279
so there's no one

562
00:29:33,200 --> 00:29:40,320
right way or correct way to stem text so

563
00:29:37,279 --> 00:29:42,480
this plot shows three approaches for

564
00:29:40,320 --> 00:29:44,880
stemming in English

565
00:29:42,480 --> 00:29:49,039
starting from hey let's just remove a final s

566
00:29:46,240 --> 00:29:51,760
to more complex rules about

567
00:29:49,039 --> 00:29:54,480
handling plural endings that middle

568
00:29:51,760 --> 00:29:57,520
one is called the S-stemmer

569
00:29:54,480 --> 00:30:01,360
it's like a little set of rules

570
00:29:58,640 --> 00:30:02,640
and that last one is probably

571
00:30:01,360 --> 00:30:04,960
the best-known

572
00:30:02,640 --> 00:30:07,279
implementation of stemming in English

573
00:30:04,960 --> 00:30:09,919
called the Porter algorithm
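
A minimal sketch contrasting the two simpler approaches mentioned above, in plain Python. The S-stemmer rules follow one common reading of the published rule set (the first matching rule wins, with a few listed exceptions); treat the exact rules as an assumption, not as the code used to make the plot. The Porter algorithm is much longer and is left out here.

```python
def remove_final_s(word):
    """Naive approach: drop one trailing 's' if there is one."""
    return word[:-1] if word.endswith("s") else word


def s_stemmer(word):
    """Apply the first matching plural rule; otherwise return the word as is."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word


for word in ["animals", "species", "lives", "grass"]:
    print(word, remove_final_s(word), s_stemmer(word))
```

Even these two small rule sets disagree on words like species and grass, which is the kind of difference the plot shows for the top words.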

574
00:30:07,279 --> 00:30:12,559
so you can see here that Porter stemming

575
00:30:09,919 --> 00:30:14,880
is the most different from the other two

576
00:30:12,559 --> 00:30:16,799
in the top 20 words here from the data

577
00:30:14,880 --> 00:30:19,840
set of animal descriptions that I've

578
00:30:16,799 --> 00:30:21,520
been using we see how the word species

579
00:30:19,840 --> 00:30:25,600
was treated differently animal predator

580
00:30:24,240 --> 00:30:31,440
this sort of collection of words

581
00:30:28,240 --> 00:30:33,200
live living life lives that was treated

582
00:30:31,440 --> 00:30:38,960
differently so practitioners are typically

583
00:30:36,240 --> 00:30:40,559
interested in stemming text data because

584
00:30:38,960 --> 00:30:46,960
it buckets tokens together that we believe

585
00:30:43,679 --> 00:30:51,600
belong together in a way that

586
00:30:46,960 --> 00:30:53,360
we understand that as human users

587
00:30:51,600 --> 00:30:58,320
of language so we can use approaches like this

588
00:30:56,799 --> 00:31:04,159
which are pretty like step-by-step

589
00:31:01,519 --> 00:31:07,039
rules based this is typically called

590
00:31:04,159 --> 00:31:09,600
stemming and it's fairly

591
00:31:07,039 --> 00:31:12,080
algorithmic in nature like

592
00:31:09,600 --> 00:31:14,720
first do this then do this then do this

593
00:31:12,080 --> 00:31:17,840
or you can use lemmatization

594
00:31:15,519 --> 00:31:23,519
which is usually based on large dictionaries

595
00:31:19,440 --> 00:31:26,559
of words and it incorporates like a

596
00:31:23,519 --> 00:31:31,120
linguistic understanding of what words belong together

597
00:31:28,559 --> 00:31:34,960
so most of the existing approaches for

598
00:31:31,120 --> 00:31:38,559
this kind of task in Korean are

599
00:31:35,919 --> 00:31:41,279
limited lemmatizers based on these

600
00:31:38,559 --> 00:31:45,679
dictionaries and that are trained

601
00:31:42,480 --> 00:31:47,600
using large data sets of language

602
00:31:45,679 --> 00:31:50,559
so this seems like it's going to be a helpful thing to do

603
00:31:49,039 --> 00:31:53,360
when you hear about this you're like

604
00:31:50,559 --> 00:31:54,880
oh yeah sounds good, sounds smart

605
00:31:53,360 --> 00:32:00,480
especially because with text data we are typically

606
00:31:56,919 --> 00:32:04,080
overwhelmed with features with numbers of tokens

607
00:32:02,320 --> 00:32:05,840
this is typically the situation

608
00:32:04,080 --> 00:32:09,440
when we're dealing with text data

609
00:32:05,840 --> 00:32:16,080
so here we have these animal description data

610
00:32:12,080 --> 00:32:18,240
and I made a matrix representation of it

611
00:32:16,080 --> 00:32:20,159
like we would typically use in some

612
00:32:18,240 --> 00:32:24,000
machine learning algorithm

613
00:32:20,159 --> 00:32:28,880
and look how many features there are

614
00:32:25,200 --> 00:32:31,600
16,000 almost 17,000 features

615
00:32:29,679 --> 00:32:33,519
that's the number of features that

616
00:32:31,600 --> 00:32:36,640
would be going into the model

617
00:32:33,519 --> 00:32:39,919
look at the sparsity

618
00:32:36,640 --> 00:32:43,440
98 percent sparse that's high very

619
00:32:39,919 --> 00:32:45,360
sparse data so this is the sparsity of

620
00:32:43,440 --> 00:32:47,760
the data that will go into the machine

621
00:32:45,360 --> 00:32:51,600
learning algorithm to build our

622
00:32:47,760 --> 00:32:53,760
supervised machine learning model
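
A minimal sketch of how sparsity is measured, in plain Python with a dense list-of-lists matrix for readability. The three short documents and the helper names are made up for illustration; the animal description data from the talk is far larger.

```python
def document_term_matrix(docs):
    """Build a count matrix with one row per document and one column per unique token."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in doc.split():
            matrix[i][index[tok]] += 1
    return vocab, matrix


def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells


docs = ["the otter eats fish", "the fox eats mice", "owls hunt at night"]
vocab, matrix = document_term_matrix(docs)
print(len(vocab), sparsity(matrix))  # 10 0.6
```

Every new document tends to bring new unique words, and each new column is zero for almost every other document, which is why real text data ends up so sparse.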

623
00:32:51,600 --> 00:32:56,480
if we stem the words

624
00:32:53,760 --> 00:33:00,159
if I use here an approach for stemming

625
00:32:56,480 --> 00:33:02,240
we reduce the number of word features by

626
00:33:00,159 --> 00:33:06,480
many thousands the sparsity unfortunately did not

627
00:33:04,640 --> 00:33:09,519
change as much but we reduced the number

628
00:33:06,480 --> 00:33:11,600
of features by a lot by bucketing those

629
00:33:09,519 --> 00:33:15,200
words together that our stemming algorithm says

630
00:33:12,720 --> 00:33:16,880
belong together so you know

631
00:33:15,200 --> 00:33:19,519
common sense says

632
00:33:16,880 --> 00:33:21,679
reducing the number of word features

633
00:33:19,519 --> 00:33:24,880
so dramatically is going to

634
00:33:22,720 --> 00:33:29,039
improve the performance of our machine learning model

635
00:33:26,640 --> 00:33:32,559
but that does assume that we

636
00:33:29,039 --> 00:33:35,039
have not lost any important information

637
00:33:32,559 --> 00:33:39,279
by stemming and it turns out that stemming or

638
00:33:37,120 --> 00:33:44,399
lemmatization can often be very helpful in some

639
00:33:41,440 --> 00:33:48,320
contexts but the typical algorithms used for these

640
00:33:45,519 --> 00:33:51,039
are somewhat aggressive

641
00:33:48,320 --> 00:33:53,840
and they have been built to favor sensitivity

642
00:33:52,399 --> 00:33:59,279
or recall or the true positive rate and

643
00:33:56,880 --> 00:34:01,760
this is at the expense of the

644
00:33:59,279 --> 00:34:04,640
specificity or the precision or the true

645
00:34:01,760 --> 00:34:07,200
negative rate so in a supervised machine

646
00:34:04,640 --> 00:34:09,679
learning context what this does is this

647
00:34:07,200 --> 00:34:13,839
affects a model's positive predictive

648
00:34:11,359 --> 00:34:19,200
value the precision or its ability to

649
00:34:16,399 --> 00:34:20,560
not incorrectly label true negatives

650
00:34:19,200 --> 00:34:25,760
as positive I hope I got that right
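
A minimal sketch of the metrics named above, computed from confusion-matrix counts in plain Python. The counts are hypothetical. Note that specificity (true negative rate) and precision (positive predictive value) are separate quantities, although both go down when a model produces more false positives.

```python
def rates(tp, fp, tn, fn):
    """Compute standard classification rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall, true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


# Hypothetical counts for a model that finds most positives
# but also mislabels many negatives as positive.
example = rates(tp=40, fp=20, tn=30, fn=10)
for name, value in example.items():
    print(name, round(value, 3))
```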

651
00:34:22,800 --> 00:34:28,960
so you know to make this more concrete

652
00:34:25,760 --> 00:34:31,359
stemming can increase a model's ability

653
00:34:28,960 --> 00:34:33,760
to find the positive examples

654
00:34:31,359 --> 00:34:36,320
of say the animal descriptions that are

655
00:34:33,760 --> 00:34:38,399
associated with say a certain diet if

656
00:34:36,320 --> 00:34:40,720
that's what we're modeling however if

657
00:34:38,399 --> 00:34:43,200
text is over-stemmed

658
00:34:40,720 --> 00:34:45,520
the resulting model loses its ability to

659
00:34:43,200 --> 00:34:47,200
label the negative examples

660
00:34:45,520 --> 00:34:49,760
say the descriptions that are not about

661
00:34:47,200 --> 00:34:52,240
that diet that's what we're looking for

662
00:34:49,760 --> 00:34:54,960
and this can be a real challenge when

663
00:34:52,240 --> 00:34:57,200
training models with text data kind of

664
00:34:54,960 --> 00:34:59,359
finding that balance there because

665
00:34:57,200 --> 00:35:01,280
often we don't have a

666
00:34:59,359 --> 00:35:05,119
dial that we can change on these

667
00:35:01,280 --> 00:35:08,400
stemming algorithms

668
00:35:05,119 --> 00:35:10,800
so even just very basic pre-processing

669
00:35:08,400 --> 00:35:13,200
for text like what I'm showing here in

670
00:35:10,800 --> 00:35:15,280
this feature engineering recipe can be

671
00:35:13,200 --> 00:35:17,440
computationally expensive

672
00:35:15,280 --> 00:35:20,079
and the choices that a practitioner

673
00:35:17,440 --> 00:35:24,400
makes like whether or not to remove

674
00:35:20,079 --> 00:35:27,599
stopwords or to stem text can have dramatic

675
00:35:24,400 --> 00:35:30,320
impact on how machine learning models

676
00:35:27,599 --> 00:35:32,320
of all kinds perform whether those are

677
00:35:30,320 --> 00:35:34,000
simpler models

678
00:35:32,320 --> 00:35:36,000
more traditional machine learning models

679
00:35:34,000 --> 00:35:42,560
or deep learning models what this means is that

680
00:35:39,040 --> 00:35:45,520
the prioritization that we

681
00:35:42,560 --> 00:35:47,920
as practitioners give to like learning

682
00:35:45,520 --> 00:35:50,079
teaching and writing about feature

683
00:35:47,920 --> 00:35:53,040
engineering steps for text really

684
00:35:50,079 --> 00:35:56,320
contributes to better more robust

685
00:35:53,040 --> 00:35:58,560
statistical practice in our field

686
00:35:56,320 --> 00:36:01,359
I mentioned before the sparsity of text

687
00:35:58,560 --> 00:36:04,079
data and I want to come back to that

688
00:36:01,359 --> 00:36:06,720
because it is one of text data's really

689
00:36:04,079 --> 00:36:14,320
defining characteristics because of just how language works

690
00:36:10,400 --> 00:36:16,640
we use a few words a lot of times

691
00:36:14,320 --> 00:36:19,599
and then a lot of words only just a

692
00:36:16,640 --> 00:36:22,640
couple of times only a few times

693
00:36:19,599 --> 00:36:25,119
and with a real set of natural language

694
00:36:22,640 --> 00:36:27,520
you end up with relationships that look

695
00:36:25,119 --> 00:36:30,640
like this that look like these plots in

696
00:36:27,520 --> 00:36:32,400
terms of how the sparsity changes as you

697
00:36:30,640 --> 00:36:35,680
add more documents

698
00:36:32,400 --> 00:36:38,320
and more unique words to a corpus

699
00:36:35,680 --> 00:36:43,040
so the sparsity goes up real fast as you

700
00:36:38,320 --> 00:36:45,760
add more unique words and the memory

701
00:36:43,040 --> 00:36:51,920
that is required to handle

702
00:36:48,480 --> 00:36:53,520
this set of documents goes up very fast

703
00:36:51,920 --> 00:36:57,520
so even if you use specialized data

704
00:36:56,320 --> 00:37:03,280
structures meant to store sparse data like sparse

705
00:37:00,079 --> 00:37:05,920
matrices you still end up growing the

706
00:37:03,280 --> 00:37:07,599
memory required to handle these data

707
00:37:05,920 --> 00:37:13,119
sets in a very non-linear way it still grows very

708
00:37:09,920 --> 00:37:15,520
fast so this means it can take a very

709
00:37:13,119 --> 00:37:19,839
long time to train your model or even that

710
00:37:16,960 --> 00:37:21,760
you outgrow the memory

711
00:37:19,839 --> 00:37:23,839
available on your machine you have to go

712
00:37:21,760 --> 00:37:26,079
to the cloud to an expensive

713
00:37:23,839 --> 00:37:31,200
big memory situation this can be a real challenge

714
00:37:27,680 --> 00:37:34,320
and this challenge

715
00:37:31,200 --> 00:37:39,359
is what is behind the motivation for vector

716
00:37:37,200 --> 00:37:43,520
models for language so

717
00:37:40,240 --> 00:37:46,480
linguists have worked for a long time

718
00:37:43,520 --> 00:37:49,680
on vector models for language that can

719
00:37:46,480 --> 00:37:55,280
reduce the number of dimensions representing text data

720
00:37:51,680 --> 00:37:58,160
based on how people use language

721
00:37:55,280 --> 00:38:04,960
so this quote here goes all the way back to 1957.

722
00:38:02,960 --> 00:38:09,280
so the idea here is that we use

723
00:38:07,119 --> 00:38:12,800
like the data is very sparse

724
00:38:09,280 --> 00:38:15,440
but we don't use words

725
00:38:12,800 --> 00:38:17,359
randomly it's not independent the words

726
00:38:15,440 --> 00:38:19,599
are not used independently of each other

727
00:38:17,359 --> 00:38:22,240
but rather there's relationships that

728
00:38:19,599 --> 00:38:25,920
exist between how words are used together

729
00:38:23,359 --> 00:38:31,040
and we can use those relationships to create

730
00:38:27,599 --> 00:38:33,599
to transform our sparse high dimensional

731
00:38:31,040 --> 00:38:37,359
space into a special dense

732
00:38:34,560 --> 00:38:40,160
low dimensional space lower

733
00:38:37,359 --> 00:38:42,480
it still has like 100

734
00:38:40,160 --> 00:38:46,160
dimensions but much lower than the many thousands

735
00:38:44,000 --> 00:38:48,160
tens or hundreds of thousands

736
00:38:46,160 --> 00:38:49,839
of dimensions so the idea here is we use

737
00:38:48,160 --> 00:38:55,599
statistical modeling maybe just

738
00:38:51,839 --> 00:38:58,240
word counts plus matrix factorization

739
00:38:55,599 --> 00:39:00,480
maybe fancier math that involves neural

740
00:38:58,240 --> 00:39:03,680
networks to take this really high

741
00:39:00,480 --> 00:39:05,839
dimensional space and we create a new

742
00:39:03,680 --> 00:39:08,480
lower dimensional

743
00:39:05,839 --> 00:39:12,000
space that is special because the new

744
00:39:08,480 --> 00:39:15,040
space is created based on vectors that

745
00:39:12,000 --> 00:39:18,560
incorporate information

746
00:39:15,040 --> 00:39:21,839
about which words are used together so

747
00:39:18,560 --> 00:39:26,480
you shall know a word by the company it keeps
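
A minimal sketch of the "company it keeps" idea using only word counts, in plain Python. It builds a co-occurrence vector for each word and compares words with cosine similarity; the matrix factorization or neural network step that would turn these counts into dense low dimensional embeddings is left out. The sentences, the window size, and the helper names are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sentences = [
    "there was an error on my statement",
    "there was a mistake on my statement",
    "the payment is due every month",
]
vectors = cooccurrence_vectors(sentences)
print(cosine(vectors["error"], vectors["mistake"]))  # 0.75
print(cosine(vectors["error"], vectors["month"]))    # 0.0
```

Words used in similar contexts get similar vectors, so error lands close to mistake and far from month even though the two words never appear in the same sentence.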

748
00:39:24,079 --> 00:39:29,359
so you need a big data set of text to

749
00:39:26,480 --> 00:39:32,000
create or learn these kinds of word

750
00:39:29,359 --> 00:39:35,119
vectors or word embeddings

751
00:39:32,000 --> 00:39:37,040
so this table that I'm showing right now

752
00:39:35,119 --> 00:39:40,560
it's from a set of embeddings that

753
00:39:37,040 --> 00:39:48,000
I created using a data set or a corpus of complaints

754
00:39:44,640 --> 00:39:51,680
to the United States Consumer

755
00:39:49,200 --> 00:39:53,839
Financial Protection Bureau

756
00:39:51,680 --> 00:39:56,640
so this is a government body in the

757
00:39:53,839 --> 00:40:00,720
United States where people can complain and say

758
00:39:57,920 --> 00:40:03,680
what is wrong with something to do with

759
00:40:00,720 --> 00:40:07,599
a financial product like a credit card

760
00:40:04,720 --> 00:40:09,760
a mortgage a student loan

761
00:40:07,599 --> 00:40:11,119
something to do with like a financial

762
00:40:09,760 --> 00:40:13,119
product they're like something went

763
00:40:11,119 --> 00:40:15,359
wrong with my credit card something went

764
00:40:13,119 --> 00:40:17,520
wrong with my mortgage that company is

765
00:40:15,359 --> 00:40:24,000
not being fair so you come and you complain to it

766
00:40:20,000 --> 00:40:27,839
so I took all those complaints

767
00:40:25,440 --> 00:40:30,160
that's our high dimensional space and

768
00:40:27,839 --> 00:40:32,400
built a low dimensional space

769
00:40:30,160 --> 00:40:37,680
and we can look in that space and understand

770
00:40:33,920 --> 00:40:40,240
what words are related to each other

771
00:40:37,680 --> 00:40:44,079
in this space so in the new space

772
00:40:40,240 --> 00:40:48,000
defined by the embeddings the word month

773
00:40:44,079 --> 00:40:50,880
is closest to words like year months plural

774
00:40:49,680 --> 00:40:56,480
monthly installments payment so these are words

775
00:40:54,560 --> 00:41:01,280
that make sense in the context of

776
00:40:57,839 --> 00:41:05,119
financial products like credit cards or mortgages

777
00:41:03,359 --> 00:41:09,680
in the new space defined by these embeddings

778
00:41:06,720 --> 00:41:11,760
the word error is closest to the words

779
00:41:09,680 --> 00:41:17,680
like mistake clerical like a clerical mistake

780
00:41:14,960 --> 00:41:20,880
problem glitch or there was a glitch on my

781
00:41:19,040 --> 00:41:26,240
mortgage statement so we see these kinds of

782
00:41:23,440 --> 00:41:28,319
or miscommunication misunderstanding you

783
00:41:26,240 --> 00:41:30,880
know like these are words that

784
00:41:28,319 --> 00:41:33,599
are used in similar ways so

785
00:41:31,599 --> 00:41:37,200
you don't have to create embeddings yourself

786
00:41:34,800 --> 00:41:39,520
because it requires quite a lot of data

787
00:41:37,200 --> 00:41:42,240
to make them so you can use word

788
00:41:39,520 --> 00:41:44,880
embeddings that are pre-trained

789
00:41:42,240 --> 00:41:47,520
i.e. created by someone else

790
00:41:44,880 --> 00:41:49,839
based on some huge corpus of data that

791
00:41:47,520 --> 00:41:54,079
they have access to and you probably don't

792
00:41:51,359 --> 00:41:55,599
so let's look at one of those data sets

793
00:41:54,079 --> 00:42:02,800
this table shows the results for the same word error

794
00:42:00,160 --> 00:42:04,560
but for the GloVe embeddings so the

795
00:42:02,800 --> 00:42:07,359
GloVe embeddings are a set of

796
00:42:04,560 --> 00:42:09,920
pre-trained embeddings that are created

797
00:42:07,359 --> 00:42:12,800
based on a very large data set that's like

798
00:42:11,040 --> 00:42:16,640
all of Wikipedia all of the Google News data set

799
00:42:15,280 --> 00:42:24,800
just like huge swaths of the internet have been

800
00:42:21,440 --> 00:42:27,280
fed in to create these embeddings

801
00:42:24,800 --> 00:42:31,680
so some of the closest words here are similar

802
00:42:29,520 --> 00:42:34,800
to those from before but we no

803
00:42:31,680 --> 00:42:38,640
longer have some of that domain specific

804
00:42:34,800 --> 00:42:40,839
flavor like clerical discrepancy

805
00:42:38,640 --> 00:42:43,280
and now we have like

806
00:42:40,839 --> 00:42:46,720
miscommunication you know but now we

807
00:42:43,280 --> 00:42:48,960
have calculation and probability

808
00:42:46,720 --> 00:42:51,599
which people were not talking about with

809
00:42:48,960 --> 00:42:54,319
their financial product complaints

810
00:42:51,599 --> 00:42:58,240
so this really highlights

811
00:42:54,319 --> 00:43:00,800
how these work here before

812
00:42:58,240 --> 00:43:02,640
we created our own and we were able

813
00:43:00,800 --> 00:43:06,800
to learn relationships that were

814
00:43:02,640 --> 00:43:10,400
specific to this context and here we go

815
00:43:06,800 --> 00:43:13,119
to a more general set that

816
00:43:10,400 --> 00:43:16,720
was learned somewhere else

817
00:43:13,119 --> 00:43:19,599
so embeddings are trained or learned

818
00:43:16,720 --> 00:43:22,079
from a large corpus of text data and the

819
00:43:19,599 --> 00:43:24,240
characteristics of that corpus become

820
00:43:22,079 --> 00:43:27,359
part of the embeddings

821
00:43:24,240 --> 00:43:30,160
so machine learning in general you know

822
00:43:27,359 --> 00:43:32,240
is exquisitely sensitive to whatever it

823
00:43:30,160 --> 00:43:35,839
is that's in your training data and this

824
00:43:32,240 --> 00:43:38,160
is never more obvious than when

825
00:43:35,839 --> 00:43:40,319
dealing with text data

826
00:43:38,160 --> 00:43:42,400
and perhaps word embeddings are

827
00:43:40,319 --> 00:43:43,760
just like one of these classic examples

828
00:43:42,400 --> 00:43:48,160
where this is true it turns out that

829
00:43:46,079 --> 00:43:54,400
this shows up in how any human

830
00:43:51,040 --> 00:43:56,480
prejudice or bias in the corpus

831
00:43:54,400 --> 00:44:01,440
becomes imprinted into the embeddings

832
00:43:59,040 --> 00:44:04,720
so in fact when we look at some of these

833
00:44:01,440 --> 00:44:07,040
most commonly available embeddings that

834
00:44:04,720 --> 00:44:14,640
are out there we see that

835
00:44:12,200 --> 00:44:17,520
first names that are

836
00:44:14,640 --> 00:44:19,680
more common for African Americans in the

837
00:44:17,520 --> 00:44:23,280
United States are associated with

838
00:44:19,680 --> 00:44:25,520
more unpleasant feelings than European

839
00:44:23,280 --> 00:44:30,240
American first names in these embedding spaces

840
00:44:27,680 --> 00:44:31,599
women's first names are more associated

841
00:44:30,240 --> 00:44:38,640
with family and men's first names are more associated with career

842
00:44:36,079 --> 00:44:40,560
and terms associated with women are more

843
00:44:38,640 --> 00:44:42,640
associated with the arts and terms

844
00:44:40,560 --> 00:44:44,079
associated with men are more associated

845
00:44:42,640 --> 00:44:51,440
with science so it turns out actually bias is so

846
00:44:48,319 --> 00:44:54,160
ingrained in word embeddings that the

847
00:44:51,440 --> 00:44:56,800
word embeddings themselves can be used

848
00:44:54,160 --> 00:45:01,440
to quantify change

849
00:44:58,480 --> 00:45:05,040
in social attitudes over time

850
00:45:01,440 --> 00:45:08,160
so word embeddings are

851
00:45:05,040 --> 00:45:11,119
maybe an exaggerated or extreme example

852
00:45:08,160 --> 00:45:13,599
but it turns out that all the feature

853
00:45:11,119 --> 00:45:16,720
engineering decisions that we make when

854
00:45:13,599 --> 00:45:18,480
it comes to text data have a significant

855
00:45:16,720 --> 00:45:21,440
effect on our results

856
00:45:18,480 --> 00:45:26,000
both in terms of the model performance that we see

857
00:45:22,560 --> 00:45:29,520
and also in terms of how appropriate or

858
00:45:26,000 --> 00:45:31,599
fair our models are

859
00:45:29,520 --> 00:45:34,560
so given all that when it comes to

860
00:45:31,599 --> 00:45:37,040
pre-processing your text data

861
00:45:34,560 --> 00:45:39,680
creating these features that you need

862
00:45:37,040 --> 00:45:43,520
you have a lot of options and quite a

863
00:45:39,680 --> 00:45:46,560
bit of responsibility so my advice is

864
00:45:43,520 --> 00:45:50,480
always start with simpler models that

865
00:45:46,560 --> 00:45:50,480
you can understand quite deeply

866
00:45:50,640 --> 00:45:56,720
be sure to adopt good statistical

867
00:45:53,520 --> 00:45:59,040
practices as you train and tune your

868
00:45:56,720 --> 00:46:04,720
models so you aren't fooled about model

869
00:46:01,839 --> 00:46:07,119
performance improvements

870
00:46:04,720 --> 00:46:10,400
when you try different approaches

871
00:46:07,119 --> 00:46:12,839
and also to use model explainability

872
00:46:10,400 --> 00:46:16,079
tools and frameworks so you can

873
00:46:12,839 --> 00:46:18,000
understand any less straightforward

874
00:46:16,079 --> 00:46:20,240
models that you try

875
00:46:18,000 --> 00:46:22,640
so my co-workers and I have written

876
00:46:20,240 --> 00:46:24,720
about all of these topics and how to

877
00:46:22,640 --> 00:46:26,880
use them with Tidymodels if that's what

878
00:46:24,720 --> 00:46:30,319
you like to use and we will continue to do so

879
00:46:28,720 --> 00:46:33,040
with that I will say

880
00:46:30,319 --> 00:46:34,880
thank you so very much and I want to be

881
00:46:33,040 --> 00:46:38,960
sure to again

882
00:46:36,560 --> 00:46:42,560
thank the organizers of the R user group

883
00:46:38,960 --> 00:46:45,119
in Korea I want to thank my teammates on

884
00:46:42,560 --> 00:46:47,440
the Tidymodels team at RStudio as

885
00:46:45,119 --> 00:46:51,240
well as my co-author Emil Hvitfeldt.

