Hi, my name is Julia Silge. I'm a data scientist and software engineer at RStudio, and I'd like to thank the organizers of the R user groups in Korea so much for having me today. I'm happy to be speaking specifically about creating features for machine learning from text data, for a couple of reasons. Having a better understanding of what we do to take text data and make it appropriate as input for a machine learning algorithm has many benefits, whether you are directly getting ready to train a model, or you're at the beginning of some text analysis project, or you are trying to understand the behavior of a model that you're interacting with in some way, which is something that we do in our work as data scientists, and in our lives, more and more.

When we build models for text, either supervised or unsupervised, we start with something like this. This is some example text data that I'll use a couple of times during this talk; it describes some animals. This text data, to me as an English speaker, looks familiar; as someone who uses a human language, I look at this and I can read it, I could speak it out loud, and I understand and can interpret what it means. This kind of natural language data is being generated all the time, in all kinds of languages and all kinds of contexts. Whether you work in healthcare, in tech, in finance, basically in any kind of organization, this sort of text data is being generated by customers, by clients, by internal stakeholders inside a business, by people taking surveys, by social media, by business processes. In all this natural language there is information latent in that text that can be used to make better decisions. However, computers are not great at looking at this and doing math on language in a representation like this. Instead, language has to be transformed to some kind of machine-readable numeric representation that looks more like what I'm showing here on the screen, to be ready for almost any kind of model.

I've spent a fair amount of time working on software for people to be able to do exploratory data analysis, visualization, summarization, tasks like that, with text data in a tidy format where we have one observation per row, and I love using tidy data principles for text analysis, especially during those exploratory phases of analysis. When it comes time to build a model, though, what the underlying mathematical implementation really needs is typically something like this. This particular representation is called a document-term matrix; the exact representation may differ from what I've shown here. What I have here is weighted by counts: each row in this matrix is a document, each column is a word (a token), and the numbers represent counts, how many times each document used each word. You could weight it in a different way, using say tf-idf instead of counts, or you might keep sequence information if you are interested in building a deep learning model (a small sketch of building a count-weighted matrix like this from tidy text follows below). But basically, for all kinds of text modeling, from simpler models like naive Bayes (which works well for text), to word embeddings, to really the most state-of-the-art work happening today like transformers for text data, we have to heavily feature engineer and process language to get it to some kind of representation that is suitable for machine learning algorithms.
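To make that concrete, here is a minimal sketch of one way to go from a tidy, one-token-per-row data frame to a count-weighted document-term matrix using the tidytext package; the `animal_descriptions` data frame and its two example sentences are hypothetical stand-ins for the real dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in data: one row per document, with an id and the raw text
animal_descriptions <- tibble(
  doc_id = c(1, 2),
  text = c(
    "The collared peccary is also referred to as the javelina.",
    "The peccary is a medium-sized hoofed mammal."
  )
)

word_counts <- animal_descriptions %>%
  unnest_tokens(word, text) %>%   # one word token per row
  count(doc_id, word)             # how many times each document used each word

# Cast the tidy counts to a document-term matrix: rows are documents, columns are words
dtm <- word_counts %>%
  cast_dtm(document = doc_id, term = word, value = n)

# For a tf-idf weighting instead of raw counts, bind_tf_idf() could be used before casting
```

Note that cast_dtm() builds a tm-style document-term matrix, so the tm package needs to be installed; cast_sparse() is an alternative that returns a plain sparse matrix.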
So, I work on an open-source framework in R for modeling and machine learning called tidymodels, and the examples that I'll be showing today use tidymodels code. Some of the specific goals of the project are to provide a consistent, flexible framework for real-world modeling practitioners dealing with real-world data, from those who are just starting out to those who are very experienced in modeling, to harmonize the heterogeneous interfaces that exist within R, and to encourage good statistical practice. I'm glad to get to show you some of what I work on and build and how we apply it to text modeling, but a lot of what I will talk about today isn't very specific to tidymodels or even to R; I know this is an R user group, but what we're going to focus on is a little more conceptual and basic: how we transform text into predictors for machine learning.

I am excited to talk about tidymodels. Tidymodels, if you have not used it before, is a metapackage in a similar way that the tidyverse is a metapackage. If you've ever typed library(tidyverse) and then used ggplot2 for visualization and dplyr for data manipulation, tidymodels works in a similar way: there are different packages inside of it that are used for different purposes. The pre-processing or feature engineering is part of a broader modeling process. That process starts with exploratory data analysis, which helps us decide what kind of model we will build, and it comes to completion, I would argue, with model evaluation, when you measure how well your model performs. Tidymodels as a piece of software is made up of R packages, each of which has a specific focus. rsample is for resampling data: creating bootstrap resamples, cross-validation resamples, all the different kinds of resamples you might want to use to train and evaluate models. The tune package is for hyperparameter tuning, as you might guess from the name. And one of these packages is for data preprocessing and feature engineering, and it is the one called recipes.

In tidymodels we capture this idea of data preprocessing and feature engineering in the concept of a pre-processing recipe that has steps. You choose the ingredients, or variables, that you're going to use, then you define the steps that go into your recipe, then you prepare the recipe using training data, and then you can apply it to any dataset, like testing data or new data at prediction time.
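Here is a minimal sketch of that recipe lifecycle with the recipes package, just to make the prepare-then-apply idea concrete; the tiny `training_data` and `testing_data` tibbles and their columns are hypothetical, not the talk's actual datasets.

```r
library(dplyr)
library(recipes)

# Tiny hypothetical data standing in for real training and testing sets
training_data <- tibble(year = c(1950, 1980, 2010), n_words = c(120, 85, 40))
testing_data  <- tibble(year = c(1995, 2005), n_words = c(60, 75))

# Choose the ingredients (variables) and define the steps
rec <- recipe(year ~ n_words, data = training_data) %>%
  step_normalize(all_numeric_predictors())

# Prepare (estimate) the steps using only the training data
rec_prepped <- prep(rec, training = training_data)

# Apply the prepared recipe to any dataset: testing data, or new data at prediction time
bake(rec_prepped, new_data = testing_data)
```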
The variables, or ingredients, that we use in modeling come in all kinds of shapes and sizes, including text data. Some of the techniques and approaches that we use for preprocessing text data are the same as for any other kind of data you might use, like numeric data or categorical data, so some of that is the same, but a lot of what you need to know to be able to do a good job of this process for text is different and is specific to the nature of language data.

I've written a book with my co-author Emil Hvitfeldt, Supervised Machine Learning for Text Analysis in R, and fully the first third of the book focuses on how we transform the natural language that we have in text data into features for modeling. The middle section is about how we use those features in simpler or more traditional machine learning models, like regularized regression or support vector machines, and the last third of the book talks about how we use deep learning models with text data. Deep learning models still require these kinds of transformations from natural language into features, but they are often able to learn structure from text features in ways that those more traditional or simpler machine learning models are not. This book is now complete and available as of this month, as of November; folks are getting their first paper copies, and the book is also available in its entirety online at smltar.com. If you're new to dealing with text data, understanding these fundamental pre-processing approaches for text will set you up to train effective models. If you're really experienced with text data and have dealt with a lot of it already, you've probably noticed, like we have, that the existing resources and literature, whether books, tutorials, or blog posts, are quite sparse when it comes to detailed, thoughtful explorations of how these pre-processing steps work and how the choices made in these feature engineering steps impact our model output.

So let's walk through several of these basic feature engineering approaches, how they work and what they do. Let's start with tokenization. Tokenization is typically one of the first steps in the transformation from natural language to machine learning features, for really any kind of text analysis, including exploratory data analysis or building a model. In tokenization we take an input (some string, some character vector) and some kind of token type (some meaningful unit of text that we're interested in, like a word), and we split the input into pieces, into tokens, that correspond to the type we're interested in. Most commonly the meaningful unit or type of token that we want to split text into is a word. That sounds straightforward or obvious, but it turns out it's difficult to clearly define what a word is for many or even most languages. Many languages do not use white space between words at all, which is a challenge for tokenization, and even languages that do use white space, like English and Korean, often have particular examples that are ambiguous, like contractions in English such as "didn't", which maybe should more accurately be considered two words, the way particles are used in Korean, and the way pronouns and negation words are written in Romance languages like Italian and French, where they're stuck together and really maybe should be considered two words. Once you have figured out what you're going to do, made your choices, and tokenized your text, then it's on its way to being able to be used in exploratory data analysis, in unsupervised algorithms, or as features for predictive modeling, which is what we're talking about here.
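As a tiny illustration of word tokenization, here is a sketch using the tokenizers package; the sentence is a hypothetical stand-in for one of the animal descriptions.

```r
library(tokenizers)

# Hypothetical stand-in for one of the animal descriptions
text <- "The collared peccary is also referred to as the javelina."

# Split the character vector into word tokens
# (by default this lowercases the text and strips punctuation)
tokenize_words(text)
```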
And that's what these results show here. These results are from a regression model trained on descriptions of media from artwork in the Tate collection in the UK. What we are predicting is what year a piece of art was created, based on the medium the artwork was created with, and the medium is described with a little bit of text. We see here that artwork created using graphite, watercolor, and engraving was more likely to be created earlier, more likely to be older artwork, while artwork created using photography, screen prints or screen printing, dung, and glitter is more likely to be created later, more likely to be contemporary or modern art. We started with natural, human-generated text, people writing out descriptions of the media these pieces of artwork were created with, and the way we tokenize that text has a big impact on what we learn from it. If we had tokenized in a different way, we would have gotten different results, both in terms of performance, how accurately we were able to predict the year, and in terms of how we interpret the model, what we are able to learn from it.

So this is one kind of tokenization, single words, but there is another way to tokenize: instead of breaking text up into single words, or unigrams, we can tokenize to n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text. This shows that same little bit of text describing this animal divided up into bigrams, n-grams of two tokens. Notice how the words in the bigrams overlap: the word "collared" appears in both of the first bigrams, "the collared" and "collared peccary", then "peccary also", "also referred", "referred to". So n-gram tokenization slides along the text to create overlapping sets of tokens; this next slide shows trigrams for the same text. Using unigrams, one word, is faster and more efficient, but we don't capture information about word order. Using a higher value for n, say two or three, captures more complex information about word order and concepts that are described in multi-word phrases, but the vector space of tokens increases dramatically, and that corresponds to a reduction in token counts: we don't count each token very many times, and that means, depending on your particular dataset, you might not be able to get good results. Combining different degrees of n-grams can allow you to extract different levels of detail from text: unigrams can tell you which individual words have been used many times, and some of those words might be overlooked in bigram or trigram counts if they don't come up with other words as often (there's a small sketch of n-gram tokenization at the end of this passage).

This plot compares model performance for a lasso regression model predicting the year of United States Supreme Court opinions, with three different degrees of n-grams. What we're doing here is taking the text of the writings of the United States Supreme Court and predicting when that text was written: can we predict how old a piece of text is from its contents? Holding the number of tokens constant at a thousand, using unigrams alone performs best for this corpus of Supreme Court opinions. This is not always the case; depending on the kind of model you use and the dataset itself, we might see the best performance combining unigrams and bigrams, or maybe some other option. In this case, if we wanted to incorporate some of that more complex information that we have in the bigrams and trigrams, we would probably need to increase the number of tokens in the model quite a bit. Keep in mind when you look at results like these that identifying n-grams is computationally expensive, especially compared to the amount of improvement in model performance that we often see.
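Here is the small n-gram sketch mentioned above, again using the tokenizers package and a hypothetical sentence; the `n_min` argument is one way to combine different degrees of n-grams, say unigrams and bigrams together.

```r
library(tokenizers)

text <- "The collared peccary is also referred to as the javelina."

# Overlapping bigrams: the tokenizer slides along the text two words at a time
tokenize_ngrams(text, n = 2)

# Combining degrees: unigrams and bigrams together in one set of tokens
tokenize_ngrams(text, n = 2, n_min = 1)
```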
If we see some modest improvement by adding in bigrams, it's important to keep in mind how much improvement we see relative to how long it takes to identify the bigrams and then train that model. For example, for this dataset of Supreme Court opinions, holding the number of tokens constant so that model training had the same number of tokens, using bigrams plus unigrams takes twice as long to do the feature engineering and the training as unigrams alone, and adding in trigrams as well takes almost five times as long as training on unigrams alone. So this is a computationally expensive thing to do.

Going in the other direction, we can tokenize to units smaller than words. These are what are called character shingles: we take words, "the collared peccary", and instead of looking at words we can go down and look at subword information. There are multiple ways to break words up into subwords appropriate for machine learning, and all of these kinds of approaches or algorithms have the benefit of being able to encode unknown or new words at prediction time. When it's time to make a prediction on new data, it's not uncommon for there to be new vocabulary words, and if we didn't see them in the training data, what are we going to do about them? When we train using subword information, we can often handle those new words as long as we saw their subwords in the training dataset. Using this kind of subword information is also a way to incorporate morphological information into our models of various kinds, and this is something that applies to many languages, not just English. These results are for a classification model with a dataset of very short text, just the names of post offices in the United States, so super short, and the goal of the model is to predict whether the post office is located in Hawaii, in the middle of the Pacific Ocean, or in the rest of the United States. I created features for the model that are subwords of the post office names, and we end up learning that names containing certain subwords, like ones starting with "h" and "p", are more likely to be in Hawaii, and that subwords like "er", "i", and "ing" are more likely to come from post offices outside of Hawaii. So this is an example of how, when we tokenize differently, we're able to learn something new, something different.

In tidymodels we collect all of these kinds of decisions about tokenization in code that looks like the sketch below. We start with a recipe that specifies what variables, or ingredients, we will use, and then we define these pre-processing steps. So even at this first and arguably most simple and basic step, the choices that we make affect our modeling results in a big way.
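Here is a minimal sketch of what those tokenization decisions look like as recipe steps, using the textrecipes package; the `animal_descriptions` data frame, its `diet` outcome, and the particular filtering and weighting steps are hypothetical stand-ins rather than the exact code from the talk.

```r
library(dplyr)
library(recipes)
library(textrecipes)

# Hypothetical stand-in data: an outcome `diet` and a text column `text`
animal_descriptions <- tibble(
  diet = c("herbivore", "omnivore"),
  text = c("The collared peccary eats roots, grasses, and fruit.",
           "The peccary will also eat small animals and insects.")
)

rec <- recipe(diet ~ text, data = animal_descriptions) %>%
  # Split the text into word tokens; token = "ngrams" with options = list(n = 2)
  # would tokenize to bigrams instead
  step_tokenize(text) %>%
  # Keep only the most frequent tokens, then weight by term frequency (counts)
  step_tokenfilter(text, max_tokens = 100) %>%
  step_tf(text)
```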
The next pre-processing step that I want to talk about is stop words. Once we have split text into tokens, we often find that not all words carry the same amount of information, if maybe any information at all, for a machine learning task. Common words that carry little or perhaps no information are called stop words. This is one of the stop word lists that's available for Korean. Common advice and practice is to say, hey, just remove stop words, for a lot of natural language processing tasks. What I'm showing here is the entirety of one of the shorter English stop word lists that's used really broadly; you can see it's words like "I", "me", "my", pronouns, conjunctions, these very common words that are not considered super important.

The decision to just remove stop words is often more involved, and perhaps more fraught, than what you'll find reflected in a lot of the resources that are out there. Almost all of the time, real-world NLP practitioners use pre-made stop word lists. This plot visualizes set intersections for three common stop word lists in English, in what is called an UpSet plot. The three lists are the Snowball list, the SMART list, and the ISO list. You can see that the lengths of the lists are represented by the lengths of the bars, and then we see the intersections, which words these lists have in common, in the vertical bars. The lengths of the lists are quite different, and also notice that they don't all contain the same sets of words. The important thing to remember about stop word lexicons is that they are not created in some neutral, perfect setting; instead, they are context-specific and they can be biased. Both of those things are true because they are lists created from large datasets of language, so they reflect the characteristics of the data used in their creation. These are the ten words that are in the English SMART lexicon but not in the English Snowball lexicon. Notice that they're all contractions, but that's not because the Snowball lexicon doesn't include contractions; it has a lot of them. Also notice that "he's" is on this list: that means this lexicon has "he's" but it does not have "she's". This is an example of the bias I mentioned that occurs because these lists are created from large datasets of text. Lexicon creators look at the most frequent words in some big corpus of language, make a cutoff, and then make some decisions about what to include or exclude based on the list they have, and you end up here: because many large datasets of language have more representation of men, you end up with a situation where a stop word list has "he's" but not "she's". Like so many decisions when it comes to modeling or analysis with language, we as practitioners have to decide what is appropriate for our particular domain, and it turns out this is even true when it comes to picking a stop word list.

In tidymodels we can implement a pre-processing step like removing stop words by adding an additional step to a recipe, as in the sketch at the end of this passage. First we specify what variables we're going to use, then we tokenize the text, and then we remove stop words, here using just a default list since we are not passing any other arguments; we could use a non-default list, or even a custom list that is most appropriate to our domain. This plot compares model performance for predicting the year of that same dataset of Supreme Court opinions with three different stop word lexicons of different lengths. The Snowball lexicon contains the smallest number of words, and in this case it results in the best performance, so removing fewer stop words results in the best performance here. This specific result is not generalizable to all datasets and contexts, but the fact that removing different sets of stop words can have noticeably different effects on your model, that is quite transferable. The only way to know what is the best thing to do is to try several options and see. Machine learning in general is an empirical field; we often don't have reasons a priori to know what will be the best thing to do, so typically we have to try different options and see.
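Here is that stop word removal step sketched out, continuing the hypothetical recipe from the earlier tokenization sketch; the alternative sources shown in the comments are just examples of what could be passed.

```r
library(dplyr)
library(recipes)
library(textrecipes)

# Same hypothetical data as in the tokenization sketch above
animal_descriptions <- tibble(
  diet = c("herbivore", "omnivore"),
  text = c("The collared peccary eats roots, grasses, and fruit.",
           "The peccary will also eat small animals and insects.")
)

rec <- recipe(diet ~ text, data = animal_descriptions) %>%
  step_tokenize(text) %>%
  # With no extra arguments this uses a default (Snowball) stop word list
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 100) %>%
  step_tf(text)

# A different pre-made list, or a custom domain-specific list, could be used instead:
# step_stopwords(text, stopword_source = "smart")
# step_stopwords(text, custom_stopword_source = c("animal", "species"))
```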
The third pre-processing step that I want to talk about for text is stemming. When we deal with text, documents often contain different versions of one base word, often called a stem. Say, for an English example, we aren't interested in the difference between "animals", plural, and "animal", singular, and we want to treat them together; that idea is at the heart of stemming. There is no one right or correct way to stem text. This plot shows three approaches for stemming in English, starting from "just remove an s", then more complex rules about handling plural endings (that middle one is called the S-stemmer; it's a set of rules), and then the last one is probably the best-known implementation of stemming in English, called the Porter algorithm. You can see here that Porter stemming is the most different from the other two. In the top twenty words here from the dataset of animal descriptions that I've been using, we see how the word "species" was treated differently, and how "animal", "predator", and the collection of words "live", "living", "life", "lives" were treated differently as well.

Practitioners are typically interested in stemming text data because it buckets together tokens that we believe belong together, in a way that we understand as human users of language. We can use approaches like these, which are pretty step-by-step and rules-based (this is what is typically called stemming, and it's fairly algorithmic in nature: first do this test, then do that), or we can use lemmatization, which is usually based on large dictionaries of words and incorporates a linguistic understanding of which words belong together. Most of the existing approaches for this kind of task in Korean are lemmatizers based on dictionaries that are trained using large datasets of language.

This seems like it's going to be a helpful thing to do; when you hear about it you think, oh yeah, sounds good, sounds smart, especially because with text data we are typically overwhelmed with features, with tokens, with numbers of tokens. This is typically the situation when we're dealing with text data: here we have this animal description data, and I made a matrix representation of it like we would typically use in some machine learning algorithms. Look how many features there are: sixteen thousand, almost seventeen thousand features that would be going into the model. And look at the sparsity: 98% sparse, which is very sparse data. This is the sparsity of the data that will go into the machine learning algorithm to build our supervised machine learning model. If we stem the words (here I used one approach for stemming, like the small sketch below), we reduce the number of word features by many thousands. The sparsity unfortunately did not change as much, but we reduced the number of features by a lot, by bucketing together those words that our stemming algorithm says belong together. Common sense says that reducing the number of word features so dramatically is going to improve the performance of our machine learning model, but that does assume that we have not lost any important information by stemming.
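For reference, here is a minimal sketch of Porter stemming in R using the SnowballC package, with a handful of made-up example words; in a textrecipes recipe the analogous step is step_stem().

```r
library(SnowballC)

# Hypothetical example words with several forms of the same base words
words <- c("animal", "animals", "species", "live", "living", "lives", "life")

# Porter/Snowball stemming buckets related word forms together under one stem
wordStem(words, language = "english")
```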
It turns out that stemming or lemmatization can often be very helpful in some contexts, but the typical algorithms used for these are somewhat aggressive, and they have been built to favor sensitivity (or recall, or the true positive rate) at the expense of specificity (or precision, or the true negative rate). In a supervised machine learning context, what this means is that it affects a model's positive predictive value, its precision, its ability to not incorrectly label true negatives as positive; I hope I got that right. To make this more concrete, stemming can increase a model's ability to find the positive examples, say the animal descriptions that are associated with a certain diet, if that's what we're modeling. However, if text is over-stemmed, the resulting model loses its ability to label the negative examples, say the descriptions that are not about the diet we're looking for. This can be a real challenge when training models with text data, finding that balance, because often we don't have a dial that we can adjust on these stemming algorithms.

So even just very basic pre-processing for text, like what I'm showing here in this feature engineering recipe, can be computationally expensive, and the choices that a practitioner makes, like whether or not to remove stop words or to stem text, can have a dramatic impact on how machine learning models of all kinds perform, whether those are simpler, more traditional machine learning models or deep learning models. What this means is that the prioritization that we as practitioners give to learning, teaching, and writing about feature engineering steps for text really contributes to better, more robust statistical practice in our field.

I mentioned before the sparsity of text data, and I want to come back to that, because it is one of text data's really defining characteristics. Because of how language works, we use a few words a lot of times and then a lot of words only a few times, and with a real set of natural language you end up with relationships that look like these plots in terms of how the sparsity changes as you add more documents and more unique words to a corpus. The sparsity goes up very fast as you add more unique words, and the memory required to handle that dataset of documents goes up very fast too. Even if you use specialized data structures meant to store sparse data, like sparse matrices, you still end up growing the memory required to handle these datasets in a nonlinear way; it still grows very fast (the small sketch below shows one way to measure this kind of sparsity). This means it can take a very long time to train your model, or even that you outgrow the memory available on your machine and have to go to an expensive, big-memory cloud situation. This can be a real challenge.
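Here is a small sketch of what measuring that sparsity can look like, casting a tidy set of word counts to a sparse matrix and computing the proportion of zero cells; the tiny `word_counts` tibble is hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy word counts: one row per document-word pair
word_counts <- tibble(
  doc_id = c(1, 1, 2, 2, 2),
  word   = c("peccary", "collared", "peccary", "mammal", "hoofed"),
  n      = c(2, 1, 1, 1, 1)
)

# Cast to a sparse matrix (rows are documents, columns are words)
sparse_dtm <- cast_sparse(word_counts, doc_id, word, n)

# Sparsity: the proportion of cells in the matrix that are zero
1 - Matrix::nnzero(sparse_dtm) / prod(dim(sparse_dtm))
```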
This challenge is what is behind the motivation for vector models of language. Linguists have worked for a long time on vector models for language that can reduce the number of dimensions representing text data, based on how people use language. This quote here goes all the way back to 1957. The idea is that, yes, the way we use language is very sparse, but we don't use words randomly; words are not used independently of each other. Rather, there are relationships between how words are used together, and we can use those relationships to transform our sparse, high-dimensional space into a special, dense, lower-dimensional space. Well, lower: we saw that it still has something like a hundred dimensions, but that is much lower than the many thousands, or hundreds of thousands, of dimensions we started with. Maybe we use statistical modeling, maybe just word counts plus matrix factorization, maybe fancier math that involves neural networks, to take this really high-dimensional space and create a new, lower-dimensional space that is special because it is built from vectors that incorporate information about which words are used together: you shall know a word by the company it keeps.

You need a big dataset of text to create, or learn, these kinds of word vectors or word embeddings. The table that I'm showing right now is from a set of embeddings that I created using a corpus of complaints to the United States Consumer Financial Protection Bureau. This is a government body in the United States where people can complain and say what is wrong with something to do with financial products, like a credit card, a mortgage, a student loan: something went wrong with my credit card, something went wrong with my mortgage, that company is not being fair to me. So I took all those complaints, took that high-dimensional sparse space, and built a low-dimensional space, and we can look in that space and understand which words are related to each other (a small sketch of that kind of lookup appears at the end of this passage). In the new space defined by the embeddings, the word "month" is closest to words like "year", "months", "monthly", "installments", "payment"; these are words that make sense in the context of financial products like credit cards or mortgages. In the same space, the word "error" is closest to words like "mistake", "clerical" (like a clerical mistake), "problem", "glitch" (oh, there was a glitch on my mortgage statement), "miscommunication", "misunderstanding": these are all words that are used in similar ways.

You don't have to create embeddings yourself, because it requires quite a lot of data to make them. You can use pre-trained word embeddings, i.e. embeddings created by someone else based on some huge corpus of data that they have access to and you probably don't. So let's look at one of those. This table shows the results for the same words using the GloVe embeddings. The GloVe embeddings are a set of pre-trained embeddings created from very large datasets, like all of Wikipedia and huge swaths of news text and the wider internet. Some of the closest words here are similar to those from before, but we no longer have some of that domain-specific flavor like "clerical" and "discrepancy"; instead we have words like "calculation" and "probability", which people were not talking about in their financial product complaints. This really highlights the difference: before, we created our own embeddings and were able to learn relationships that were specific to this context, and here we went to a more general set that was learned somewhere else.
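As a sketch of that kind of nearest-word lookup, here is how you might compute cosine similarity against a matrix of word vectors; the matrix here is filled with random numbers just so the example runs, so the neighbors are meaningless, but with real learned or pre-trained embeddings (one row per word) the closest words would be the kind of results shown on these slides.

```r
# Hypothetical embedding matrix: one row per word, one column per dimension;
# random values are used here only so that the sketch is runnable
set.seed(123)
vocab <- c("month", "year", "monthly", "payment", "error", "mistake", "glitch")
word_vectors <- matrix(rnorm(length(vocab) * 100), nrow = length(vocab),
                       dimnames = list(vocab, NULL))

# Cosine similarity between one word's vector and every row of the matrix
nearest_words <- function(embeddings, word, n = 5) {
  v <- embeddings[word, ]
  sims <- drop(embeddings %*% v) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(v^2)))
  names(sims) <- rownames(embeddings)
  sort(sims, decreasing = TRUE)[seq_len(n)]  # the word itself comes back first
}

nearest_words(word_vectors, "month")
```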
So embeddings are trained, or learned, from a large corpus of text data, and the characteristics of that corpus become part of the embeddings. Machine learning in general is exquisitely sensitive to whatever is in your training data, and this is never more obvious than when dealing with text data; word embeddings are perhaps one of the classic examples where this is true. It turns out that this shows up in how any human prejudice or bias in the corpus becomes imprinted into the embeddings. In fact, when we look at some of the most commonly available embeddings that are out there, we see bias: first names that are more common for African-Americans in the United States are associated with more unpleasant feelings than European-American first names in these embedding spaces; women's first names are more associated with family and men's first names are more associated with career; and terms associated with women are more associated with the arts, while terms associated with men are more associated with science. It turns out that bias is so ingrained in word embeddings that the embeddings themselves can be used to quantify change in social attitudes over time.

Word embeddings are maybe an exaggerated or extreme example, but it turns out that all the feature engineering decisions we make when it comes to text data have a significant effect on our results, both in terms of the model performance that we see and in terms of how appropriate or fair our models are. So given all that, when it comes to processing your text data and creating the features that you need, you have a lot of options and quite a bit of responsibility. My advice is always to start with simpler models that you can understand quite deeply, to be sure to adopt good statistical practices as you train and tune your models so you aren't fooled about model performance improvements when you try different approaches, and also to use model explainability tools and frameworks so you can understand any less straightforward models that you try. My co-workers and I have written about all of these topics and how to approach them with tidymodels, if that's what you like to use, and we will continue to do so. With that, I will say thank you so very much, and I want to again thank the organizers of the R user group in Korea, as well as my teammates on the tidymodels team at RStudio and my co-author Emil Hvitfeldt.