Downloading data and pretrained models from the Hugging Face website can greatly speed up deep-learning model development. Unless you configure it otherwise, Hugging Face stores datasets and pretrained models on Windows in the following directory:
C:\Users\사용자명\.cache\huggingface\datasets\
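The resolved location can also be checked from Python. This is a minimal sketch, assuming a datasets 2.x-era release in which the datasets.config module exposes HF_DATASETS_CACHE:

from datasets import config

# Default cache root used by the datasets library
print(config.HF_DATASETS_CACHE)  # e.g. C:\Users\<user>\.cache\huggingface\datasets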
Let's survey the datasets available through the datasets package from Python.
# Keep downloaded datasets from being deleted (disable caching)
from datasets import disable_caching
from datasets import load_dataset
disable_caching()
from datasets import list_datasets
datasets_list = list_datasets()
len(datasets_list)
7908
Because the iris dataset is the best-known dataset in machine learning, we download iris from the datasets listed by the datasets package and put it to use in a machine-learning analysis.
Using the reticulate package, we pull the Python object into R and look up the iris dataset.
library(reticulate)
library(tidyverse)
py$datasets_list %>%
  enframe() %>%
  filter(str_detect(value, "iris"))
# A tibble: 2 × 2
name value
<int> <chr>
1 4381 Gifted/iris
2 5676 scikit-learn/iris
After converting the iris data to a pandas DataFrame, we print it to confirm the data is ready for machine learning.
iris = load_dataset('scikit-learn/iris',
                    download_mode="reuse_dataset_if_exists",
                    cache_dir='z:\dataset')
print(iris)
iris_pd = iris['train'].to_pandas()
iris_pd
Id SepalLengthCm ... PetalWidthCm Species
0 1 5.1 ... 0.2 Iris-setosa
1 2 4.9 ... 0.2 Iris-setosa
2 3 4.7 ... 0.2 Iris-setosa
3 4 4.6 ... 0.2 Iris-setosa
4 5 5.0 ... 0.2 Iris-setosa
.. ... ... ... ... ...
145 146 6.7 ... 2.3 Iris-virginica
146 147 6.3 ... 1.9 Iris-virginica
147 148 6.5 ... 2.0 Iris-virginica
148 149 6.2 ... 2.3 Iris-virginica
149 150 5.9 ... 1.8 Iris-virginica
[150 rows x 6 columns]
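Since the stated goal was to use the downloaded data for a machine-learning analysis, the sketch below carries out that step on the pandas frame. It assumes scikit-learn is installed; the column names come from the printout above:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# The four measurement columns are the features; Species is the target
X = iris_pd.drop(columns=['Id', 'Species'])
y = iris_pd['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=777)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
clf.score(X_test, y_test)  # accuracy on the held-out rows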
In the same way we download the question-answering dataset (squad) and the imdb movie-review data, designating a (Synology) NAS directory as the storage location with cache_dir='z:\dataset'.
squad = load_dataset('squad',
                     download_mode="reuse_dataset_if_exists",
                     cache_dir='z:\dataset')
imdb = load_dataset('imdb',
                    download_mode="reuse_dataset_if_exists",
                    cache_dir='z:\dataset')
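As a quick sanity check on what was fetched: each SQuAD training example carries id, title, context, question, and answers fields, which can be inspected directly.

# Split sizes and one training example
print(squad)
print(squad['train'][0]['question'])
print(squad['train'][0]['answers'])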
Classifying cat and dog images is the classic first deep-learning model, so we download a dataset to build a classifier that separates the two well. First we search for datasets containing cats … and download the dataset named cats_vs_dogs.
py$datasets_list %>%
  enframe() %>%
  filter(str_detect(value, "cats"))
# A tibble: 17 × 2
name value
<int> <chr>
1 78 cats_vs_dogs
2 1808 hf-internal-testing/cats_vs_dogs_sample
3 2063 huggingface/cats-image
4 2431 nateraw/auto-cats-and-dogs
5 2437 nateraw/cats-and-dogs
6 2438 nateraw/cats_vs_dogs
7 2458 ncats/EpiSet4BinaryClassification
8 2459 ncats/EpiSet4NER-v1
9 2460 ncats/GARD_EpiSet4TextClassification
10 3518 huggan/cats
11 3963 XiangPan/online_shopping_10_cats_62k
12 4250 davanstrien/human_cats
13 4707 ncats/EpiSet4NER-v2
14 5495 nateraw/cats-and-dogs-with-metadata
15 5497 nateraw/cats-and-dogs-metadata
16 5595 efederici/cats_vs_dogs
17 7649 n6L3/online_shopping_10_cats
Alternatively, you can search Hugging Face - Datasets and download the dataset with the most downloads.
cats_vs_dogs = load_dataset('hf-internal-testing/cats_vs_dogs_sample',
                            download_mode="reuse_dataset_if_exists",
                            cache_dir='z:\dataset')

cats_vs_dogs['train'].to_pandas()
image labels
0 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
1 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
2 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
3 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
4 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
5 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
6 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
7 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
8 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
9 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
10 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
11 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
12 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
13 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
14 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
15 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
16 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
17 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
18 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
19 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
20 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
21 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
22 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
23 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
24 {'bytes': None, 'path': 'z:\dataset\downloads\... 0
25 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
26 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
27 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
28 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
29 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
30 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
31 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
32 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
33 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
34 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
35 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
36 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
37 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
38 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
39 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
40 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
41 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
42 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
43 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
44 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
45 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
46 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
47 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
48 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
49 {'bytes': None, 'path': 'z:\dataset\downloads\... 1
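The image column stores file references rather than raw bytes, and a row is decoded only when indexed. The sketch below peeks at one example; the meaning of the 0/1 labels is read from the dataset's own features rather than assumed:

# Label names (the cat/dog ordering) are recorded in the dataset features
print(cats_vs_dogs['train'].features)

example = cats_vs_dogs['train'][0]
print(example['labels'])
example['image']  # decoded to a PIL image on access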
We also download the mnist dataset, the most popular in deep learning, and store it on the NAS. In every case the data is saved in files with the .arrow extension.
mnist = load_dataset('mnist',
                     download_mode="reuse_dataset_if_exists",
                     cache_dir='z:\dataset')
mnist
DatasetDict({
train: Dataset({
features: ['image', 'label'],
num_rows: 60000
})
test: Dataset({
features: ['image', 'label'],
num_rows: 10000
})
})
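To confirm the .arrow claim, the NAS cache directory can be scanned for Arrow files; a minimal sketch:

from pathlib import Path

# List every Arrow cache file written under the NAS directory
for arrow_file in Path('z:/dataset').rglob('*.arrow'):
    print(arrow_file)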
There are broadly three ways to specify the Hugging Face cache location:
# 1. Set the environment variable from Python
import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'

# 2. Set the environment variable in the shell
export TRANSFORMERS_CACHE=/blabla/cache/

# 3. Pass cache_dir when loading a tokenizer or model
tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
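One caveat: the environment variable only takes effect if it is set before transformers is imported, because the cache path is resolved at import time. A hedged sketch, assuming a transformers 4.x release in which transformers.utils exposes the resolved constant:

import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'  # must run before the import below

from transformers.utils import TRANSFORMERS_CACHE
print(TRANSFORMERS_CACHE)  # the cache directory transformers will actually use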
Following the material in the Hugging Face Course, let's look at this in more detail. We analyze the sentiment of sentences with the pretrained distilbert-base-uncased-finetuned-sst-2-english model, fetched from 'z:/dataset/hf', the Hugging Face cache directory designated above.
# ! pip install datasets transformers[sentencepiece]
import os
os.environ['TRANSFORMERS_CACHE'] = 'z:/dataset/hf'
from transformers import pipeline
= pipeline("sentiment-analysis")
classifier
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
classifier(
["I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
"We are very happy to show you the 🤗 Transformers library.",
"We hope you don't hate it."
])
[{'label': 'POSITIVE', 'score': 0.9598049521446228}, {'label': 'NEGATIVE', 'score': 0.9994558691978455}, {'label': 'POSITIVE', 'score': 0.9997795224189758}, {'label': 'NEGATIVE', 'score': 0.5308623313903809}]
Going one step further, you can assemble a pipeline from a separately specified tokenizer and model and then gauge the sentiment of each sentence.
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
= "nlptown/bert-base-multilingual-uncased-sentiment"
model_name = AutoTokenizer.from_pretrained(model_name, cache_dir="z:/dataset/hf")
tokenizer
= "nlptown/bert-base-multilingual-uncased-sentiment"
model_name = AutoModelForSequenceClassification.from_pretrained(model_name,
pt_model ="z:/dataset/hf")
cache_dir
= pipeline("sentiment-analysis",
classifier = pt_model,
model = tokenizer)
tokenizer
classifier(
["지금 기분이 좋습니다.",
"나는 행복합니다",
"무척이나 슬프고 서럽습니다",
"영화가 그저그렇다."
]
)
# [{'label': '4 stars', 'score': 0.3802778422832489}, {'label': '5 stars', 'score': 0.5900374054908752}, {'label': '1 star', 'score': 0.5471497774124146}, {'label': '3 stars', 'score': 0.3373289406299591}]
Which topic a given sentence falls under can be classified with the facebook/bart-large-mnli pretrained model; see facebook/bart-large-mnli for details.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli',
                                                               cache_dir="z:/dataset/hf")

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli',
                                          cache_dir="z:/dataset/hf")
classifier = pipeline('zero-shot-classification',
                      model=nli_model,
                      tokenizer=tokenizer)
= "one day I will see the world"
sequence_to_classify = ['travel', 'cooking', 'dancing']
candidate_labels =True)
classifier(sequence_to_classify, candidate_labels, multi_label# {'sequence': 'one day I will see the world', 'labels': ['travel', 'dancing', 'cooking'], 'scores': [0.994511067867279, 0.005706145893782377, 0.0018192846328020096]}
Text can be generated with gpt2.
from transformers import pipeline, set_seed
from transformers import GPT2Tokenizer, GPT2LMHeadModel
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2', cache_dir="z:/dataset/hf")
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir="z:/dataset/hf")
generator = pipeline('text-generation',
                     tokenizer=gpt2_tokenizer,
                     model=gpt2_model)
set_seed(777)

generator("The White man worked as a", max_length=12, num_return_sequences=3)
# [{'generated_text': 'The White man worked as a janitor, a regular office'}, {'generated_text': 'The White man worked as a sales rep for the paper.'}, {'generated_text': 'The White man worked as a contractor for Waffle House in'}]

generator("In this course, we will teach you how to", max_length=20, num_return_sequences=3)
# [{'generated_text': 'In this course, we will teach you how to create a reusable, self-sustaining app'}, {'generated_text': 'In this course, we will teach you how to create a perfect web application.'}, {'generated_text': 'In this course, we will teach you how to use both the HN and the NN channels'}]
The BertForMaskedLM model can be used to fill in [MASK] blanks.
from transformers import pipeline, set_seed
from transformers import BertTokenizer, BertForMaskedLM
fill_mask_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', cache_dir="z:/dataset/hf/mask")
fill_mask_model = BertForMaskedLM.from_pretrained("bert-base-uncased", cache_dir="z:/dataset/hf/mask")
unmasker = pipeline('fill-mask',
                    tokenizer=fill_mask_tokenizer,
                    model=fill_mask_model)
"The White man worked as a [MASK].", top_k = 3)
unmasker(# [{'score': 0.16597844660282135, 'token': 10533, 'token_str': 'c a r p e n t e r', 'sequence': 'the white man worked as a carpenter.'}, {'score': 0.09424988180398941, 'token': 7500, 'token_str': 'f a r m e r', 'sequence': 'the white man worked as a farmer.'}, {'score': 0.07112522423267365, 'token': 20987, 'token_str': 'b l a c k s m i t h', 'sequence': 'the white man worked as a blacksmith.'}]
"The black woman worked as a [MASK].", top_k = 3)
unmasker(# [{'score': 0.21546946465969086, 'token': 6821, 'token_str': 'n u r s e', 'sequence': 'the black woman worked as a nurse.'}, {'score': 0.19593699276447296, 'token': 13877, 'token_str': 'w a i t r e s s', 'sequence': 'the black woman worked as a waitress.'}, {'score': 0.09739767760038376, 'token': 10850, 'token_str': 'm a i d', 'sequence': 'the black woman worked as a maid.'}]
"The Asian worked as a [MASK].", top_k = 3)
unmasker(# [{'score': 0.12938520312309265, 'token': 7500, 'token_str': 'f a r m e r', 'sequence': 'the asian worked as a farmer.'}, {'score': 0.09072186052799225, 'token': 3836, 'token_str': 't e a c h e r', 'sequence': 'the asian worked as a teacher.'}, {'score': 0.039076220244169235, 'token': 10533, 'token_str': 'c a r p e n t e r', 'sequence': 'the asian worked as a carpenter.'}]
"The Korean worked as a [MASK].", top_k = 3)
unmasker(# [{'score': 0.11826611310243607, 'token': 7500, 'token_str': 'f a r m e r', 'sequence': 'the korean worked as a farmer.'}, {'score': 0.09532739222049713, 'token': 3836, 'token_str': 't e a c h e r', 'sequence': 'the korean worked as a teacher.'}, {'score': 0.04381617531180382, 'token': 5160, 'token_str': 'l a w y e r', 'sequence': 'the korean worked as a lawyer.'}]
Question answering can be implemented with the deepset/roberta-base-squad2 model, which is based on the SQuAD2.0 dataset.
from transformers import pipeline
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
= AutoTokenizer.from_pretrained("deepset/roberta-base-squad2",
qna_tokenizer ="z:/dataset/pretrained/")
cache_dir
= AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2",
qna_model ="z:/dataset/pretrained/")
cache_dir
qNa = pipeline('question-answering',
               model=qna_model,
               tokenizer=qna_tokenizer)
paragraph = '''
A new study estimates that if the US had universally mandated masks on 1 April, there could have been nearly 40% fewer deaths by the start of June. Containment policies had a large impact on the number of COVID-19 cases and deaths, directly by reducing transmission rates and indirectly by constraining people’s behaviour. They account for roughly half the observed change in the growth rates of cases and deaths.
'''
qNa({'question': 'Which country is this article about?',
     'context': f'{paragraph}'})
# {'score': 0.019898569211363792, 'start': 35, 'end': 37, 'answer': 'US'}
qNa({'question': 'Which disease is discussed in this article?',
     'context': f'{paragraph}'})
# {'score': 0.00024747499264776707, 'start': 206, 'end': 214, 'answer': 'COVID-19'}
The Helsinki-NLP/opus-mt-ko-en model takes Korean input and translates it into English.
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
= AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en",
translator_tokenizer ="z:/dataset/hf/")
cache_dir
= AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ko-en",
translator_model ="z:/dataset/hf/")
cache_dir
= pipeline("translation",
translator = translator_tokenizer,
tokenizer = translator_model)
model
"대한민국 축구가 드디어 독일을 격파했습니다.")
translator(# [{'translation_text': 'South Korean football finally destroyed Germany.'}]
"President Yoon did a great work in Korean history")
translator(# [{'translation_text': 'President Yoon did a great job in Coran history'}]
We summarize a paragraph with the facebook/bart-large-cnn model, which is based on the CNN Daily Mail data.
from transformers import pipeline
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
= AutoTokenizer.from_pretrained("facebook/bart-large-cnn",
summary_tokenizer ="z:/dataset/hf/")
cache_dir
= AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn",
summary_model ="z:/dataset/hf/")
cache_dir
= pipeline("summarization",
summarizer = summary_model,
model = summary_tokenizer)
tokenizer
paragraph = '''
A new study estimates that if the US had universally mandated masks on 1 April, there could have been nearly 40% fewer deaths by the start of June. Containment policies had a large impact on the number of COVID-19 cases and deaths, directly by reducing transmission rates and indirectly by constraining people’s behaviour. They account for roughly half the observed change in the growth rates of cases and deaths.
'''
summarizer(paragraph, max_length=50, min_length=30, do_sample=False)
[{'summary_text': 'Study estimates that if the US had universally mandated masks on 1 April, there could have been nearly 40% fewer deaths by the start of June. Containment policies had a large impact on the number of COVID-19 cases and deaths'}]