Trying out Phi-3.5-vision (an AI with eyes?)


I heard Microsoft had released a good model, so I gave it a try.

https://www.aipostkorea.com/news/articleView.html?idxno=2220

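Microsoft's small language model 'Phi-3' gets eyes: the AI can look at charts, graphs, and more, then answer

Microsoft (MS) has added a vision capability to Phi-3, the small language model (SLM) it released last month. Given an image, the AI can now answer in natural language. Personal devices such as laptops and smartphones...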

I tested with the most recently released model.

https://huggingface.co/microsoft/Phi-3.5-vision-instruct/tree/main


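Flash attention apparently has to be installed for the model to run fast, but the install takes forever.

On top of that, my CUDA version did not match. I reinstalled CUDA 12.4 and then reinstalled PyTorch to match 12.4.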
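A mismatch like this can be spotted before starting a long flash-attn build. The sketch below is my own helper, not something from the model card: the name `describe_env` is hypothetical. It reports which CUDA version the installed PyTorch build targets and whether the `flash_attn` package is importable, then suggests an attention setting. The value `flash_attention_2` is assumed to be the setting used when flash-attn is present; `eager` is the fallback named in the note quoted in this post.

```python
import importlib.util


def describe_env():
    """Report the PyTorch/CUDA/flash-attn state of the current Python environment."""
    info = {
        "torch_version": None,
        "torch_cuda_build": None,  # CUDA version PyTorch was compiled against
        "cuda_available": False,
        # True only if the flash_attn package can be found on the import path
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; leave the torch fields at their defaults
        pass
    else:
        info["torch_version"] = torch.__version__
        info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    # Assumption: fall back to 'eager' whenever flash_attn is missing
    info["suggested_attn"] = (
        "flash_attention_2" if info["flash_attn_installed"] else "eager"
    )
    return info


if __name__ == "__main__":
    for key, value in describe_env().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` does not match the installed CUDA toolkit, that is probably the first thing to fix before building flash-attn.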

So for now I followed the note below and changed the setting before running it.

# Note: set _attn_implementation='eager' if you don't have flash_attn installed

from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="cuda",
  trust_remote_code=True,
  torch_dtype="auto",
  _attn_implementation='eager'
)

# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id,
  trust_remote_code=True,
  num_crops=4
)

images = []
placeholder = ""

# Note: if OOM, you might consider reduce number of frames in this example.
# for i in range(1,20):
#     url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
#     images.append(Image.open(requests.get(url, stream=True).raw))
#     placeholder += f"<|image_{i}|>\n"

image_url = "http://www.conslove.co.kr/news/photo/202110/70961_209640_2032.jpg"
images.append(Image.open(requests.get(image_url, stream=True).raw))
placeholder = "<|image_1|>\n"

messages = [
    # The question is in Korean: "Please describe this image."
    {"role": "user", "content": placeholder+"이 이미지에 대해 설명 해 주세요."},
]

prompt = processor.tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True
)
print(prompt)
inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

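generation_args = {
    "max_new_tokens": 1000,
    # Greedy decoding: with do_sample=False, temperature is ignored
    # (and transformers warns about it), so it is not set here.
    "do_sample": False,
}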

generate_ids = model.generate(**inputs,
  eos_token_id=processor.tokenizer.eos_token_id,
  **generation_args
)

# remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
  skip_special_tokens=True,
  clean_up_tokenization_spaces=False)[0]

print(response)

I changed the example to use a single image, grabbed a random poster from the internet, and asked the model to describe it.
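The only thing that changes between the single-image case and the commented-out multi-frame loop is the placeholder string. A minimal sketch of that pattern, assuming the `<|image_N|>` tags are numbered from 1 as in the code above (`build_placeholders` is a hypothetical helper name of mine):

```python
def build_placeholders(num_images):
    """Return the image tags for num_images images, numbered from 1."""
    if num_images < 1:
        raise ValueError("need at least one image")
    # One tag per image, each followed by a newline, as in the example code
    return "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))


single = build_placeholders(1)
multi = build_placeholders(3)
print(repr(single))
print(repr(multi))
```

The number of tags should match the number of entries in `images`.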

This was the image.

It answered in English, so I translated the answer into Korean. The English answer was:

The image is a colorful illustration related to the Mid-Autumn Festival. It features a large banner with Korean text that translates to 'Mid-Autumn Festival 2021' and the date range '2021.10.5(Sat) - 11.7(Sun)'. Below the banner, there is a scene with a person walking on a path, a gazebo, and autumn leaves scattered around. The background shows a serene landscape with a lake, a full moon, and a clear sky. The overall theme of the image is festive and celebratory, with a focus on the traditional aspects of the Mid-Autumn Festival.

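It got part of the date wrong, but it recognized the content better than I expected.

It is a pity that it does not handle Korean well.

I looked for a way to get it to answer in Korean.

I asked again as below. The Korean question means "Tell me the venue and schedule of the event in this poster."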

<|user|>
<|image_1|>
이 포스터 안에 있는 행사의 장소와 일정을 알려줘.<|end|>
<|assistant|>

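포스터에는 2021.10.15(화) - 11.7(월) 동안 축제가 열리는 장소와 일정이 있습니다.

그러나 포스터에 언급된 특정 장소나 일정은 없습니다.

(Translation: "The poster shows the venue and schedule of a festival held 2021.10.15 (Tue) - 11.7 (Mon). However, no specific venue or schedule is mentioned on the poster.")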
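The prompt printed above has a simple layout: a user tag, the image tags, the question, an end tag, and an assistant tag. The sketch below rebuilds that string by hand, purely to make the layout visible. It is reconstructed from the printed output in this post, not from the tokenizer's template, and `build_prompt` is a hypothetical name; real code should keep using `processor.tokenizer.apply_chat_template`.

```python
def build_prompt(question, num_images=1):
    """Rebuild the printed prompt layout for one user turn with images."""
    # Image tags come first, one per line, numbered from 1
    image_tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    # The user turn ends with <|end|>; the trailing assistant tag cues the reply
    return f"<|user|>\n{image_tags}{question}<|end|>\n<|assistant|>\n"


print(build_prompt("이 포스터 안에 있는 행사의 장소와 일정을 알려줘."))
```

Comparing this string with the output of `print(prompt)` is a quick way to confirm that the image tag ended up where the model expects it.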

This time it read the numbers correctly, but it could not work out the venue.
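Differences like `10.5` versus `10.15` are easy to miss by eye. As a rough sketch, the dotted number groups can be pulled out of both answers and compared side by side. The regular expression and the helper name `extract_dates` are my own assumptions and only cover the date style seen in these two answers.

```python
import re

# Matches dotted number groups such as "2021.10.5" or "11.7"
DATE_TOKEN = re.compile(r"\d+(?:\.\d+)+")


def extract_dates(text):
    """Return every dotted number group found in text, in order."""
    return DATE_TOKEN.findall(text)


# Excerpts from the two model answers quoted in this post
first_answer = "the date range '2021.10.5(Sat) - 11.7(Sun)'"
second_answer = "포스터에는 2021.10.15(화) - 11.7(월) 동안 축제가 열리는 장소와 일정이 있습니다."

first_dates = extract_dates(first_answer)
second_dates = extract_dates(second_answer)

for a, b in zip(first_dates, second_dates):
    status = "same" if a == b else "DIFFERENT"
    print(f"{a} vs {b}: {status}")
```

This only shows where two answers disagree; deciding which one is right still means checking the poster itself.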

Next, let's try reading a table from a news article.

https://n.news.naver.com/mnews/article/032/0003317317?sid=101

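SK Innovation and SK E&S merger clears shareholder meetings: birth of a 100-trillion-won energy company comes into view

The merger of SK Innovation and SK E&S has passed both companies' shareholder meetings. Once it clears the appraisal-rights issue, the final hurdle for the merger, a mega energy company with assets of 100 trillion won will be born this November. SK Inno...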

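The image looks like this, and I asked the question below.

And the answer??

Even though it is a lightweight model running on a PC (RTX 3060 Ti), it performed better than I expected.

Looks like I will be able to make good use of it. THX

For how to install Flash Attention 2 on Windows, see the link below.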

https://devmeta.tistory.com/101

Running Flash Attention 2 on Windows (torch 2.1 / cuda 12.1)

I am trying to run Flash Attention 2, which is said to be faster than PyTorch's default attention, on Windows. https://dajeblog.co.kr/flashattention-v2-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-%EA%B8%B0%EC%A1%B4-attention%EB%B3%B4%EB%8B%A4-59%EB%B0%B0-%
