python html.parser example: crawling images from a URL

September 23, 2020

I needed an HTML parser in Python and went looking for material. As a test, I wrote some code that crawls image files from a URL.

Fetching a URL's HTML in Python

With Python's urllib.request, you can easily fetch the HTML of a given URL.

https://docs.python.org/3.7/library/urllib.request.html#urllib.request.Request

import urllib.request
with urllib.request.urlopen('https://ryanclaire.blogspot.com') as f:
    print(f.read(100).decode('utf-8'))
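
The snippet above assumes the page is encoded as UTF-8. As a minimal sketch, the charset declared in the response's Content-Type header could be used instead, with UTF-8 as a fallback. The helper names `decode_body` and `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of urllib.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    # errors="replace" substitutes U+FFFD for bytes that are invalid in the charset.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as f:
        # get_content_charset() returns None if the server declared no charset.
        charset = f.headers.get_content_charset()
        return decode_body(f.read(), charset)
```

Calling `fetch_html('https://ryanclaire.blogspot.com')` would then return the whole page as a string, ready to be passed to a parser.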

Python's html.parser

html.parser.HTMLParser is a class for parsing text formatted in HTML and XHTML.

https://docs.python.org/3.7/library/html.parser.html

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

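When you implement a class that inherits from HTMLParser as above, handle_starttag and handle_endtag are called at the start and end of each tag, and handle_data is called with the text in between, so your code can see where each tag starts and ends and what it contains.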
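
To see what the handlers actually receive, here is a small sketch that records each callback instead of printing it. The class name `EventRecorder` and the sample HTML are made up for this illustration. Note that `attrs` arrives as a list of (name, value) pairs, and an attribute written without a value has `None` as its value.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: stores every callback as a tuple in self.events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [("href", "/pic.png")]
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip chunks that contain only whitespace.
        if data.strip():
            self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed('<p>Hello <a href="/pic.png">picture</a></p><img src="a.jpg" alt>')
recorder.close()

for event in recorder.events:
    print(event)
```

The `img` tag produces a start event only, since it has no closing tag in this input.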

Image crawling code

The full code is below. It picks out URLs with an image file extension from the 'img' and 'a' tags in the HTML, stores them in a list, then downloads each URL in the list and saves it to a file.

import urllib.request
import urllib.error
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers do not share results.
        self.img_links = []

    def isIMGPath(self, path):
        # Attributes without a value (e.g. <img alt>) arrive as None.
        if path is None:
            return False
        imgtype = ('.JPG', '.PNG', '.BMP', '.GIF')
        # Case-insensitive check of the file extension.
        return path.upper().endswith(imgtype)

    def handle_starttag(self, tag, attrs):
        if tag != 'a' and tag != 'img':
            return
        # attrs is a list of (name, value) pairs.
        for attr in attrs:
            if self.isIMGPath(attr[1]):
                self.img_links.append(attr[1])

if __name__ == "__main__":
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    url = 'https://ryanclaire.blogspot.com/2020/09/linux-esp8266-client-tcp-udp.html'

    with opener.open(url) as f:
        parser = MyHTMLParser()
        parser.feed(f.read().decode())

    for x in parser.img_links:
        filename = x.split('/')[-1]
        print('url:', x)
        # Links without a scheme are assumed to be protocol-relative ("//host/path").
        if not x.startswith('http'):
            x = 'https:' + x
        try:
            imgurl = urllib.request.urlopen(x)
        except urllib.error.URLError:
            print('URLError :', x)
            continue

        # Save the response body under the last path segment of the URL.
        with imgurl, open(filename, 'wb') as imgf:
            imgf.write(imgurl.read())
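
The link-collection step can be tried without any network access by feeding the parser an HTML string. The sketch below is a variation, not the code from this post: it checks the extension on the path part of the URL only, so a link with a query string such as `?w=1600` still matches. The names `is_image_url`, `ImageLinkCollector`, and `collect_image_links`, and the example.com URLs, are assumptions for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# The same four image types checked in the post, as lowercase extensions.
IMAGE_EXTENSIONS = {".jpg", ".png", ".bmp", ".gif"}


def is_image_url(value):
    """Return True if the URL's path ends with a known image extension."""
    if not value:  # None (valueless attribute) or empty string
        return False
    # urlparse separates the path from any query string or fragment.
    ext = os.path.splitext(urlparse(value).path)[1].lower()
    return ext in IMAGE_EXTENSIONS


class ImageLinkCollector(HTMLParser):
    """Collects image-like attribute values from 'a' and 'img' tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "img"):
            return
        for _name, value in attrs:
            if is_image_url(value):
                self.links.append(value)


def collect_image_links(html):
    collector = ImageLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = (
    '<a href="//cdn.example.com/full.PNG?w=1600">'
    '<img src="//cdn.example.com/thumb.png" alt></a>'
    '<a href="/about.html">about</a>'
    '<p title="not-a-link.gif">text</p>'
)
print(collect_image_links(sample))
```

Only the two values from the first `a` and `img` tags are collected; the `p` tag is ignored even though its attribute ends in an image extension.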

Running the crawler downloads the image files from the blog post, as expected.
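
The crawler turns scheme-less links into full URLs by prepending 'https:', which fits protocol-relative links of the form `//host/path`. A more general option, sketched here as an assumption rather than what the post's code does, is `urllib.parse.urljoin`, which also resolves relative paths against the page URL. The helper names `resolve_link` and `filename_from_url`, the default file name, and the example.com URLs are hypothetical.

```python
import os
from urllib.parse import unquote, urljoin, urlparse


def resolve_link(page_url, link):
    """Turn an absolute, protocol-relative, or relative link into a full URL."""
    return urljoin(page_url, link)


def filename_from_url(url, default="image"):
    """Use the last path segment as the file name, ignoring any query string."""
    name = os.path.basename(unquote(urlparse(url).path))
    # Fall back to a placeholder name when the path ends with a slash.
    return name or default


page = "https://example.com/2020/09/post.html"
for link in ["//cdn.example.com/img/a.png", "images/b.jpg", "/c.gif?size=large"]:
    full = resolve_link(page, link)
    print(full, "->", filename_from_url(full))
```

Taking the file name from the parsed path rather than from `split('/')` keeps query strings such as `?size=large` out of the saved file name.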

ry.cl. blog