[python] 20. scraper


jinkwon.kim 2018. 11. 27. 21:33


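1. Web scraper

- Takes a domain name (URL) and fetches the HTML data

- Parses the data to extract the information you want

- Stores the extracted information

- Repeats the process on other pages if needed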
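
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a complete crawler: the names fetch_html, extract_links, scrape and max_pages are hypothetical, the "wanted information" is assumed to be link targets, and the built-in 'html.parser' is used so the sketch runs without lxml installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: download the HTML for a URL."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Step 2: parse the HTML and pull out the wanted data (here: link targets)."""
    # Assumption: 'html.parser' is used here; 'lxml' works the same way if installed.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def scrape(start_url, max_pages=3, fetch=fetch_html):
    """Steps 3 and 4: store the results and repeat on other pages."""
    results = {}          # url -> links found on that page
    queue = [start_url]
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in results:
            continue      # already visited
        links = extract_links(fetch(url))
        results[url] = links
        # Only follow absolute http(s) links to keep the sketch simple.
        queue.extend(link for link in links if link.startswith("http"))
    return results
```

Passing a different fetch function (for example one that reads saved HTML files) lets the parsing and looping logic be tried out without touching the network.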

2. Required modules (all third-party modules)

- requests

- BeautifulSoup4 (핵심)

- lxml

> Very capable HTML parser

3. BeautifulSoup4

- Parses HTML (or XML) and returns it as Python objects that are easy to work with.

- Repairs malformed HTML and returns the corrected markup.
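
As a small illustration of the repair behavior, the sketch below feeds an unclosed fragment to BeautifulSoup. The variable names and the sample markup are made up for this example, and the built-in 'html.parser' is assumed; other parsers such as lxml also repair the markup but may wrap it in additional html and body tags.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: neither <p> nor <b> is ever closed.
broken_html = "<p>hello <b>world"

# Assumption: the built-in parser is used so no extra install is needed.
soup = BeautifulSoup(broken_html, "html.parser")

# Converting the soup back to a string shows the repaired markup.
repaired = str(soup)
print(repaired)  # <p>hello <b>world</b></p>
```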

1) Installation

- pip install beautifulsoup4

2) How to import

- from bs4 import BeautifulSoup

3) Usage

(1) Getting a soup object

- In the code below, BeautifulSoup(string, 'lxml') returns a soup object.

- The first parameter is the variable holding the markup string; the second is the parser to use.

from bs4 import BeautifulSoup

string = "<p>hellow world</p> <div>asdfasdf</div>"

soup_string = BeautifulSoup(string, 'lxml')

(2) Attributes supported by the soup object

- Tag object

> soup_object.tag_name

Ex)

p_tag = soup_string.p

print(p_tag.prettify())

Result:

<p>
 hellow world
</p>

- NavigableString object
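
The source only names the NavigableString object, so the following is a hedged sketch of how it is commonly reached: the .string attribute of a tag that contains a single text node. The sample markup and variable names are assumptions for illustration, and 'html.parser' is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: sample markup and the built-in parser, for illustration only.
soup = BeautifulSoup("<p>hello world</p>", "html.parser")

text = soup.p.string          # the text node inside <p>
print(type(text).__name__)    # NavigableString
print(isinstance(text, str))  # True: NavigableString subclasses str

# Convert to a plain str when the text should outlive the parse tree.
plain = str(text)
print(plain)                  # hello world
```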

4) Functions supported by soup

(1) find(tag, attribute)

- soup_object.find('div')

> Finds only the first div

- soup_object.find(id='giftlist')

> Returns the tag whose id is giftlist.
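
A short sketch of both find() calls follows. The HTML snippet is invented for this example (only the id giftlist comes from the text above), and 'html.parser' is assumed so the code runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two divs.
html = """
<div id="header">Header</div>
<div id="giftlist">Gift list</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")      # first <div> in document order
gift = soup.find(id="giftlist")   # tag whose id attribute is "giftlist"
missing = soup.find("table")      # nothing matches

print(first_div["id"])   # header
print(gift.get_text())   # Gift list
print(missing)           # None
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.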

(2) find_all(tag)

- Returns all matching tags as a list.
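
For comparison with find(), here is a minimal find_all() sketch. The list markup and variable names are assumptions made for this example, and 'html.parser' is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several matching tags.
html = "<ul><li>apple</li><li>banana</li><li>cherry</li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")                  # every <li>, in document order
names = [li.get_text() for li in items]
print(len(items))  # 3
print(names)       # ['apple', 'banana', 'cherry']

# When nothing matches, the result is an empty list rather than None.
empty = soup.find_all("table")
print(len(empty))  # 0
```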

(3) select("CSS selector syntax")

- Fetches tags using CSS selector syntax.

- The result is returned as a list.
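
A hedged sketch of select() with a combined id and class selector is shown below. The markup, the selector and the variable names are illustrative assumptions; select_one() is a related Beautiful Soup method not covered in the text above, included only to show how to get a single match.

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup used only for this example.
html = """
<div id="menu">
  <a class="item" href="/home">Home</a>
  <a class="item" href="/about">About</a>
</div>
<a href="/other">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

# All <a class="item"> tags that sit inside the element with id="menu".
links = soup.select("div#menu a.item")
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/home', '/about']

# select_one() returns only the first match (or None if there is none).
first = soup.select_one("a.item")
print(first.get_text())  # Home
```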
