[Crawler] Python Web Crawling (Basics) 2 - beautifulsoup


BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from specific parts of them, which makes it a common tool for web scraping. It parses the document and exposes methods and attributes for reaching its elements, so you can pull out the information you want or walk the document's structure with little effort.

pip install beautifulsoup4

Run the command above to install the beautifulsoup library.

Once installation is complete, let's look at how beautifulsoup is used through a simple example.

import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request to the Naver server
response = requests.get("http://www.naver.com/")

# The body of Naver's response is the page's HTML
html = response.text

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Fetch the single element whose id is shortcutArea
word = soup.select_one("#shortcutArea")
print(word)

This example sends an HTTP GET request to Naver's web page with the requests library and parses the response with BeautifulSoup.

<div class="shortcut_area" id="shortcutArea" role="region"></div>

Execution result

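response = requests.get("http://www.naver.com/") - Sends a GET request to Naver's web page and stores the server's response in the response variable.
html = response.text - Extracts the HTML content of the response through the text attribute and stores it in the html variable.
soup = BeautifulSoup(html, 'html.parser') - Parses the HTML document with BeautifulSoup. The first parameter is the HTML document to parse (the html variable in this example) and the second specifies which parser to use. Here we use 'html.parser', the parser built into Python's standard library.
word = soup.select_one("#shortcutArea") - Uses a CSS selector on the soup object to pick the element we want. Here it selects the element whose id is "shortcutArea" and stores it in the word variable.
print(word) - Prints the selected element.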

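In short, this code is a simple example that finds the element with id "shortcutArea" on Naver's web page and prints it. It shows the basic scraping workflow: select a specific element with a CSS selector, then extract its content or attributes. (📣 Please note that Naver's page structure may have changed, so the element above may no longer exist for readers following along. In that case select_one returns None and the script prints None.)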
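Since the element may be missing on the live page, it helps to guard against a None result before using it. The following is a minimal sketch, not part of the original example: extract_by_id is a hypothetical helper name, and an inline HTML string stands in for response.text so the code runs without a network request.

```python
from bs4 import BeautifulSoup


def extract_by_id(html, element_id):
    """Return the stripped text of the element with the given id, or None.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(f"#{element_id}")
    # select_one returns None when nothing matches, so check before using it
    if element is None:
        return None
    return element.get_text(strip=True)


# Inline HTML stands in for response.text so the sketch runs offline
sample_html = '<div id="shortcutArea" role="region">Shortcuts</div>'

found = extract_by_id(sample_html, "shortcutArea")
missing = extract_by_id(sample_html, "noSuchId")

print(found)    # Shortcuts
print(missing)  # None
```

With a real page, the same check avoids an AttributeError when the site's markup no longer contains the id you expected.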

Now let's look at the main methods beautifulsoup provides.

find(name, attrs, recursive, string, **kwargs)

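name - The tag name, or a list of tag names, to match.
attrs - A dictionary of attribute names and values; matches tags that have those attributes.
recursive - Whether to search all descendants (True, the default) or only direct children (False).
string - Matches against the text of the tag.
**kwargs - Additional attribute filters given as keyword arguments (for example id='main' or class_='content').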

from bs4 import BeautifulSoup

# Sample HTML
html_content = """
<html>
    <body>
        <div class="content">
            <h1>Main Title</h1>
            <p>Paragraph 1</p>
            <p>Paragraph 2</p>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
            </ul>
        </div>
        <div class="sidebar">
            <h2>Sidebar Title</h2>
            <p>Additional information</p>
        </div>
    </body>
</html>
"""

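# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Find specific tags with the find method
title_tag = soup.find('h1')  # Find by tag name
paragraph_tag = soup.find('p', class_ = 'content')  # Find by class attribute (no <p> has class 'content', so this returns None)
item_tag = soup.find('li', {'class': 'item', 'id': 'item2'})  # Find by several attributes (no <li> matches, so this returns None)

# Print the results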
print("Title :", title_tag.text if title_tag else "Not Found")
print("Paragraph :", paragraph_tag.text if paragraph_tag else "Not Found")
print("Item :", item_tag.text if item_tag else "Not Found")

Title : Main Title
Paragraph : Not Found
Item : Not Found

Execution result
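The example above exercises only the name and class_ filters. As a small supplementary sketch, the code below tries attrs, recursive, string, and a keyword filter on a made-up HTML snippet (the id "note" and the variable names are assumptions chosen for illustration).

```python
from bs4 import BeautifulSoup

html_content = """
<body>
  <div class="content">
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
  </div>
  <p id="note">Top-level paragraph</p>
</body>
"""
soup = BeautifulSoup(html_content, "html.parser")

# Default (recursive=True): the first <p> in document order, even though it is nested
first_any = soup.body.find("p")

# recursive=False: only direct children of <body> are considered
direct_child = soup.body.find("p", recursive=False)

# string: match a tag by its text
by_string = soup.find("p", string="Paragraph 2")

# attrs dictionary and keyword argument are two ways to filter on an attribute
by_attrs = soup.find("p", attrs={"id": "note"})
by_kwargs = soup.find(id="note")

print(first_any.text)     # Paragraph 1
print(direct_child.text)  # Top-level paragraph
print(by_string.text)     # Paragraph 2
print(by_attrs is by_kwargs)  # True
```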

find_all(name, attrs, recursive, string, limit, **kwargs)

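limit - Caps the number of results returned.
(The remaining parameters are the same as for find(). Unlike find(), which returns the first match or None, find_all() returns a list of every match.)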

from bs4 import BeautifulSoup

# Sample HTML
html_content = """
<html>
    <body>
        <div class="content">
            <h1>Main Title</h1>
            <p>Paragraph 1</p>
            <p>Paragraph 2</p>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
            </ul>
        </div>
        <div class="content">
            <h2>Secondary Title</h2>
            <p>Another paragraph</p>
        </div>
    </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Find every matching tag with the find_all method
paragraphs = soup.find_all('p')  # Find all <p> tags
content_divs = soup.find_all('div', class_ = 'content')  # Find by class attribute

# Print the results
print("Paragraphs : ")
for paragraph in paragraphs:
    print(paragraph.text)

print("\nContent Divs : ")
for div in content_divs:
    print(div.text)

Paragraphs :
Paragraph 1      
Paragraph 2      
Another paragraph

Content Divs :   

Main Title       
Paragraph 1      
Paragraph 2      

Item 1
Item 2



Secondary Title  
Another paragraph

Execution result
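The limit parameter and the list form of name are not used in the example above, so here is a short sketch of both. The HTML snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html_content = """
<div class="content">
  <h1>Main Title</h1>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
  </ul>
  <h2>Secondary Title</h2>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# limit=2 stops the search after two matches
first_two = [li.text for li in soup.find_all("li", limit=2)]

# A list of names matches any of the listed tags, in document order
headings = [h.text for h in soup.find_all(["h1", "h2"])]

# With no match, find_all returns an empty list rather than None
no_match = soup.find_all("table")

print(first_two)      # ['Item 1', 'Item 2']
print(headings)       # ['Main Title', 'Secondary Title']
print(len(no_match))  # 0
```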

select(css_selector)

Selects elements using a CSS selector.
The return value is a list containing every element that matched.

from bs4 import BeautifulSoup

# Sample HTML
html_content = """
<html>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph">Paragraph 2</p>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
            </ul>
        </div>
        <div id="secondary-content">
            <h2>Secondary Title</h2>
            <p class="paragraph">Another paragraph</p>
        </div>
    </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Select elements by CSS selector with the select method
main_title = soup.select('#main-content h1')  # h1 tags inside the element with id 'main-content'
paragraphs = soup.select('.paragraph')  # every element with class 'paragraph' (all <p> tags here)
secondary_title = soup.select('#secondary-content h2')  # h2 tags inside the element with id 'secondary-content'

# Print the results
print("Main Title :", main_title[0].text)
print("\nParagraphs :")
for paragraph in paragraphs:
    print(paragraph.text)
print("\nSecondary Title :", secondary_title[0].text)

Main Title : Main Title

Paragraphs :
Paragraph 1
Paragraph 2
Another paragraph

Secondary Title : Secondary Title

Execution result
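Because select always returns a list, the example above indexes with [0], which raises IndexError when nothing matches. The sketch below contrasts select with select_one, which returns the first match or None. The HTML snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html_content = """
<div id="main-content">
  <p class="paragraph">Paragraph 1</p>
  <ul>
    <li>Item 1</li>
    <li>Item 2</li>
  </ul>
</div>
<div id="secondary-content">
  <p class="paragraph">Another paragraph</p>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# The child combinator (>) restricts the match to direct children
main_paragraphs = soup.select("#main-content > p.paragraph")

# select returns an empty list when nothing matches
missing_list = soup.select("#main-content h3")

# select_one returns the first match, or None when nothing matches
first_item = soup.select_one("ul > li")
missing_one = soup.select_one("h3")

print([p.text for p in main_paragraphs])  # ['Paragraph 1']
print(len(missing_list))                  # 0
print(first_item.text)                    # Item 1
print(missing_one)                        # None
```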

prettify()

Returns the parsed document as a nicely formatted string, adding indentation and line breaks to make it easier to read.

from bs4 import BeautifulSoup

# Sample HTML
html_content = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph">Paragraph 2</p>
        </div>
    </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Use the prettify method to get the HTML in a tidy, indented form
pretty_html = soup.prettify()

# Print the result
print(pretty_html)

<html>
 <head>
  <title>
   Sample Page
  </title>
 </head>
 <body>
  <div id="main-content">
   <h1>
    Main Title
   </h1>
   <p class="paragraph"> 
    Paragraph 1
   </p>
   <p class="paragraph"> 
    Paragraph 2
   </p>
  </div>
 </body>
</html>

Execution result
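Note that prettify puts each tag and each piece of text on its own line, so the whitespace in the result differs from the original markup. The small sketch below compares it with str(soup), which serializes without adding whitespace. The one-line HTML input is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hi</p></div>", "html.parser")

# prettify returns a str with one tag or text node per line
pretty = soup.prettify()

# str(soup) serializes the tree without inserting any whitespace
compact = str(soup)

# Strip the indentation to see what ends up on each line
pretty_lines = [line.strip() for line in pretty.splitlines()]

print(pretty_lines)  # ['<div>', '<p>', 'Hi', '</p>', '</div>']
print(compact)       # <div><p>Hi</p></div>
```

For that reason prettify is best used for inspecting a document by eye, while str(soup) is the safer choice when the markup itself needs to be preserved.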

get_text(separator, strip)

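Extracts the text of the currently selected element, including the text of all its descendants.
separator - The string placed between the individual pieces of text.
strip - Whether to strip leading and trailing whitespace from each piece of text.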

from bs4 import BeautifulSoup

# Sample HTML
html_content = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph">Paragraph 2</p>
        </div>
    </body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Extract text with get_text (defaults)
text_default = soup.get_text()
print("Default Separator:")
print(text_default)

# Extract text with get_text (custom separator)
text_with_separator = soup.get_text(separator = ' | ')
print("\nCustom Separator : ")
print(text_with_separator)

# Extract text with get_text (strip applied)
text_with_strip = soup.get_text(strip = True)
print("\nWith Strip : ")
print(text_with_strip)


Default Separator:

Sample Page



Main Title
Paragraph 1        
Paragraph 2        





Custom Separator : 

 | 
 |
 | Sample Page |
 |
 |
 |
 | Main Title |
 | Paragraph 1 |
 | Paragraph 2 |
 |
 |
 |


With Strip :
Sample PageMain TitleParagraph 1Paragraph 2

Execution result
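In the output above, separator alone leaves the whitespace-only pieces in place, and strip alone glues the words together. Combining the two usually gives the most readable result. A minimal sketch on a trimmed-down version of the same HTML (variable names are assumptions):

```python
from bs4 import BeautifulSoup

html_content = """
<div id="main-content">
    <h1>Main Title</h1>
    <p class="paragraph">Paragraph 1</p>
    <p class="paragraph">Paragraph 2</p>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")

# strip=True drops whitespace-only pieces; separator joins what is left
joined = soup.get_text(separator=" | ", strip=True)

# stripped_strings yields the same stripped pieces one at a time
pieces = list(soup.stripped_strings)

print(joined)  # Main Title | Paragraph 1 | Paragraph 2
print(pieces)  # ['Main Title', 'Paragraph 1', 'Paragraph 2']
```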

find_parent(name, attrs, **kwargs)

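name - The name of the parent element to find. A string or a regular expression can be used.
attrs - A dictionary of attributes and values the parent element must have.
**kwargs - Additional search conditions given as keyword arguments.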

from bs4 import BeautifulSoup

html_content = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph">Paragraph 2</p>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find the parent of the first p element with class 'paragraph'
parent_of_paragraph = soup.find("p", class_ = "paragraph").find_parent()
print("Parent Element :")
print(parent_of_paragraph)

# Find the parent of the div element with id 'main-content' (its parent is <body>)
parent_of_div = soup.find("div", id = "main-content").find_parent()
print("\nParent Element of Div :")
print(parent_of_div)

Parent Element :
<div id="main-content">
<h1>Main Title</h1>
<p class="paragraph">Paragraph 1</p>
<p class="paragraph">Paragraph 2</p>
</div>

Parent Element of Div :
<body>
<div id="main-content">
<h1>Main Title</h1>
<p class="paragraph">Paragraph 1</p>
<p class="paragraph">Paragraph 2</p>
</div>
</body>

Execution result
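The example above calls find_parent() without arguments, which gives the immediate parent. Passing a name or an attribute filter walks further up to the nearest ancestor that matches. A short sketch, with a made-up HTML snippet and variable names:

```python
from bs4 import BeautifulSoup

html_content = """
<body>
  <div id="main-content">
    <p class="paragraph">Paragraph 1</p>
  </div>
</body>
"""
soup = BeautifulSoup(html_content, "html.parser")
paragraph = soup.find("p", class_="paragraph")

# name: the nearest ancestor with that tag name
nearest_div = paragraph.find_parent("div")
body = paragraph.find_parent("body")

# **kwargs: filter ancestors by attribute instead of by name
by_id = paragraph.find_parent(id="main-content")

# None is returned when no ancestor matches
no_table = paragraph.find_parent("table")

print(nearest_div["id"])     # main-content
print(body.name)             # body
print(by_id is nearest_div)  # True
print(no_table)              # None
```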

find_previous_sibling(name, attrs, **kwargs)

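name - The name of the previous sibling element to find. A string or a regular expression can be used.
attrs - A dictionary of attributes and values the previous sibling element must have.
**kwargs - Additional search conditions given as keyword arguments.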

from bs4 import BeautifulSoup

html_content = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph">Paragraph 2</p>
            <span>Span Element</span>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find the previous sibling of the first p element with class 'paragraph'
previous_sibling_of_paragraph = soup.find("p", class_= "paragraph").find_previous_sibling()
print("Previous Sibling Element : ")
print(previous_sibling_of_paragraph)

# Find the previous sibling of the span element
previous_sibling_of_span = soup.find("span").find_previous_sibling()
print("\nPrevious Sibling Element of Span : ")
print(previous_sibling_of_span)

Previous Sibling Element :
<h1>Main Title</h1>

Previous Sibling Element of Span :  
<p class="paragraph">Paragraph 2</p>

Execution result
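When a name is given, find_previous_sibling skips over siblings that do not match and keeps looking backwards. The sketch below also shows the plain previous_sibling attribute, which in indented HTML is often just a whitespace text node. The HTML snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html_content = """
<div id="main-content">
    <h1>Main Title</h1>
    <p class="paragraph">Paragraph 1</p>
    <p class="paragraph">Paragraph 2</p>
    <span>Span Element</span>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")
span = soup.find("span")
h1 = soup.find("h1")

# Nearest preceding <p> sibling of the span
nearest_p = span.find_previous_sibling("p")

# Both <p> siblings are skipped until an <h1> is found
heading = span.find_previous_sibling("h1")

# Nothing of that name precedes <h1>, so the result is None
none_before = h1.find_previous_sibling("p")

# The raw attribute returns the whitespace between the tags
raw_previous = span.previous_sibling

print(nearest_p.text)        # Paragraph 2
print(heading.text)          # Main Title
print(none_before)           # None
print(repr(raw_previous.strip()))  # ''
```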

has_attr(name)

Checks whether the current tag has the specified attribute and returns True or False.

from bs4 import BeautifulSoup

html_content = """
<html>
    <head>
        <title>Sample Page</title>
    </head>
    <body>
        <div id="main-content">
            <h1>Main Title</h1>
            <p class="paragraph">Paragraph 1</p>
            <p class="paragraph" custom_attr="example">Paragraph 2</p>
            <span>Span Element</span>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Check whether the first p element with class 'paragraph' has the custom_attr attribute (find returns only the first match)
paragraph_with_attr = soup.find("p", class_="paragraph")
if paragraph_with_attr.has_attr("custom_attr"):
    print(f"{paragraph_with_attr} has 'custom_attr' attribute.")
else:
    print(f"{paragraph_with_attr} does not have 'custom_attr' attribute.")

<p class="paragraph">Paragraph 1</p> does not have 'custom_attr' attribute.

Execution result
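The example above checks only the first paragraph, which does not carry the attribute. As a supplementary sketch, the code below checks every paragraph, then looks up the tag that does define custom_attr and reads its value. The trimmed HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html_content = """
<div id="main-content">
    <p class="paragraph">Paragraph 1</p>
    <p class="paragraph" custom_attr="example">Paragraph 2</p>
</div>
"""
soup = BeautifulSoup(html_content, "html.parser")
paragraphs = soup.find_all("p", class_="paragraph")

# has_attr returns a plain bool for each tag
flags = [p.has_attr("custom_attr") for p in paragraphs]

# A value of True in attrs matches any tag that defines the attribute
tagged = soup.find("p", attrs={"custom_attr": True})

# get() reads an attribute value and returns None when it is absent
value = tagged.get("custom_attr")
missing = paragraphs[0].get("custom_attr")

print(flags)        # [False, True]
print(tagged.text)  # Paragraph 2
print(value)        # example
print(missing)      # None
```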
