Chicago Restaurant Analysis Example

BASHA TECH

Basha 2022. 9. 30. 12:12

D:\big15\pandas-dev>D:/Anaconda3/Scripts/activate

(base) D:\big15\pandas-dev>conda activate pandas-dev

(pandas-dev) D:\big15\pandas-dev>pip install bs4
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 kB 3.8 MB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.2.post1-py3-none-any.whl (37 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1257 sha256=152f345064b31fd00b7485d66701b90fc3a1b675d65a2b0c642f4a0dc5eb6e8e
  Stored in directory: c:\users\tj\appdata\local\pip\cache\wheels\75\78\21\68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.11.1 bs4-0.0.1 soupsieve-2.3.2.post1

(pandas-dev) D:\big15\pandas-dev>pip install tqbm
ERROR: Could not find a version that satisfies the requirement tqbm (from versions: none)
ERROR: No matching distribution found for tqbm

(pandas-dev) D:\big15\pandas-dev>pip install tqdm
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 4.3 MB/s eta 0:00:00
Requirement already satisfied: colorama in d:\anaconda3\envs\pandas-dev\lib\site-packages (from tqdm) (0.4.5)
Installing collected packages: tqdm
Successfully installed tqdm-4.64.1

03 시카고 샌드위치 맛집 분석.ipynb

0.41MB

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Load the data
page = open('../data/03. test_first.html', 'r').read() # read the whole file into one string
print(page)

# 1. Parse => build the DOM tree => extract with find / find_all => DataFrame / Series
# BeautifulSoup(markup to parse, parser name) does the parsing and returns a BeautifulSoup object.
soup = BeautifulSoup(page, 'html.parser')
type(soup) # check the type

print(soup.prettify())

soup.children # an iterator; wrap it in list() to see the contents

list(soup.children)

list(soup.body.children)

soup.find_all('p') # find every match => returns a list-like ResultSet

soup.find('p') # find only the first match

soup.find_all('p', class_='outer-text')

soup.find_all(class_='outer-text')

links = soup.find_all('a')
links

for each in links:
    print(each)

for each in links:
    href = each['href']
    print(href)

for each in links:
    href = each['href']
    print(href)
    print(each.string)
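The cells above depend on a local HTML file that readers may not have. As a minimal sketch of the same calls, the block below parses an inline string instead. The markup and the example.com links are hypothetical stand-ins, not the contents of the post's test file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the post's local test file
html = """
<html><body>
  <p class="outer-text">First paragraph</p>
  <p class="inner-text">Second paragraph</p>
  <a href="https://example.com/a">Link A</a>
  <a href="https://example.com/b">Link B</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')                          # first matching tag only
all_p = soup.find_all('p')                        # every matching tag
outer = soup.find_all('p', class_='outer-text')   # filter by CSS class

# Collect (href, link text) pairs from every <a> tag
links = [(a['href'], str(a.string)) for a in soup.find_all('a')]

print(first_p.string)
print(len(all_p), len(outer))
print(links)
```

Indexing a tag with `['href']` raises `KeyError` when the attribute is missing; `a.get('href')` returns `None` instead, which may be more convenient on real pages.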

# Extract the KRW/USD exchange rate from Naver's market index page
# url = 'https://finance.naver.com/marketindex/'
from urllib.request import urlopen

url = 'https://finance.naver.com/marketindex/'
page = urlopen(url)
page

url = 'https://finance.naver.com/marketindex/'
page = urlopen(url)
# Parse
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())

soup.find_all('span', class_='value')

soup.find_all('span', class_='value')[0]

soup.find_all('span', class_='value')[0].string

# Best sandwiches in Chicago
# https://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/
url_base = 'https://www.chicagomag.com'
url_sub = '/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'

# URL of the main list page
url = url_base + url_sub

# Work around the HTTP 403 response
from urllib.error import URLError, HTTPError
import urllib.request

try:
    headers = {'User-Agent' : 'Chrome/105.0.5195.128'} # version shown at chrome://settings/help
    # Without this, urllib sends its default User-Agent ('Python-urllib/3.x').
    req = urllib.request.Request(url, headers=headers) # build the request with these headers
    # html = urlopen(url)
    html = urlopen(req)
except HTTPError as e:
    err =  e.read()
    code = e.getcode()
    print(err, code)
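In the cell above, `html` is never assigned when the request fails, so the parsing cell that follows would raise a `NameError`. One possible way to make that case explicit is sketched below. The helper names `build_request` and `fetch` and the `opener` parameter are assumptions made for illustration; they do not appear in the original notebook.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

USER_AGENT = 'Chrome/105.0.5195.128'  # same value as in the post


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None on an HTTP error.

    `opener` is a hypothetical hook so the function can be tried
    without touching the network.
    """
    try:
        with opener(build_request(url)) as response:
            return response.read()
    except HTTPError as e:
        print('HTTP error', e.code, 'for', url)
        return None


# Building the request does not send anything yet
req = build_request('https://www.chicagomag.com/')
print(req.get_header('User-agent'))  # Request stores header names capitalized
```

A caller can then check `if html is not None:` before handing the bytes to `BeautifulSoup`.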

# Parse
soup = BeautifulSoup(html, 'html.parser')
soup

print(soup.find_all('div', class_='sammy'))

len(soup.find_all('div', class_='sammy'))

soup.find_all('div', class_='sammy')[0]

tmp_one = soup.find_all('div',class_='sammy')[0]
type(tmp_one)

print(tmp_one)

# rank
tmp_one.find(class_='sammyRank')

tmp_one.find(class_='sammyRank').get_text()

# Extract the text
tmp_one.find(class_='sammyListing')

tmp_one.find(class_='sammyListing').get_text()

tmp_one.find(class_='sammyListing').get_text().split('\n')

tmp_one.find(class_='sammyListing').get_text().split('\n')[:2]

# Extract the href attribute
tmp_one.find('a')

tmp_one.find('a')['href'] # attribute access: returns the path of the detail page

url_base

url_base + tmp_one.find('a')['href']

# Extract rank, cafe name, menu item, and detail-page URL
# Create one list for each of the four fields
from urllib.parse import urljoin
from tqdm import tqdm

rank = []
main_menu = []
cafe_name = []
url_add = []

# Collect the 50 ranked entries
list_soup = soup.find_all('div', class_='sammy') # find_all returns a list-like ResultSet, so it can be iterated

# Repeat 50 times
for item in tqdm(list_soup):
    # Wrapping the list in tqdm() shows a progress bar with the completion percentage
    # Extract the rank
    rank.append(item.find(class_='sammyRank').get_text()) # => 1
    tmp_string = item.find(class_='sammyListing').get_text()
    # 'BLT\nOld Oak Tap\nRead more '
    main_menu.append(tmp_string.split('\n')[0]) # 0 => menu
    cafe_name.append(tmp_string.split('\n')[1]) # 1 => cafe_name
    # Build the URL
    # url_add.append(url_base+item.find('a')['href'])  # works only while every href is a relative path
    # urljoin returns an absolute href unchanged and resolves a relative one against url_base
    url_add.append(urljoin(url_base, item.find('a')['href']))
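To illustrate why the loop uses `urljoin` rather than `+`, the sketch below compares the two on a relative and an absolute href. It reuses the list page's own path from earlier in the post as the sample href; the real detail-page hrefs are not shown here.

```python
from urllib.parse import urljoin

url_base = 'https://www.chicagomag.com'
relative = '/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
absolute = url_base + relative  # an href that is already a full URL

joined_relative = urljoin(url_base, relative)   # resolved against the base
joined_absolute = urljoin(url_base, absolute)   # returned unchanged
concatenated = url_base + absolute              # base is duplicated: malformed

print(joined_relative)
print(joined_absolute)
print(concatenated)
```

With `+`, a page that mixes relative and absolute links would yield broken URLs for the absolute ones, while `urljoin` handles both cases.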

tmp_string = tmp_one.find(class_='sammyListing').get_text()
tmp_string

tmp_string.split('\n')

tmp_string.split('\n')[0]
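The loop above calls `split('\n')` twice per item. As a small sketch, the split can be done once and unpacked into both fields. The helper name `parse_listing` is hypothetical, and the sample string is the one quoted in the loop's own comment.

```python
def parse_listing(text):
    """Split a sammyListing text into (menu, cafe) using its first two lines."""
    menu, cafe = text.split('\n')[:2]
    return menu, cafe


sample = 'BLT\nOld Oak Tap\nRead more '  # string quoted in the post's comment
menu, cafe = parse_listing(sample)
print(menu, '|', cafe)
```

The unpacking raises `ValueError` if a listing has fewer than two lines, which surfaces malformed entries early instead of silently shifting columns.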

len(rank), len(main_menu), len(cafe_name), len(url_add)

rank[0], main_menu[0], cafe_name[0], url_add[0]

# Create the DataFrame
data_df = pd.DataFrame({
      'Rank' : rank # each key becomes a column name; each list supplies that column's values
    , 'Menu' : main_menu
    , 'Cafe' : cafe_name
    , 'URL' : url_add
})
data_df.head()

data_df.to_csv('../data/03. best_sand', encoding='utf-8')

# Visit each detail page and extract the price and address
url_ = data_df['URL'][0] # URL of the first detail page
url_

headers = {'User-Agent' : 'Chrome/105.0.5195.128'}
req = urllib.request.Request(url_, headers=headers)
html = urlopen(req)

soup_tmp = BeautifulSoup(html, 'html.parser')
soup_tmp

soup_tmp.find('p', class_='addy')

soup_tmp.find('p', class_='addy').get_text()

soup_tmp.find('p', class_='addy').get_text().split()

detail = soup_tmp.find('p', class_='addy').get_text().split()

detail[0]

detail

detail[1:-2] # everything except the first token and the last two

' '.join(detail[1:-2])
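The slicing steps above can be collected into one function. This is a sketch only: the helper name `parse_addy` is hypothetical, and the sample text is made up to resemble a "price, street address, phone, website" line. The real page text may be laid out differently.

```python
def parse_addy(text):
    """Return (price, address) from whitespace-separated 'addy' text.

    Mirrors the post's logic: the first token minus its last character
    is the price; the tokens between it and the last two form the address.
    """
    tokens = text.split()
    price = tokens[0][:-1]
    address = ' '.join(tokens[1:-2])
    return price, address


# Made-up sample in an assumed layout, not taken from the site
sample = '$9. 123 Example St., 555-0100, example.com'
price, address = parse_addy(sample)
print(price)
print(address)
```

On text shaped like this sample, the address keeps a trailing comma; `address.rstrip(',')` would trim it if that matters for later geocoding or display.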

# Repeat 50 times: extract the price and address
price = []
address = []

for url_ in data_df['URL']:
    headers = {'User-Agent' : 'Chrome/105.0.5195.128'}
    req = urllib.request.Request(url_, headers=headers)
    html = urlopen(req)
    soup = BeautifulSoup(html, 'html.parser')
    # p tag, class = addy
    gettings = soup.find('p', class_='addy').get_text()
    price.append(gettings.split()[0][:-1]) # first token minus its last character
    address.append(' '.join(gettings.split()[1:-2])) # drop the price and the last two tokens

price[0], address[0]

data_df['Price'] = price
data_df['Address'] = address

data_df.head()

data_df = data_df.loc[:,['Rank','Cafe','Menu','Price','Address']]
data_df.head()

# Move the Rank column into the index.
# data_df.index = data_df['Rank'] => assigns the index, but the Rank column still remains
# What we want: Rank becomes the index and the Rank column disappears.
# data_df.set_index('Rank') # returns a new DataFrame; the original is left unchanged
data_df.set_index('Rank', inplace=True) # modifies data_df in place
data_df.head()
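The difference between the two `set_index` calls in the comments above can be checked on a tiny frame. The rows below are hypothetical placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical rows, for illustration only
df = pd.DataFrame({
    'Rank': ['1', '2', '3'],
    'Cafe': ['Cafe A', 'Cafe B', 'Cafe C'],
    'Menu': ['Menu A', 'Menu B', 'Menu C'],
})

# Without inplace=True, set_index returns a new DataFrame
indexed = df.set_index('Rank')

print('Rank' in df.columns)      # the original still has the Rank column
print(indexed.index.name)        # the new frame is indexed by Rank
print(list(indexed.columns))     # and no longer has a Rank column
```

Reassigning the result (`data_df = data_df.set_index('Rank')`) is an alternative to `inplace=True` that gives the same final frame.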

License: Attribution, NonCommercial, NoDerivatives
