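BeautifulSoup and Selenium can both process web pages, but they serve entirely different purposes.
Selection principles
If the page's HTML already contains the target data, use BeautifulSoup. It is fast, simple, and light on resources.
If the page requires executing JavaScript, logging in, clicking, scrolling, or handling dynamic content, use Selenium.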
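One rough way to apply this rule is to parse the raw HTML first and check whether the target element is already present. The sketch below is only an illustration: the helper name `choose_tool`, the sample pages, and the selector `p.title` are assumptions made for the example, not part of any library.

```python
from bs4 import BeautifulSoup


def choose_tool(raw_html: str, css_selector: str) -> str:
    """Suggest a tool based on whether the raw HTML already holds the target."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # If the selector matches in the unrendered HTML, no browser is needed.
    if soup.select_one(css_selector) is not None:
        return "BeautifulSoup"
    # Otherwise the content is probably injected later by JavaScript.
    return "Selenium"


# Hypothetical sample pages: one static, one that is only an empty app shell.
static_page = '<div id="content"><p class="title">Hello</p></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(choose_tool(static_page, "p.title"))
print(choose_tool(dynamic_page, "p.title"))
```

A missing element is only a hint, not proof, that JavaScript is involved; the selector may simply be wrong, so it is worth inspecting the raw HTML by hand before reaching for a browser.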
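BeautifulSoup initialization
from bs4 import BeautifulSoup
# html_doc is assumed to be a string holding the page's HTML, defined elsewhere
soup = BeautifulSoup(html_doc, "html.parser")
Finding elements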
first_p = soup.find("p")
all_links = soup.find_all("a")
title = soup.find("p", class_="title")
CSS selectors:
links = soup.select("a.sister")
id_link = soup.select("#link1")
nested = soup.select("p.story a")
Getting text and attributes:
text = soup.find("p").get_text()
href = soup.find("a").get("href")
Modifying and removing:
tag = soup.find("b")
tag.string = "New Title"
link = soup.find("a", id="link1")
link.decompose()
Selenium initialization
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://example.com")
Common browser operations:
driver.maximize_window()
driver.refresh()
driver.back()
driver.forward()
print(driver.current_url)
print(driver.title)
Locating elements
from selenium.webdriver.common.by import By
driver.find_element(By.ID, "username")
driver.find_element(By.NAME, "email")
driver.find_element(By.CSS_SELECTOR, "button.submit")
driver.find_element(By.XPATH, "//div[@id='content']")
Interacting with elements
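# input_box, button, and form are WebElement objects returned by driver.find_element(...)
input_box.send_keys("my_username")
input_box.clear()
button.click()
form.submit()
Explicit waits
On dynamic pages, wait for an explicit condition instead of sleeping blindly.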
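Conceptually, an explicit wait is a polling loop with a deadline: it returns as soon as the condition holds instead of always paying for a fixed sleep. The sketch below is plain Python written for illustration, not Selenium's actual implementation; `wait_until`, `element_is_ready`, and the simulated "element" are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return immediately once the condition is satisfied.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page state: the "element" only becomes available on the third check.
attempts = {"count": 0}


def element_is_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_is_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Selenium's `WebDriverWait(...).until(...)` follows the same pattern: it returns the condition's result once it is truthy and raises `TimeoutException` when the deadline passes.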
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "username"))
)
Complex page operations
Scrolling:
driver.execute_script("arguments[0].scrollIntoView();", element)
Alerts:
alert = driver.switch_to.alert
alert.accept()
iframe:
driver.switch_to.frame("iframe_name")
driver.switch_to.default_content()
Screenshots:
driver.save_screenshot("page.png")
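element.screenshot("element.png")
In practice, the usual order is: parse the static HTML first, then parse any JSON embedded in the page, and start browser automation only as a last resort.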
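That order can be sketched as a fallback chain. Everything specific here is an assumption for illustration: the function name `extract_price`, the `span.price` selector, the `price` JSON key, and the `render_with_browser` callable, which stands in for a Selenium step that returns rendered HTML.

```python
import json

from bs4 import BeautifulSoup


def extract_price(html, render_with_browser=None):
    """Try static HTML, then embedded JSON, then an optional browser render."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: look for the value directly in the static HTML.
    node = soup.select_one("span.price")
    if node is not None:
        return node.get_text(strip=True)

    # Step 2: look for JSON embedded in a <script> tag.
    script = soup.find("script", type="application/json")
    if script is not None and script.string:
        data = json.loads(script.string)
        if isinstance(data, dict) and "price" in data:
            return str(data["price"])

    # Step 3: only now pay for a browser. The callable is expected to return
    # rendered HTML, for example Selenium's driver.page_source.
    if render_with_browser is not None:
        return extract_price(render_with_browser())

    return None


# Hypothetical inputs covering each step of the chain.
static_html = '<span class="price">9.99</span>'
json_html = '<script type="application/json">{"price": 12.5}</script>'
empty_html = '<div id="app"></div>'

print(extract_price(static_html))
print(extract_price(json_html))
print(extract_price(empty_html, render_with_browser=lambda: static_html))
```

Passing the browser step in as a callable keeps the cheap paths free of any Selenium dependency, so a driver is only started when the first two steps find nothing.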