python_fastapi

[프로젝트]중기부 보도자료 크롤링 & MySql 저장

알세지 2024. 5. 29. 00:43

프로젝트 작업 정리

1. Python 환경 설정

가상환경 생성 및 활성화

python -m venv venv
source venv/bin/activate  # For Mac/Linux
venv\Scripts\activate     # For Windows

필요한 라이브러리 설치

pip install beautifulsoup4 requests fastapi uvicorn sqlalchemy mysql-connector-python jinja2

2. 데이터베이스 설정

MySQL 설치 및 설정

  1. Homebrew로 MySQL 설치:
  2. brew install mysql
  3. MySQL 서비스 시작:
  4. brew services start mysql
  5. MySQL 클라이언트 접속:
  6. mysql -u root -p
  7. 데이터베이스 및 사용자 생성:
  8. CREATE DATABASE your_dbname; CREATE USER 'your_username'@'localhost' IDENTIFIED BY 'your_password'; GRANT ALL PRIVILEGES ON your_dbname.* TO 'your_username'@'localhost'; FLUSH PRIVILEGES;

3. 데이터베이스 모델 정의 (models.py)

from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

DATABASE_URL = "mysql+mysqlconnector://your_username:your_password@localhost/your_dbname"

engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

class Article(Base):
    __tablename__ = 'articles'
    id = Column(Integer, primary_key=True, index=True)
    title = Column(String(255), index=True)
    department = Column(String(255))
    date = Column(DateTime)
    attachment = Column(String(255))

Base.metadata.create_all(bind=engine)

4. 웹 크롤러 작성 (scraper.py)

import os
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from models import SessionLocal, Article

BASE_URL = "https://www.mss.go.kr/site/smba/ex/bbs/List.do?cbIdx=86"
DOWNLOAD_DIR = "downloads"

def fetch_page(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')

def download_file(url, filename):
    response = requests.get(url, stream=True)
    if not os.path.exists(DOWNLOAD_DIR):
        os.makedirs(DOWNLOAD_DIR)
    with open(os.path.join(DOWNLOAD_DIR, filename), 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

def scrape_board():
    soup = fetch_page(BASE_URL)
    rows = soup.select('#contents_inner > div > table > tbody > tr')

    rows = rows[:5]  # 최근 5건만 처리

    for row in rows:
        title = row.select_one('td.subject a').get_text(strip=True)
        department = row.select_one('td:nth-child(3)').get_text(strip=True)
        date_str = row.select_one('td:nth-child(5)').get_text(strip=True)

        try:
            date = datetime.strptime(date_str, '%Y.%m.%d')
        except ValueError:
            print(f"Date format error: {date_str}")
            continue

        attachment_tag = row.select_one('td.attached-files a')
        if attachment_tag:
            attachment_link = attachment_tag['href']
            attachment_url = f"https://www.mss.go.kr{attachment_link}"
            attachment_name = attachment_url.split('streFileNm=')[-1]

            download_file(attachment_url, attachment_name)
        else:
            attachment_name = None

        db = SessionLocal()
        save_to_db(db, title, department, date, attachment_name)
        db.close()

def save_to_db(db, title, department, date, attachment_name):
    article = Article(title=title, department=department, date=date, attachment=attachment_name)
    db.add(article)
    db.commit()

if __name__ == "__main__":
    scrape_board()

5. FastAPI 웹 서버 작성 (main.py)

from fastapi import FastAPI, Depends, Request, HTTPException
from sqlalchemy.orm import Session
from fastapi.responses import FileResponse
from fastapi.templating import Jinja2Templates
import os
from models import SessionLocal, Article

app = FastAPI()
templates = Jinja2Templates(directory="templates")

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/articles")
def read_articles(request: Request, skip: int = 0, limit: int = 10, db: Session = Depends(get_db)):
    articles = db.query(Article).offset(skip).limit(limit).all()
    return templates.TemplateResponse("index.html", {"request": request, "articles": articles})

@app.get("/download/{filename}")
def download_file(filename: str):
    file_path = os.path.join("downloads", filename)
    if not os.path.exists(file_path):
        raise HTTPException(status_code=404, detail="File not found")
    return FileResponse(path=file_path, filename=filename)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

6. HTML 템플릿 작성 (templates/index.html)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <title>Article Board</title>
    <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
    <div class="container">
        <h1 class="mt-5">Article Board</h1>
        <table class="table table-bordered mt-3">
            <thead class="thead-dark">
                <tr>
                    <th scope="col">Title</th>
                    <th scope="col">Department</th>
                    <th scope="col">Date</th>
                    <th scope="col">Attachment</th>
                </tr>
            </thead>
            <tbody>
                {% for article in articles %}
                <tr>
                    <td>{{ article.title }}</td>
                    <td>{{ article.department }}</td>
                    <td>{{ article.date.strftime('%Y-%m-%d') }}</td>
                    <td>
                        {% if article.attachment %}
                        <a href="/download/{{ article.attachment }}" class="btn btn-primary">Download</a>
                        {% else %}
                        No attachment
                        {% endif %}
                    </td>
                </tr>
                {% endfor %}
            </tbody>
        </table>
    </div>
    <script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.5.4/dist/umd/popper.min.js"></script>
    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>

어려웠던 부분들

1. MySQL 설치 및 설정

  • Homebrew로 MySQL을 설치하고, 환경 변수를 설정하는 과정에서 어려움을 겪음.
  • 해결: ~/.zshrc 또는 ~/.bash_profile 파일에 MySQL 경로를 추가하고 적용.

2. 크롤링한 데이터의 날짜 형식 파싱

  • 날짜 형식이 올바르지 않아 파싱 오류가 발생.
  • 해결: 여러 날짜 형식을 시도하여 올바르게 파싱되도록 수정.

3. 첨부 파일 다운로드

  • 첨부 파일의 링크를 올바르게 추출하고, 파일 이름을 설정하는 데 어려움을 겪음.
  • 해결: URL에서 streFileNm 파라미터를 추출하여 파일 이름을 설정.

4. FastAPI 서버 설정 및 파일 다운로드

  • FastAPI 서버에서 다운로드 링크가 올바르게 작동하지 않음.
  • 해결: 파일 경로와 파일 이름을 정확하게 설정하고, 서버에서 파일을 찾을 수 있도록 수정.