프로젝트 작업 정리
1. Python 환경 설정
가상환경 생성 및 활성화
python -m venv venv
source venv/bin/activate # For Mac/Linux
venv\Scripts\activate # For Windows
필요한 라이브러리 설치
pip install beautifulsoup4 requests fastapi uvicorn sqlalchemy mysql-connector-python jinja2
2. 데이터베이스 설정
MySQL 설치 및 설정
- Homebrew로 MySQL 설치:
brew install mysql
- MySQL 서비스 시작:
brew services start mysql
- MySQL 클라이언트 접속:
mysql -u root -p
- 데이터베이스 및 사용자 생성:
CREATE DATABASE your_dbname;
CREATE USER 'your_username'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON your_dbname.* TO 'your_username'@'localhost';
FLUSH PRIVILEGES;
3. 데이터베이스 모델 정의 (models.py
)
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
DATABASE_URL = "mysql+mysqlconnector://your_username:your_password@localhost/your_dbname"
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
class Article(Base):
__tablename__ = 'articles'
id = Column(Integer, primary_key=True, index=True)
title = Column(String(255), index=True)
department = Column(String(255))
date = Column(DateTime)
attachment = Column(String(255))
Base.metadata.create_all(bind=engine)
4. 웹 크롤러 작성 (scraper.py
)
import os
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from models import SessionLocal, Article
BASE_URL = "https://www.mss.go.kr/site/smba/ex/bbs/List.do?cbIdx=86"
DOWNLOAD_DIR = "downloads"
def fetch_page(url):
response = requests.get(url)
return BeautifulSoup(response.text, 'html.parser')
def download_file(url, filename):
response = requests.get(url, stream=True)
if not os.path.exists(DOWNLOAD_DIR):
os.makedirs(DOWNLOAD_DIR)
with open(os.path.join(DOWNLOAD_DIR, filename), 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
def scrape_board():
soup = fetch_page(BASE_URL)
rows = soup.select('#contents_inner > div > table > tbody > tr')
rows = rows[:5] # 최근 5건만 처리
for row in rows:
title = row.select_one('td.subject a').get_text(strip=True)
department = row.select_one('td:nth-child(3)').get_text(strip=True)
date_str = row.select_one('td:nth-child(5)').get_text(strip=True)
try:
date = datetime.strptime(date_str, '%Y.%m.%d')
except ValueError:
print(f"Date format error: {date_str}")
continue
attachment_tag = row.select_one('td.attached-files a')
if attachment_tag:
attachment_link = attachment_tag['href']
attachment_url = f"https://www.mss.go.kr{attachment_link}"
attachment_name = attachment_url.split('streFileNm=')[-1]
download_file(attachment_url, attachment_name)
else:
attachment_name = None
db = SessionLocal()
save_to_db(db, title, department, date, attachment_name)
db.close()
def save_to_db(db, title, department, date, attachment_name):
article = Article(title=title, department=department, date=date, attachment=attachment_name)
db.add(article)
db.commit()
if __name__ == "__main__":
scrape_board()
5. FastAPI 웹 서버 작성 (main.py
)
from fastapi import FastAPI, Depends, Request, HTTPException
from sqlalchemy.orm import Session
from fastapi.responses import FileResponse
from fastapi.templating import Jinja2Templates
import os
from models import SessionLocal, Article
app = FastAPI()
templates = Jinja2Templates(directory="templates")
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()
@app.get("/articles")
def read_articles(request: Request, skip: int = 0, limit: int = 10, db: Session = Depends(get_db)):
articles = db.query(Article).offset(skip).limit(limit).all()
return templates.TemplateResponse("index.html", {"request": request, "articles": articles})
@app.get("/download/{filename}")
def download_file(filename: str):
file_path = os.path.join("downloads", filename)
if not os.path.exists(file_path):
raise HTTPException(status_code=404, detail="File not found")
return FileResponse(path=file_path, filename=filename)
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
6. HTML 템플릿 작성 (templates/index.html
)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Article Board</title>
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<div class="container">
<h1 class="mt-5">Article Board</h1>
<table class="table table-bordered mt-3">
<thead class="thead-dark">
<tr>
<th scope="col">Title</th>
<th scope="col">Department</th>
<th scope="col">Date</th>
<th scope="col">Attachment</th>
</tr>
</thead>
<tbody>
{% for article in articles %}
<tr>
<td>{{ article.title }}</td>
<td>{{ article.department }}</td>
<td>{{ article.date.strftime('%Y-%m-%d') }}</td>
<td>
{% if article.attachment %}
<a href="/download/{{ article.attachment }}" class="btn btn-primary">Download</a>
{% else %}
No attachment
{% endif %}
</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.5.4/dist/umd/popper.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>
어려웠던 부분들
1. MySQL 설치 및 설정
- Homebrew로 MySQL을 설치하고, 환경 변수를 설정하는 과정에서 어려움을 겪음.
- 해결:
~/.zshrc
또는 ~/.bash_profile
파일에 MySQL 경로를 추가하고 적용.
2. 크롤링한 데이터의 날짜 형식 파싱
- 날짜 형식이 올바르지 않아 파싱 오류가 발생.
- 해결: 여러 날짜 형식을 시도하여 올바르게 파싱되도록 수정.
3. 첨부 파일 다운로드
- 첨부 파일의 링크를 올바르게 추출하고, 파일 이름을 설정하는 데 어려움을 겪음.
- 해결: URL에서
streFileNm
파라미터를 추출하여 파일 이름을 설정.
4. FastAPI 서버 설정 및 파일 다운로드
- FastAPI 서버에서 다운로드 링크가 올바르게 작동하지 않음.
- 해결: 파일 경로와 파일 이름을 정확하게 설정하고, 서버에서 파일을 찾을 수 있도록 수정.