

How to crawl news articles with Python's newspaper module

Introducing the newspaper module

newspaper is a module that extracts the text from a URL you specify. For details, see https://pypi.org/project/newspaper3k/.


The following quotes introduce newspaper.

 

“Newspaper is an amazing python library for extracting & curating articles.” 

“Newspaper delivers Instapaper style article extraction.” 

 

Let's walk through newspaper, from installation to usage.

Installing newspaper

The newspaper module is installed under a different package name on Python 2 and Python 3.

On Python 2, run pip install newspaper.

On Python 3, run pip3 install newspaper3k instead.
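If you are not sure which of the two commands applies to the interpreter you are running, the rule above can be written down as a small check. This is only a sketch: pip_package_name is a helper name made up for this post, not something that ships with newspaper.

```python
import sys


def pip_package_name(version_info=sys.version_info):
    """Return the PyPI package name that provides the newspaper module.

    Hypothetical helper for illustration: Python 3 needs newspaper3k,
    Python 2 needs newspaper.
    """
    return "newspaper3k" if version_info[0] >= 3 else "newspaper"


if __name__ == "__main__":
    # Prints the install command that matches the running interpreter.
    print("pip install " + pip_package_name())
```

Either way the module is imported as newspaper once it is installed.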

Trying out newspaper

 

print_code.py

 

 

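#!/usr/bin/python3
import sys

from newspaper import Article

# The article URL is passed as the first command-line argument.
url = sys.argv[1]

# keep_article_html=True also keeps the article body's HTML in a.article_html.
a = Article(url, language='ko', keep_article_html=True)
a.download()  # fetch the page
a.parse()     # extract the title and body text
print(a.title)
print(a.text)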

 

 

 

root@www:~# ./print_code.py https://leechul.tistory.com/category/SIP
LeeChul - Ti Story
ubuntu 에서 pjsip python 모듈 컴파일시 -fPIC 에러가 뜨는데 아래와 같이 하면 정상 동작.

sudo apt-get update sudo apt-get -y install build-essential python-dev libpjsua2 libssl-dev libasound2-dev wget http://www.pjsip.org/release/2.7.2/pjproject-2.7.2.tar.bz2 tar -xf pjproject-2.7.2.tar.bz2 && cd pjproject-2.7.2/ export CFLAGS="$CFLAGS -fPIC" ./configure && make dep && make cd pjsip-apps/src/python/ sudo python setup.py install

설치시 에러들.

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-nbjU53/python2.7-2.7.15~rc1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DPJ_AUTOCONF=1 -Imake[1]: Entering directory '/root/pjproject-2.7.2/pjsip-apps/src/python' -I/include -I/root/pjproject-2.7.2/pjlib/include -I/root/pjproject-2.7.2/pjlib-util/include -I/root/pjproject-2.7.2/pjnath/include -I/root/pjproject-2.7.2/pjmedia/include -I/root/pjproject-2.7.2/pjsip/include -Imake[1]: Leaving directory '/root/pjproject-2.7.2/pjsip-apps/src/python' -I/usr/include/python2.7 -c _pjsua.c -o build/temp.linux-x86_64-2.7/_pjsua.o In file included from _pjsua.c:20:0: _pjsua.h:25:10: fatal error: Python.h: No such file or directory #include <Python.h> ^~~~~~~~~~ compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 Makefile:2: recipe for target 'all' failed

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-nbjU53/python2.7-2.7.15~rc1=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DPJ_AUTOCONF=1 -Imake[1]: Entering directory '/root/pjproject-2.7.2/pjsip-apps/src/python' -I/include -I/root/pjproject-2.7.2/pjlib/include -I/root/pjproject-2.7.2/pjlib-util/include -I/root/pjproject-2.7.2/pjnath/include -I/root/pjproject-2.7.2/pjmedia/include -I/root/pjproject-2.7.2/pjsip/include -Imake[1]: Leaving directory '/root/pjproject-2.7.2/pjsip-apps/src/python' -I/usr/include/python2.7 -c _pjsua.c -o build/temp.linux-x86_64-2.7/_pjsua.o In file included from _pjsua.c:20:0: _pjsua.h:25:10: fatal error: Python.h: No such file or directory #include <Python.h> ^~~~~~~~~~ compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 Makefile:2: recipe for target 'all' failed

make -f /root/pjproject-2.7.2/build/rules.mak APP=YUV app=libyuv depend make[3]: Entering directory '/root/pjproject-2.7.2/third_party/build/yuv' .libyuv-x86_64-unknown-linux-gnu.depend:1: *** missing separator. Stop. make[3]: Leaving directory '/root/pjproject-2.7.2/third_party/build/yuv' Makefile:111: recipe for target 'depend' failed make[2]: *** [depend] Error 2 make[2]: Leaving directory '/root/pjproject-2.7.2/third_party/build/yuv' Makefile:7: recipe for target 'dep' failed
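print_code.py above assumes that a URL is always given and that the download always succeeds. A slightly more defensive variant might look like the sketch below. The function names get_url, fetch and main are my own, and the error handling assumes the newspaper3k behavior where a failed download surfaces as an ArticleException when parse() is called; other versions may behave differently.

```python
import sys


def get_url(argv):
    """Return the URL from the command line, or exit with a usage message."""
    if len(argv) != 2:
        raise SystemExit("usage: print_code.py URL")
    return argv[1]


def fetch(url, language="ko"):
    """Download and parse one article, returning (title, text)."""
    # Imported here so the argument check still works when newspaper
    # has not been installed yet.
    from newspaper import Article
    from newspaper.article import ArticleException

    article = Article(url, language=language)
    try:
        article.download()
        article.parse()
    except ArticleException as exc:
        raise SystemExit("could not fetch {}: {}".format(url, exc))
    return article.title, article.text


def main(argv):
    title, text = fetch(get_url(argv))
    print(title)
    print(text)
```

To use it as a script, call main(sys.argv) at the end of the file.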

 

Because the script passes keep_article_html, calling print(a.article_html) prints the article body with its HTML markup included.
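One reason to keep the HTML is that a.text drops markup such as links. As a rough sketch, the links could be pulled back out of the article_html string with the standard library's html.parser. LinkCollector and extract_links are names made up for this example, and sample_html is an invented stand-in for what a.article_html would hold.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(article_html):
    """Return every link target found in an HTML string."""
    collector = LinkCollector()
    collector.feed(article_html)
    collector.close()
    return collector.links


# Invented sample; in the real script this would be a.article_html.
sample_html = '<div><p>See <a href="https://example.com/a">this post</a>.</p></div>'
print(extract_links(sample_html))
```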

 
