Scraping Articles Under Medium's Android Tag with Scrapy

The previous post scraped articles from **the Programmer collection on Jianshu**. This one moves on to articles under the Android tag on Medium.

Code for medium_spider.py:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-

    import scrapy


    class MediumSpider(scrapy.Spider):
        """Scrape the title, subtitle and URL of posts under Medium's Android tag."""

        name = "mediumspider"

        start_urls = ["https://medium.com/tag/android"]

        def parse(self, response):
            # Each post on the tag page is wrapped in an element with class "postItem".
            for postitem in response.css('.postItem'):
                url = postitem.css('article a::attr(href)').extract()[0]

                # Get the title: try h2 first, then h3, then h4.
                if len(postitem.css('article h2::text').extract()) != 0:
                    title = postitem.css('article h2::text').extract()[0]
                elif len(postitem.css('article h3::text').extract()) != 0:
                    title = postitem.css('article h3::text').extract()[0]
                elif len(postitem.css('article h4::text').extract()) != 0:
                    title = postitem.css('article h4::text').extract()[0]
                else:
                    title = "Error get title"

                # Get the subtitle: try h4 first, then the first paragraph.
                if len(postitem.css('.section-inner h4::text').extract()) != 0:
                    subtitle = postitem.css('.section-inner h4::text').extract()[0]
                elif len(postitem.css('.section-inner p::text').extract()) != 0:
                    subtitle = postitem.css('.section-inner p::text').extract()[0]
                else:
                    subtitle = "No subtitle"

                yield {
                    "title": title,
                    "subtitle": subtitle,
                    "url": url,
                }
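
The title and subtitle lookups in the spider follow the same pattern: try several selectors in order and fall back to a placeholder string. Below is a minimal sketch of that pattern as a reusable helper. The name `first_match` is hypothetical (it is not part of Scrapy), and the `fake_post` dictionary is only a stand-in for `postitem.css(...).extract()` so that the sketch runs without Scrapy installed.

```python
def first_match(extract, selectors, default):
    """Return the first string found by trying each selector in order.

    `extract` is any callable that maps a selector string to a list of
    strings. In the spider it could be:
        lambda sel: postitem.css(sel).extract()
    """
    for selector in selectors:
        results = extract(selector)
        if results:
            return results[0]
    return default


# Stand-in data (an assumption for illustration): selector -> extracted strings.
fake_post = {
    "article h3::text": ["Android Application Architecture"],
    ".section-inner p::text": [],
}


def fake_extract(selector):
    # Mimics .extract(): returns an empty list when nothing matches.
    return fake_post.get(selector, [])


title = first_match(
    fake_extract,
    ["article h2::text", "article h3::text", "article h4::text"],
    "Error get title",
)
subtitle = first_match(
    fake_extract,
    [".section-inner h4::text", ".section-inner p::text"],
    "No subtitle",
)

print(title)
print(subtitle)
```

With a helper like this, adding another fallback selector means appending one string to a list instead of adding another elif branch.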

Run scrapy runspider medium_spider.py -o res.json to crawl the tag page and export the results to a JSON file.

Contents of res.json:

    [
        {
            "url": "https://medium.com/sebs-top-tips/analyse-data-flows-without-the-debugger-android-studio-protips-3-ef2885aaffd9?source=tags---------1",
            "subtitle": "Did you know that you can figure…",
            "title": "Analyse data flows without the debugger (Android Studio protips #3)"
        },
        {
            "url": "https://medium.com/imgur-engineering/design-at-imgur-from-188-colors-to-12-bb6a1a8a26d9?source=tags---------2",
            "subtitle": "Let’s take a little look back in time, through the history of Imgur. In 2009, Imgur started…",
            "title": "Design at Imgur: From 188 Colors to 12"
        },
        {
            "url": "https://medium.com/@raveeshbhalla/android-n-notifications-a-design-analysis-cec09f1cc5bf?source=tags---------3",
            "subtitle": "It’s been a few days since Google surprised us all by releasing a developer preview for Android N, more than two months before Google I/O. As ",
            "title": "Android N Notifications: A Design Analysis"
        },
        {
            "url": "https://medium.com/@shollmann/picasso-universal-image-loader-or-glide-that-s-the-question-af34fa7f5e63?source=tags---------4",
            "subtitle": "No subtitle",
            "title": "Error get title"
        },
        {
            "url": "https://medium.com/@cesarmcferreira/building-android-apps-30-things-that-experience-made-me-learn-the-hard-way-313680430bf9?source=tags---------5",
            "subtitle": "No subtitle",
            "title": "Building Android Apps — 30 things that experience made me learn the hard way"
        },
        {
            "url": "https://medium.com/@hitherejoe/android-n-introducing-picture-in-picture-for-android-tv-35f2392fb609?source=tags---------6",
            "subtitle": "Last week we saw the surprise release of the…",
            "title": "Android N: Introducing Picture-in-Picture for Android TV"
        },
        {
            "url": "https://medium.com/@nielsz/android-quality-with-mvp-espresso-junit-jacoco-and-sonarqube-3430d9ee4a4a?source=tags---------7",
            "subtitle": "In my ",
            "title": "Android Quality with MVP, Espresso, JUnit, JaCoCo and SonarQube"
        },
        {
            "url": "https://labs.ribot.co.uk/android-application-architecture-8b6e34acda65?source=tags---------8",
            "subtitle": "Our journey from standard Activities and AsyncTasks to a modern MVP-based architecture powered by RxJava.",
            "title": "Android Application Architecture"
        },
        {
            "url": "https://medium.com/@hamen/android-library-aar-and-javadoc-6859898cad28?source=tags---------9",
            "subtitle": "As an Android developer, I’m used to ask Android Studio/Intellij Idea for documentation constantly. I have even replaced the shortcut: now, it’s F1, the old-fashion help key. When I want to know about a method or a class, I hit F1 and the JavaDoc shows up: smooth. How does…",
            "title": "Android Library AAR and Javadoc"
        },
        {
            "url": "https://medium.com/@p.tournaris/rxjava-rxreplayingshare-emit-only-once-b19acd61b469?source=tags---------10",
            "subtitle": "Following up on my ",
            "title": "RxJava —RxReplayingShare, Emit only Once"
        }
    ]
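
The exported items include the spider's placeholder strings ("Error get title", "No subtitle") and URLs that end in a `?source=tags---------N` query string. Here is a rough sketch of a post-processing step using only the standard library. The function name `clean_items` and the decisions to drop untitled items and strip the whole query string are assumptions made for illustration, not something the original spider does.

```python
import json
from urllib.parse import urlsplit, urlunsplit

# Placeholder strings emitted by the spider when nothing matched.
TITLE_PLACEHOLDER = "Error get title"
SUBTITLE_PLACEHOLDER = "No subtitle"


def clean_items(items):
    """Drop items without a real title and normalize the remaining ones."""
    cleaned = []
    for item in items:
        if item["title"] == TITLE_PLACEHOLDER:
            continue  # the title lookup failed, so skip this post

        # Rebuild the URL without its query string and fragment.
        parts = urlsplit(item["url"])
        url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

        # Trim stray whitespace and turn the placeholder into None.
        subtitle = item["subtitle"].strip()
        if subtitle == SUBTITLE_PLACEHOLDER:
            subtitle = None

        cleaned.append({"title": item["title"], "subtitle": subtitle, "url": url})
    return cleaned


# Two items copied from the output above, used as sample input.
sample = [
    {
        "url": "https://medium.com/@shollmann/picasso-universal-image-loader-or-glide-that-s-the-question-af34fa7f5e63?source=tags---------4",
        "subtitle": "No subtitle",
        "title": "Error get title",
    },
    {
        "url": "https://labs.ribot.co.uk/android-application-architecture-8b6e34acda65?source=tags---------8",
        "subtitle": "Our journey from standard Activities and AsyncTasks to a modern MVP-based architecture powered by RxJava.",
        "title": "Android Application Architecture",
    },
]

result = clean_items(sample)
print(json.dumps(result, indent=4, ensure_ascii=False))
```

To apply it to the real export, load the file first, for example with `json.load(open("res.json", encoding="utf-8"))`, and pass the resulting list to `clean_items`.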

Source: https://hanks.pub