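I have recently been looking into the open-source flight simulator FlightGear and wanted to download all of its world scenery. Downloading the packages by hand, one at a time, is too tedious, so the plan is to use Python's requests and Beautiful Soup libraries to collect the scenery links, write them to a text file one link per line, and then let axel download each scenery package automatically over multiple connections.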

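Background

The requests library

Requests is an HTTP library for Python, released under the Apache 2 license. The project's goal is to make HTTP requests simpler and more human-friendly.

Here is some sample code (the u'' prefixes in the output come from Python 2; Python 3 prints plain strings):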

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text # doctest: +ELLIPSIS
u'{"type":"User"...'
>>> r.json() # doctest: +ELLIPSIS
{u'private_gists': 419, u'total_private_repos': 77, ...}

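The Beautiful Soup library

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

A Tag object corresponds to a tag in the original XML or HTML document. Its two most important properties are its name and its attributes.
A BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no attributes and no real tag name (its .name is the special value "[document]").
Beautiful Soup uses the NavigableString class to wrap the text inside a tag.
A Comment object is a special type of NavigableString that holds an HTML or XML comment.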
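
The four object kinds above can be seen in a few lines. This is a minimal sketch, assuming bs4 is installed and using the standard library's html.parser; the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup containing a tag, a text string, and a comment
markup = '<p class="title"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                  # Tag
text = soup.b.string          # NavigableString
comment = soup.p.contents[1]  # Comment (a NavigableString subclass)

print(type(soup).__name__)     # BeautifulSoup
print(tag.name, tag.attrs)     # p {'class': ['title']}
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(isinstance(comment, NavigableString))  # True
```

Note that class is a multi-valued attribute, so Beautiful Soup returns it as a list.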

Here is a simple Beautiful Soup example.

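from bs4 import BeautifulSoup

# Sample document (the "three sisters" snippet used in the Beautiful Soup docs)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">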
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

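The axel tool

axel is a handy high-speed HTTP/FTP download tool for Linux. It supports multi-connection downloads and resuming interrupted downloads, and it can fetch the same file from several addresses or over several connections to a single address. It is useful for speeding up downloads when a single connection is slow.

axel syntax:

axel [options] url1 [url2] [url...]

axel options:

--max-speed=x, -s x          Maximum speed x
--num-connections=x, -n x    Number of connections x
--output=f, -o f             Save to local file f
--search[=x], -S [x]         Search for mirrors
--header=x, -H x             Add header string x (sets an HTTP header)
--user-agent=x, -U x         Set the user agent (HTTP User-Agent)
--no-proxy, -N               Do not use a proxy server
--quiet, -q                  Quiet mode
--verbose, -v                More status information
--alternate, -a              Alternate progress indicator
--help, -h                   Help
--version, -V                Version information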
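
To show how these options fit together, here is a small sketch that assembles an axel command line from Python without running it. The helper name build_axel_command and the example URL are made up for illustration; only the -n, -o, -s, and -q options from the table above are used.

```python
import shlex

def build_axel_command(url, connections=10, output=None, max_speed=None, quiet=False):
    """Return the argument list for one axel invocation (hypothetical helper)."""
    cmd = ["axel", "-n", str(connections)]
    if output is not None:
        cmd += ["-o", output]          # save to this local file
    if max_speed is not None:
        cmd += ["-s", str(max_speed)]  # cap the download speed
    if quiet:
        cmd.append("-q")               # quiet mode
    cmd.append(url)
    return cmd

# The URL below is a placeholder, not a real scenery package
cmd = build_axel_command("http://example.com/w130n30.tgz", output="w130n30.tgz")
print(shlex.join(cmd))
# axel -n 10 -o w130n30.tgz http://example.com/w130n30.tgz
```

Building the command as a list rather than a single string means it can be passed to subprocess.run as-is, with no shell quoting issues.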

Implementation

Getting the scenery links

import sys

import requests
from bs4 import BeautifulSoup

# Download the page at the given URI; return its text, or None on failure
def getUriContent(uri):
    try:
        fgWorldScenery = requests.get(uri, timeout=10)
        # Turn 4xx/5xx responses into HTTPError
        fgWorldScenery.raise_for_status()
    except requests.exceptions.HTTPError as errh:
        print("Http Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("Oops: Something Else", err)
    else:
        print("Http Request Success!")
        return fgWorldScenery.text
    return None

# Collect every scenery link into a list
def getTargetLinks(html):
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    #print(soup.prettify())
    # The scenery links are the href attributes of the <area> tags
    for child in soup.find_all('area', href=True):
        links.append(child['href'])
    print("Get Target Links Success!")
    return links

# Write the scenery links to a file, one per line
def writeToFile(linkList):
    with open('./sceneLink.txt', 'w') as file:
        for link in linkList:
            file.write(link)
            file.write('\n')
    print("Write Target Link To File Success!")

if __name__ == "__main__":
    link = "http://www.flightgear.org/legacy-Downloads/scenery-v2.12.html"
    html = getUriContent(link)
    if html is None:
        # The request failed; the error has already been printed
        sys.exit(1)
    linkList = getTargetLinks(html)
    writeToFile(linkList)
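
One thing the script does not handle is relative or repeated href values. Whether the FlightGear page actually contains either is not checked here, so treat the following as a sketch under that assumption: it resolves each href against the page URL with the standard library's urljoin and drops duplicates. The helper name normalize_links and the sample file names are made up.

```python
from urllib.parse import urljoin

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

base = "http://www.flightgear.org/legacy-Downloads/scenery-v2.12.html"
# Made-up sample: one relative link, one absolute link, one duplicate
sample = ["w130n30.tgz", "http://mirror.example.com/w120n30.tgz", "w130n30.tgz"]
for url in normalize_links(base, sample):
    print(url)
# http://www.flightgear.org/legacy-Downloads/w130n30.tgz
# http://mirror.example.com/w120n30.tgz
```

The list returned by getTargetLinks could be passed through such a helper before it is written to sceneLink.txt.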

Downloading the world scenery

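# Read one link per line; read strips surrounding whitespace
while read -r line
do
    # Skip blank lines; quote the URL so the shell does not split or glob it
    [ -n "$line" ] && axel -n 10 "$line"
done < sceneLink.txt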
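
The same loop can also be driven from Python, which makes it easy to record which downloads failed. This is only a sketch: the function names are made up, it assumes axel is on the PATH and exits with a non-zero status when a download fails, and the run parameter exists so the loop can be tried with a stand-in instead of the real subprocess.run.

```python
import subprocess

def read_links(path):
    """Return the non-empty, whitespace-stripped lines of the link file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def download_all(path, connections=10, run=subprocess.run):
    """Call axel once per link and return the links whose download failed."""
    failed = []
    for url in read_links(path):
        result = run(["axel", "-n", str(connections), url])
        if result.returncode != 0:
            failed.append(url)
    return failed

# Typical use, once sceneLink.txt exists:
# failed = download_all("sceneLink.txt")
# print("Failed downloads:", failed)
```

Returning the failed links instead of stopping at the first error lets the remaining packages download, and the failures can be retried afterwards.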
