Problem scraping a file's URL in Python

lanzrg · 26/03/2013, 11:26

Hello,

My father is very attached to Windows, mainly for games.
And since he keeps downloading junk, toolbars, and so on,
I plan to write an automatic install script for Windows.
But for that I first need to be able to fetch the latest version of each program.

So I started writing a Python function to download from FileHippo.

Here is what I have so far:

#!/usr/bin/python

# imports

import urllib
from bs4 import BeautifulSoup

def FHDownload(app):
	
	# grab first page html
	source = BeautifulSoup(urllib.urlopen('http://www.filehippo.com/download_' + app + '/').read())

	# processing to find the next page url
	container = source.find('div', attrs={'id':'dlbox'})
	a = container.find('a')
	url = "http://www.filehippo.com/" + a['href']

	# grab second page html
	source = BeautifulSoup(urllib.urlopen(url).read())

	# processing to find the app download url
	container = source.find('td', attrs={'class':'img'})
	a = container.find('a')
	download = "http://www.filehippo.com/" + a['href']

	# return download url
	return download


url = FHDownload('firefox')
print(url)
urllib.urlretrieve(url, "firefox.exe")

The problem is that I would like to get the name of the downloaded file, the way browsers do.

The returned string looks like this:

http://www.filehippo.com//download/file/778be0fe30d10fcf9f869c8ffadd3f1b6c430c179624ea96ea26e58585b0e70d/

If I open this URL in a browser and inspect it with the Network tab in Chromium, I can see that it redirects to another page.
I end up with a URL like this:

http://fs33.filehippo.com/3290/0c4cfb998b66473ba1292d6ed807c818/Firefox%20Setup%2020.0b6.exe

This URL would suit me better for getting the name of the downloaded file: a bit of simple string processing and it's done.
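
The string processing on that redirected URL could look something like the sketch below. It assumes Python 3, where the decoding functions live in `urllib.parse`; `filename_from_url` is a hypothetical helper name used only for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the decoded last path component of a URL (hypothetical helper)."""
    # Keep only the path, so a query string or fragment never ends up in the name.
    path = urlsplit(url).path
    # Decode percent-escapes such as %20, then keep what follows the last slash.
    return posixpath.basename(unquote(path))


url = "http://fs33.filehippo.com/3290/0c4cfb998b66473ba1292d6ed807c818/Firefox%20Setup%2020.0b6.exe"
print(filename_from_url(url))  # Firefox Setup 20.0b6.exe
```

`posixpath` is used rather than `os.path` so that the split is always on forward slashes, whatever system the script runs on.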

But how can I get this URL in Python? How would you do it?

Thanks in advance

Last edited by lanzrg (26/03/2013, 11:28)

lanzrg · 26/03/2013, 13:53

OK, I found a solution, but I am not sure how robust the code is,
in particular if the site is unavailable or if its layout changes.
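
One way to guard against both failure cases is sketched below, assuming Python 3. The helper names `fetch_html` and `require`, the `opener` parameter and the 10-second timeout are all hypothetical choices made for illustration; they are not part of urllib or BeautifulSoup.

```python
import urllib.request


def fetch_html(url, timeout=10, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the site cannot be reached."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("Could not fetch %s: %s" % (url, exc))
        return None


def require(element, description):
    """Return element unchanged, or fail loudly when a lookup returned None."""
    if element is None:
        raise LookupError("page layout may have changed: %s not found" % description)
    return element
```

Each `find()` result could then be wrapped, for example `require(source.find('div', attrs={'id': 'dlbox'}), 'div#dlbox')`, so that a change in the site's markup produces a readable error instead of a `TypeError` on `None`.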

#!/usr/bin/python

# imports

import urllib
import os
from bs4 import BeautifulSoup

def FHDownload(app):
	
	# grab first page html
	source = BeautifulSoup(urllib.urlopen('http://www.filehippo.com/download_' + app + '/').read())

	# processing to find the next page url
	container = source.find('div', attrs={'id':'dlbox'})
	a = container.find('a')
	url = "http://www.filehippo.com/" + a['href']

	# grab second page html
	source = BeautifulSoup(urllib.urlopen(url).read())

	# processing to find the redirection url
	container = source.find('td', attrs={'class':'img'})
	a = container.find('a')
	url = "http://www.filehippo.com/" + a['href']
	
	# processing to find the redirected url
	download = urllib.urlopen(url).geturl()

	# return download url
	return download


def FirefoxInstall():

	# downloading
	download = FHDownload('firefox')
	filename = os.path.basename(urllib.unquote(download).decode('utf8'))
	urllib.urlretrieve(download, filename)

	# silent install
	os.system("\"" + filename + "\"" + " /S")

	# remove file
	os.remove(filename)


FirefoxInstall()

Feel free to give your opinion; I have not been writing Python for very long.
I have just read a bit about object orientation, classes, and so on,
but I have trouble seeing how I could rewrite this script in an object-oriented style.
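
As one possible starting point, an object-oriented version could group everything that concerns a single application into a class. This is only a sketch, assuming Python 3; the class name `Installer`, its attributes and the `resolver` parameter are hypothetical, and the `/S` default simply mirrors the silent switch used in the script above.

```python
import os
import posixpath
import subprocess
import urllib.parse
import urllib.request


class Installer:
    """One application to download, install silently, then clean up."""

    def __init__(self, app, resolver, silent_flag="/S"):
        self.app = app                  # name used on the download site, e.g. 'firefox'
        self.resolver = resolver        # callable: app name -> final download URL
        self.silent_flag = silent_flag  # switch passed to the installer

    def download_url(self):
        return self.resolver(self.app)

    @staticmethod
    def filename(url):
        # Decode the URL path and keep its last component.
        path = urllib.parse.urlsplit(url).path
        return posixpath.basename(urllib.parse.unquote(path))

    def run(self):
        url = self.download_url()
        filename = self.filename(url)
        urllib.request.urlretrieve(url, filename)
        try:
            # Passing a list avoids quoting the file name by hand.
            subprocess.call([os.path.abspath(filename), self.silent_flag])
        finally:
            # Remove the installer even if it failed to run.
            os.remove(filename)
```

With a function such as `FHDownload` passed as the resolver, installing a program would come down to `Installer('firefox', FHDownload).run()`, and a program that needs a different silent switch would only change the `silent_flag` argument.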

Last edited by lanzrg (26/03/2013, 14:54)
