python - BeautifulSoup 'href' list that is giving ambiguous TypeErrors?
I'm using BeautifulSoup to scrape URLs from a webpage. It was going well, until some of the URLs had non-ASCII characters in them.
import requests
from bs4 import BeautifulSoup  # assuming bs4; for BeautifulSoup 3 it would be: from BeautifulSoup import BeautifulSoup

req = requests.get('http://www.reddit.com')
soup = BeautifulSoup(req.content)
urls = [i.get('href') for i in soup.findAll('a') if 'keyword' in str(i.get('href'))]
The list comprehension returns a UnicodeError.
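The error itself is easy to reproduce in isolation: under Python 2, calling str() on a unicode string that contains non-ASCII characters attempts an implicit ASCII encode. A minimal sketch (the href value here is made up):

# Python 2: str() implicitly encodes a unicode object as ASCII
href = u'/r/espa\xf1ol'          # hypothetical href containing a non-ASCII character
print 'keyword' in str(href)     # raises UnicodeEncodeError, a subclass of UnicodeError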
So I thought to separate the list comprehension into 2 parts instead:
urls = [i.get('href') for i in soup.findAll('a')]
urls = [i.encode('utf-8') for i in urls]
This is when I got an AttributeError, saying the items are NoneType.
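The wording of that error is a hint that .encode is being called on something that is not a string; for example, if a tag has no href attribute at all, i.get('href') returns None:

value = None             # what i.get('href') returns for an <a> tag with no href attribute
value.encode('utf-8')    # AttributeError: 'NoneType' object has no attribute 'encode'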
I checked their type:
print [type(i) for i in urls]
which showed unicode types. It seems they are None and unicode at the same time.
You must have missed a None value. I checked www.reddit.com and, sure enough, there's:
<a name="content"></a>
Its href is None. Instead of printing the values and searching for None manually, do:
urls = [(i, i.get('href')) for i in soup.findAll('a')]
print [u for u in urls if u[1] is None]
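Putting it together, one way to avoid both errors is to drop anchors that have no href before filtering, and to keep the comparison in unicode instead of forcing str(). A sketch, assuming bs4 (where findAll still works as a legacy alias for find_all) and the original 'keyword' filter:

import requests
from bs4 import BeautifulSoup

req = requests.get('http://www.reddit.com')
soup = BeautifulSoup(req.content)

# Skip anchors with no href at all, then filter and encode the rest
hrefs = [a.get('href') for a in soup.findAll('a') if a.get('href') is not None]
urls = [h for h in hrefs if u'keyword' in h]    # compare as unicode; no str() call needed
encoded = [h.encode('utf-8') for h in urls]     # safe now: no None values remain

Encoding to UTF-8 is only needed when you actually output the values; keeping them as unicode while filtering sidesteps the implicit ASCII conversion that raised the original UnicodeError.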