Trying to scrape data from HTML code using Python


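I have been making a web scraper for a website and want to extract the node numbers from an HTML table using .findall or whatever else works. I am struggling with it and keep getting errors, probably because I am not putting in the right tags.

Can anyone help? The HTML code is as follows: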

</div>  <table class="datatable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_contentplaceholder1_dgnodes" style="border-collapse:collapse;">     <tr class="header nobreak">         <td>&nbsp;</td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl00','')">node name</a></td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl01','')">description</a></td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl02','')">mac address</a></td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl03','')"></a>                 <a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$linoderoleheader','')" id="ctl00_contentplaceholder1_dgnodes_ctl00_linoderoleheader">node role</a>             </td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl04','')">firmware</a></td><td>                 <a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$lbuptimeheader','')" id="ctl00_contentplaceholder1_dgnodes_ctl00_lbuptimeheader">uptime</a>             </td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl05','')">users</a></td>     </tr><tr onmouseover="this.classname = 'highlightedrow';" onmouseout="this.classname = 'normalrow';" onclick="gotonodepage('522');" style="height:18px;"> 

I need to extract the number 522 on the last line of the code, along with the other gotonodepage numbers, but I can't figure out how. Any help is appreciated. I want to put the extracted numbers in a list for later use.

r2 = s2.get(webpage)
bsobjswap = BeautifulSoup(r2.content)

listy = []
# collect every <tr> that carries an onclick attribute
for link in bsobjswap.findall('tr'):
    if 'onclick' in link.attrs:
        listy.append(link)
print(listy)

This gives the error:

    for link in bsobjswap.findall('tr'):
TypeError: 'NoneType' object is not callable

Try this:

from bs4 import BeautifulSoup

xml = """<table class="datatable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_contentplaceholder1_dgnodes" style="border-collapse:collapse;">
    <tr class="header nobreak">
        <td>&nbsp;</td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl00','')">node name</a></td>
        <td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl01','')">description</a></td>
        <td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl02','')">mac address</a></td>
        <td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl03','')"></a>
            <a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$linoderoleheader','')" id="ctl00_contentplaceholder1_dgnodes_ctl00_linoderoleheader">node role</a>
        </td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl04','')">firmware</a></td><td>
            <a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$lbuptimeheader','')" id="ctl00_contentplaceholder1_dgnodes_ctl00_lbuptimeheader">uptime</a>
        </td><td><a href="javascript:__dopostback('ctl00$contentplaceholder1$dgnodes$ctl00$ctl05','')">users</a></td>
    </tr><tr onmouseover="this.classname = 'highlightedrow';" onmouseout="this.classname = 'normalrow';" onclick="gotonodepage('522');" style="height:18px;">"""

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(xml, "html.parser")

# The method is find_all (findAll in older code). soup.findall is not a method:
# bs4 treats it as a lookup for a <findall> tag, gets None, and calling None
# raises the TypeError from the question.
# attrs={'onclick': True} keeps only the rows that have an onclick attribute.
print([i.get('onclick') for i in soup.find_all('tr', attrs={'onclick': True})])

This returns ["gotonodepage('522');"]

From here you can extract the number with a regex, for example:

import re

# r"\d+" matches each run of digits inside the onclick value
print([re.findall(r"\d+", i.get('onclick')) for i in soup.find_all('tr', attrs={'onclick': True})])

This returns [['522']]
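
Since the question asks for a plain list of numbers for later use, here is a rough sketch of one way to get a flat list of integers instead of a list of lists. It deliberately uses only the standard library (html.parser and re) rather than bs4, so it is an alternative and not the code above. The names OnclickCollector and extract_node_ids, the sample markup, and the second id 731 are made up for illustration. The pattern is case-insensitive on the assumption that the real page may spell the function with different capitalisation.

```python
import re
from html.parser import HTMLParser


class OnclickCollector(HTMLParser):
    """Collect the onclick attribute of every <tr> that has one."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "tr":
            value = dict(attrs).get("onclick")
            if value:
                self.onclicks.append(value)


def extract_node_ids(html):
    """Return every number passed to gotonodepage(...) as a flat list of ints."""
    collector = OnclickCollector()
    collector.feed(html)
    # Capture only the digits inside gotonodepage('...').
    pattern = re.compile(r"gotonodepage\('(\d+)'\)", re.IGNORECASE)
    ids = []
    for onclick in collector.onclicks:
        ids.extend(int(n) for n in pattern.findall(onclick))
    return ids


# Made-up sample: the 522 row mirrors the question, the 731 row is invented.
sample = """
<table>
  <tr class="header nobreak"><td>node name</td></tr>
  <tr onclick="gotonodepage('522');" style="height:18px;"><td>a</td></tr>
  <tr onclick="gotonodepage('731');" style="height:18px;"><td>b</td></tr>
</table>
"""
node_ids = extract_node_ids(sample)
print(node_ids)  # [522, 731]
```

Matching on the function name rather than on any run of digits avoids picking up unrelated numbers if a row's onclick ever contains more than the node id.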

