Simple webcrawler


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler implements Runnable {

    // URLs to be searched
    Vector<String> vectorToSearch = new Vector<String>();
    // URLs already searched
    Vector<String> vectorSearched = new Vector<String>();
    // URLs which match
    Vector<String> vectorMatches = new Vector<String>();

    String address = "http://www.bloomberg.com";

    public static void main(String[] args) {
        WebCrawler webCrawler = new WebCrawler();
        webCrawler.start();
    }

    public void start() {
        vectorToSearch.add(address);
        run();
    }

    public void run() {
        try {
            while (vectorToSearch.size() > 0) {
                System.out.println("Searching " + vectorToSearch.get(0));
                getAndParsePage(vectorToSearch.get(0));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void getAndParsePage(String add) throws IOException {
        try {
            URL url = new URL(add);
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String i;
            Vector<String> thisUrl = null;
            thisUrl = new Vector<String>();
            while (((i = in.readLine()) != null)) {
                thisUrl.add(i);
            }
            for (String input : thisUrl) {
                if (input.indexOf("href=\"http://") != -1) {
                    Pattern pattern = Pattern.compile("http://.*\"");
                    Matcher matcher = pattern.matcher(input);
                    if (matcher.find()) {
                        String result = matcher.group();
                        result = result.substring(0, result.indexOf("\""));
                        try {
                            URL urlLink = new URL(result);
                            result = urlLink.toString();
                            if (!result.equals(add))
                                if (!wasSearched(add))
                                    vectorToSearch.add(0, result);
                        } catch (MalformedURLException e) {
                        }
                    }
                }
            }
        } catch (Exception e) {
        }
        vectorToSearch.remove(add);
        vectorSearched.add(add);
        System.out.println("closing " + add);
    }

    private boolean wasSearched(String add) {
        for (String s : vectorSearched)
            if (s.equals(add)) return true;
        return false;
    }
}

The problem is that I search some URLs more than once.

Message was edited by: Kernel_77

Well, for one thing, this code is much less multithreaded than you desire. You do not implement a start method, only the run method. Your code should then look like this:

Thread t = new Thread(yourRunnableInstance);
t.start();
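To make the point above concrete, here is a minimal sketch of handing a Runnable to a Thread so that run() executes on a separate thread instead of being called directly. The class name CrawlTask, the helper runInBackground, and the thread name "crawler-1" are made up for illustration and are not part of the posted crawler.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the crawler: any Runnable is started the same way.
final class CrawlTask implements Runnable {
    final AtomicBoolean finished = new AtomicBoolean(false);
    volatile String ranOn;

    @Override
    public void run() {
        // Record which thread actually executed the work.
        ranOn = Thread.currentThread().getName();
        finished.set(true);
    }

    // Starts the task on its own thread and waits for it to complete.
    static CrawlTask runInBackground() {
        CrawlTask task = new CrawlTask();
        Thread t = new Thread(task, "crawler-1");
        t.start(); // schedules run() on the new thread; calling run() directly would stay on the caller's thread
        try {
            t.join(); // wait so the result can be inspected
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return task;
    }

    public static void main(String[] args) {
        CrawlTask task = runInBackground();
        System.out.println("run() executed on thread: " + task.ranOn);
    }
}
```

In the posted crawler, start() calls run() itself, so everything happens on the main thread even though the class implements Runnable.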

Also, Vector has a contains method you may be interested in using.
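As a rough sketch of that suggestion, the hand-written loop in wasSearched can be replaced with a single call to Vector.contains, which compares elements with equals(). The wrapper class SearchedList and the method markSearched are hypothetical names used only to keep the example self-contained.

```java
import java.util.Vector;

// Hypothetical wrapper around the "already searched" list from the post.
final class SearchedList {
    private final Vector<String> vectorSearched = new Vector<String>();

    void markSearched(String address) {
        vectorSearched.add(address);
    }

    // Same result as looping and calling equals() on each element.
    boolean wasSearched(String address) {
        return vectorSearched.contains(address);
    }

    public static void main(String[] args) {
        SearchedList list = new SearchedList();
        list.markSearched("http://www.bloomberg.com");
        System.out.println(list.wasSearched("http://www.bloomberg.com"));
        System.out.println(list.wasSearched("http://www.example.com"));
    }
}
```

Vector.contains is still a linear scan; for a large crawl a java.util.HashSet would make the lookup faster, at the cost of changing the collection type.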

kernel,

Pattern.compile("http://.*\"");

The .* is greedy, meaning that it'll keep matching as far along the line as it can, up to the last double quote... I suggest you try [^\"]*\" in place of .*\"

I can't see why it would search some URLs twice though... unless they're in the source page twice.

Keith

Message was edited by: corlettk
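A small sketch of the difference between the greedy pattern and the negated character class, using a made-up line of HTML with two links. The class GreedyDemo, its helper methods, and the example.com style addresses are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyDemo {
    // Made-up input: two links on the same line.
    static final String LINE =
        "<a href=\"http://a.example/x\">A</a> <a href=\"http://b.example/y\">B</a>";

    // Returns the first match of the regex in the line, or null if there is none.
    static String firstMatch(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group() : null;
    }

    // Collects every href target by looping over find() with a non-greedy-by-construction pattern.
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<String>();
        Matcher m = Pattern.compile("href=\"(http://[^\"]*)\"").matcher(line);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Greedy: runs on to the last double quote on the line.
        System.out.println(firstMatch("http://.*\"", LINE));
        // Negated class: stops at the first double quote after the URL.
        System.out.println(firstMatch("http://[^\"]*\"", LINE));
        System.out.println(extractLinks(LINE));
    }
}
```

Because the posted code calls matcher.find() only once per line, a second link on the same line is never queued; looping over find(), as extractLinks does here, is one way to pick up every link.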

Thanks guys, both pieces of advice are good (contains and the greedy match).

Even if the address appears twice in the page, it should still find it in vectorSearched after the first time, no?

Message was edited by: Kernel_77
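One possible explanation, which nobody in the thread confirms: in the posted code the test is wasSearched(add), where add is the page currently being parsed, not the link held in result. Since add only goes into vectorSearched after its page has been parsed, that test does not stop an already searched link from being queued again. The sketch below checks the discovered link itself against both lists; the class Frontier and the methods enqueueIfNew and markSearched are hypothetical names, not part of the original crawler.

```java
import java.util.Vector;

// Hypothetical helper holding the two lists from the post.
final class Frontier {
    final Vector<String> vectorToSearch = new Vector<String>();
    final Vector<String> vectorSearched = new Vector<String>();

    // Queue a discovered link only if it was neither searched already nor is waiting to be searched.
    boolean enqueueIfNew(String link) {
        if (vectorSearched.contains(link) || vectorToSearch.contains(link)) {
            return false;
        }
        vectorToSearch.add(0, link);
        return true;
    }

    // Move a page from the pending list to the searched list.
    void markSearched(String page) {
        vectorToSearch.remove(page); // remove(Object) drops the first occurrence only
        vectorSearched.add(page);
    }

    public static void main(String[] args) {
        Frontier f = new Frontier();
        System.out.println(f.enqueueIfNew("http://a.example/")); // first sighting
        System.out.println(f.enqueueIfNew("http://a.example/")); // already pending
        f.markSearched("http://a.example/");
        System.out.println(f.enqueueIfNew("http://a.example/")); // already searched
    }
}
```

Checking the pending list as well matters because Vector.remove(Object) removes only the first occurrence, so a link queued twice would otherwise remain in vectorToSearch after its first visit.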

