New To Java

Simple webcrawler

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler implements Runnable {

    // URLs to be searched
    Vector<String> vectorToSearch = new Vector<String>();
    // URLs already searched
    Vector<String> vectorSearched = new Vector<String>();
    // URLs which match
    Vector<String> vectorMatches = new Vector<String>();

    String address = "";

    public static void main(String[] args) {
        WebCrawler webCrawler = new WebCrawler();
        webCrawler.start();
    }

    public void start() {
        vectorToSearch.add(address);
        run();
    }

    public void run() {
        try {
            while (vectorToSearch.size() > 0) {
                System.out.println("Searching " + vectorToSearch.get(0));
                getAndParsePage(vectorToSearch.get(0));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void getAndParsePage(String add) throws IOException {
        try {
            URL url = new URL(add);
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String i;
            // Read the whole page into memory, one line per element
            Vector<String> thisUrl = new Vector<String>();
            while ((i = in.readLine()) != null) {
                thisUrl.add(i);
            }
            for (String input : thisUrl) {
                if (input.indexOf("href=\"http://") != -1) {
                    Pattern pattern = Pattern.compile("http://.*\"");
                    Matcher matcher = pattern.matcher(input);
                    if (matcher.find()) {
                        String result = matcher.group();
                        // Cut the match at the first closing quote
                        result = result.substring(0, result.indexOf("\""));
                        try {
                            URL urlLink = new URL(result);
                            result = urlLink.toString();
                            if (!result.equals(add))
                                if (!wasSearched(add))
                                    vectorToSearch.add(0, result);
                        } catch (MalformedURLException e) {

                        }
                    }
                }
            }
        } catch (Exception e) {

        }
        vectorToSearch.remove(add);
        vectorSearched.add(add);
        System.out.println("closing " + add);
    }

    private boolean wasSearched(String add) {
        for (String s : vectorSearched)
            if (s.equals(add)) return true;
        return false;
    }
}

The problem is that I search some URLs more than once.

Message was edited by: Kernel_77

Well, for one thing, this code is much less multithreaded than you desire. You do not implement a start method, only the run method. Then your code should look like this:

Thread t = new Thread(yourRunnableInstance);
t.start();
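To make the difference concrete: in the posted class, start() calls run() itself, so the crawl happens on the calling thread. Below is a minimal sketch contrasting that with Thread.start(). The class name CrawlThreadDemo, its helper methods, and the thread name "crawler-1" are made up for illustration and are not part of the original code.

```java
class CrawlThreadDemo implements Runnable {
    // Name of the thread that last executed run(); volatile so the caller sees the write.
    private volatile String ranOn;

    @Override
    public void run() {
        ranOn = Thread.currentThread().getName();
    }

    // Calls run() directly, as the posted start() does: no new thread is involved.
    static String runDirectly() {
        CrawlThreadDemo task = new CrawlThreadDemo();
        task.run();
        return task.ranOn;
    }

    // Hands the Runnable to a Thread; Thread.start() invokes run() on the new thread.
    static String runOnNewThread(String threadName) {
        CrawlThreadDemo task = new CrawlThreadDemo();
        Thread t = new Thread(task, threadName);
        t.start();
        try {
            t.join(); // wait for the worker so its result can be read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task.ranOn;
    }

    public static void main(String[] args) {
        System.out.println("direct call ran on: " + runDirectly());
        System.out.println("Thread.start() ran on: " + runOnNewThread("crawler-1"));
    }
}
```

Note that once several threads share the crawler's lists, access to them would need coordinating as well; this sketch does not attempt that.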

Also Vector has a contains method you may be interested in using.
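As a rough sketch of that suggestion, the hand-written loop in wasSearched could be replaced by a call to Vector.contains, which compares elements with equals just as the loop does. The class name SearchedUrls, the markSearched method, and the example.com addresses are hypothetical.

```java
import java.util.Vector;

class SearchedUrls {
    // URLs already searched, as in the posted crawler
    private final Vector<String> vectorSearched = new Vector<String>();

    // Records a URL once; contains() keeps duplicates out of the list.
    void markSearched(String url) {
        if (!vectorSearched.contains(url)) {
            vectorSearched.add(url);
        }
    }

    // Same result as the posted for-loop, in a single call.
    boolean wasSearched(String url) {
        return vectorSearched.contains(url);
    }

    int size() {
        return vectorSearched.size();
    }

    public static void main(String[] args) {
        SearchedUrls searched = new SearchedUrls();
        searched.markSearched("http://example.com/");
        searched.markSearched("http://example.com/");
        System.out.println(searched.wasSearched("http://example.com/"));
        System.out.println(searched.size());
    }
}
```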

kernel,

Pattern.compile("http://.*\"");

The .* is greedy, meaning that it'll keep matching up to the last quote on the line... suggest you try [^\"]*\" in place of .*\"

Can't see why it would search some URLs twice though... unless they're in the source page twice.

keith/

Message was edited by: corlettk
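Here is a small sketch of the two patterns side by side. The bounded pattern uses [^\"]* so a match stops at the first closing quote, and looping on find() picks up every link on a line rather than only the first. The class name LinkExtractor, the method names, and the sample addresses are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // The pattern from the original post: .* runs on to the last quote on the line.
    private static final Pattern GREEDY = Pattern.compile("http://.*\"");
    // Suggested alternative: [^"]* cannot cross a quote, so each match is one URL.
    private static final Pattern BOUNDED = Pattern.compile("href=\"(http://[^\"]*)\"");

    // Returns the first greedy match, or null if the line has none.
    static String firstGreedy(String line) {
        Matcher m = GREEDY.matcher(line);
        return m.find() ? m.group() : null;
    }

    // Returns every href URL on the line, in order of appearance.
    static List<String> links(String line) {
        List<String> found = new ArrayList<String>();
        Matcher m = BOUNDED.matcher(line);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the URL without the quotes
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://a.example/\">A</a> <a href=\"http://b.example/\">B</a>";
        System.out.println(firstGreedy(line));
        System.out.println(links(line));
    }
}
```

With the greedy pattern, a line holding two links yields a single match spanning both, so the second link is never queued on its own.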

Thanks guys, both pieces of advice are good (contains and the greedy match). Even if the address appears twice in the page, it should still find it in vectorSearched after the first time, no?

Message was edited by: Kernel_77
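Going by the posted code, not necessarily. The check inside the loop is wasSearched(add), and add is the page currently being parsed, not the link just found (result). That page is only added to vectorSearched after the loop ends, so the check does not filter the new link at all. A link that is already waiting in vectorToSearch is not in vectorSearched either, so it can be queued again. One possible way around both issues, sketched under those assumptions, is to remember every URL at the moment it is queued. The class name Frontier and its methods are invented for this example, and it swaps the Vectors for a HashSet and an ArrayDeque.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class Frontier {
    // URLs waiting to be fetched, oldest first
    private final Deque<String> toSearch = new ArrayDeque<String>();
    // Every URL ever queued, whether or not it has been fetched yet
    private final Set<String> seen = new HashSet<String>();

    // Queues the URL only if it has never been queued before.
    boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // already waiting or already searched
        }
        toSearch.addLast(url);
        return true;
    }

    // Returns the next URL to fetch, or null when nothing is left.
    String next() {
        return toSearch.pollFirst();
    }

    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        frontier.offer("http://a.example/");
        frontier.offer("http://b.example/");
        frontier.offer("http://a.example/"); // duplicate, ignored
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            System.out.println("Searching " + url);
        }
    }
}
```

The same effect could be had in the original class by testing the found link against both vectors before adding it; the set simply makes that a single lookup.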
