URL verification

While working on a project, I needed to verify that a user-submitted URL was actually pointing to a valid page.

I found and slightly modified a code snippet that did just that using cURL. It's pretty simple and seemed quite effective for a long time.  It just requests the header for the page and verifies that there was indeed a response.

function valid_url($p_url)
{
    if (!substr_count($p_url, '://'))
        $p_url = 'http://' . $p_url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $p_url);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // get the header
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // and *only* get the header
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);  // don't use a cached version of the url

    $result = curl_exec($ch);
    curl_close($ch); // free the handle rather than leaking it

    if (!$result)
        return false;
    else
        return $p_url;
}

Then today, a particular URL came along that caused this function to choke, slowing the page to a crawl and finally timing out after a long wait. While debugging and waiting for responses, I discovered that the page was coming back with an empty response.
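
In hindsight, that kind of hang can at least be bounded. These two options aren't part of the function above, just a safeguard I could have reached for:

// Hypothetical safeguard, not in the original function: cap how long
// cURL may spend, so one unresponsive site can't hang the whole page.
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // max seconds to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // max seconds for the entire request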

Further examination yielded an HTTP response code of "0". Head-scratching stuff, but now I was on the right track. When I made the same request from a regular browser, everything seemed OK, so the difference had to be in the request headers.
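
For reference, this is roughly how I was poking at the responses. curl_getinfo() and curl_error() are the quickest way to see what cURL actually got back (the URL here is just a placeholder):

// A minimal debugging sketch.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$response = curl_exec($ch);
// CURLINFO_HTTP_CODE is 0 when no valid HTTP response was received at all.
echo 'HTTP code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
// curl_error() explains why a request failed, or is empty on success.
echo 'cURL error: ' . curl_error($ch) . "\n";
var_dump($response); // false, or an empty string, when nothing useful came back

curl_close($ch);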

That turned out to be the answer. Simply supplying a user agent with the cURL request was enough to trick this page into thinking it was responding to a browser. I guess they were trying to be sneaky.

Edit: I had another, related issue that tipped me off that this is bad form. Although I'm only using this to grab the occasional header to verify that a page exists, I could still be classified as a malicious bot and blacklisted. Definitely not cool. The solution is to make your own user agent that clearly gives your site or application's name and web address, e.g. User-Agent: MyApplication (+http://example.com/application-home/)

With that line added, plus a new check at the bottom for error status codes like 404 or 503, I now have the following function, which is working like a champ.

function valid_url($p_url)
{
    if (!substr_count($p_url, '://'))
        $p_url = 'http://' . $p_url;

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $p_url);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // get the header
    curl_setopt($ch, CURLOPT_NOBODY, 1);         // and *only* get the header
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // get the response as a string from curl_exec(), rather than echoing it
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);  // don't use a cached version of the url
    // Some sites will not respond without a user agent.
    // Make sure to clearly identify your application and website, especially
    // if making large numbers of requests.
    // Note: CURLOPT_USERAGENT takes just the value; cURL supplies the
    // "User-Agent:" header name itself.
    curl_setopt($ch, CURLOPT_USERAGENT,
                'MyApplication (+http://example.com/application-home/)');

    $result = curl_exec($ch);
    $code   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Reject failed requests and error status codes (4xx and 5xx).
    return ($result && $code < 400) ? $p_url : false;
}
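
Typical usage looks something like this (the calling code and messages are just illustration, not part of the function):

// Example: validating a user-submitted URL before storing it.
$url = valid_url('example.com/some-page');
if ($url === false) {
    echo 'That URL does not appear to point to a valid page.';
} else {
    echo "Verified: $url"; // the http:// prefix is added if it was missing
}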
