Counting Lines, Paragraphs, or Records in a File using pc_split_paragraphs() in PHP

By: David Sklar


You want to count the number of lines, paragraphs, or records in a file. To count lines, use fgets( ). Because it reads a line at a time, you can count the number of times it's called before reaching the end of a file:

$lines = 0;

if ($fh = fopen('orders.txt','r')) {
  while (! feof($fh)) {
    if (fgets($fh,1048576)) {
      $lines++;
    }
  }
}
print $lines;

To count paragraphs, increment the counter only when you read a blank line:

$paragraphs = 0;

if ($fh = fopen('great-american-novel.txt','r')) {
  while (! feof($fh)) {
    $s = fgets($fh,1048576);
    if (("\n" == $s) || ("\r\n" == $s)) {
      $paragraphs++;
    }
  }
}
print $paragraphs;

To count records, increment the counter only when the line read contains just the record separator and whitespace:

$records = 0;
$record_separator = '--end--';

if ($fh = fopen('great-american-novel.txt','r')) {
  while (! feof($fh)) {
    $s = rtrim(fgets($fh,1048576));
    if ($s == $record_separator) {
      $records++;
    }
  }
}
print $records;

In the line counter, $lines is incremented only if fgets( ) returns a true value. As fgets( ) moves through the file, it returns each line it retrieves. Once there is nothing left to read, it returns false, so $lines doesn't get incorrectly incremented. Because EOF has been reached on the file, feof( ) returns true, and the while loop ends.
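
The same loop can also be wrapped in a function that compares the return value of fgets( ) against false explicitly. That way a final line consisting only of "0" with no trailing newline, which PHP treats as a false value, is still counted. This is a minimal sketch: the function name count_lines( ) and the temporary file used to exercise it are illustrative assumptions, not part of the recipe above.

```php
<?php
// Hypothetical helper: return the number of lines in $path,
// or false if the file can't be opened.
function count_lines($path) {
    $fh = fopen($path, 'r');
    if (false === $fh) {
        return false;
    }
    $lines = 0;
    // fgets() returns false only when there is nothing left to read
    while (false !== fgets($fh)) {
        $lines++;
    }
    fclose($fh);
    return $lines;
}

// Illustrative usage with a throwaway file
$tmp = tempnam(sys_get_temp_dir(), 'cnt');
file_put_contents($tmp, "first\nsecond\n0");
print count_lines($tmp);   // counts 3 lines, including the final "0"
unlink($tmp);
```

Returning false on failure instead of printing lets the caller decide how to report a file that can't be opened.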

The paragraph counter works fine on simple text but may produce unexpected results when presented with a long string of blank lines (each blank line is counted as a paragraph) or a file without two consecutive linebreaks (no paragraph is counted at all). These problems can be remedied with functions based on preg_split( ). If the file is small and can be read into memory, use the pc_split_paragraphs( ) function shown in the example below. This function returns an array containing each paragraph in the file.

pc_split_paragraphs( )
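function pc_split_paragraphs($file,$rs="\r?\n") {
    // read the whole file into a single string
    $text = file_get_contents($file);
    // each delimiter is a paragraph followed by two or more linebreaks;
    // the captured paragraph is returned and empty pieces are dropped
    $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text,-1,
                          PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    return $matches;
}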
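
To see how the pattern behaves, the same preg_split( ) call can be tried on a string instead of a file. The sketch below is only an illustration: the function name split_paragraphs_from_string( ) and the sample text are assumptions made for this example, while the regular expression and flags are the ones used in pc_split_paragraphs( ).

```php
<?php
// Hypothetical variant of pc_split_paragraphs() that takes the text itself
function split_paragraphs_from_string($text, $rs = "\r?\n") {
    return preg_split("/(.*?$rs)(?:$rs)+/s", $text, -1,
                      PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Sample text: a run of blank lines, a Windows linebreak pair,
// and a last paragraph with no linebreak after it
$text = "First paragraph,\nsecond line.\n\n\n\n"
      . "Second paragraph.\r\n\r\n"
      . "Third, no trailing break.";

$paragraphs = split_paragraphs_from_string($text);
print count($paragraphs);   // 3
```

Each complete paragraph keeps one trailing linebreak because the first record separator is inside the capturing group. The run of blank lines after the first paragraph is consumed as a single delimiter, so it does not inflate the count.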

The contents of the file are broken on two or more consecutive linebreaks and returned in the $matches array. The default record-separation regular expression, \r?\n, matches both Windows and Unix linebreaks. If the file is too big to read into memory at once, use the pc_split_paragraphs_largefile( ) function shown in the example below, which reads the file in 4K chunks.

pc_split_paragraphs_largefile( )
function pc_split_paragraphs_largefile($file,$rs="\r?\n") {
    $unmatched_text = '';
    $paragraphs = array();

    $fh = fopen($file,'r') or die("Can't open $file");

    while(! feof($fh)) {
        // an empty string is a valid read, so test for false explicitly
        $s = fread($fh,4096);
        if (false === $s) {
            die("Can't read $file");
        }
        $text_to_split = $unmatched_text . $s;

        $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text_to_split,-1,
                              PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

        /* if the text doesn't end with two record separators, its last
         * piece is an incomplete paragraph: save it to prepend to the
         * next section that gets read */
        if ($matches && ! preg_match("/$rs$rs\$/",$text_to_split)) {
            $unmatched_text = array_pop($matches);
        } else {
            $unmatched_text = '';
        }

        $paragraphs = array_merge($paragraphs,$matches);
    }
    fclose($fh);

    /* after reading all sections, if there is a final chunk that doesn't
     * end with the record separator, count it as a paragraph */
    if ('' !== $unmatched_text) {
        $paragraphs[] = $unmatched_text;
    }
    return $paragraphs;
}

This function uses the same regular expression as pc_split_paragraphs( ) to split the file into paragraphs. When it finds a paragraph end in a chunk read from the file, it saves the rest of the text in the chunk in $unmatched_text and prepends it to the next chunk read. That way, the unmatched text becomes the beginning of the next paragraph in the file.
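
When only the number of paragraphs is needed, not their contents, a line-by-line pass can count each run of non-blank lines as one paragraph. This avoids both problems noted for the simple counter, since extra blank lines add nothing and a file with no blank lines still counts as one paragraph. The following is a minimal sketch under that definition of a paragraph; the function name count_paragraphs( ) and the sample file are hypothetical.

```php
<?php
// Hypothetical helper: count paragraphs (runs of non-blank lines) in $path
// without keeping them in memory. Returns false if the file can't be opened.
function count_paragraphs($path) {
    $fh = fopen($path, 'r');
    if (false === $fh) {
        return false;
    }
    $paragraphs = 0;
    $in_paragraph = false;
    while (false !== ($line = fgets($fh))) {
        // a line is blank if nothing is left once the linebreak is removed
        $blank = ('' === rtrim($line, "\r\n"));
        if (! $blank && ! $in_paragraph) {
            $paragraphs++;   // first line of a new paragraph
        }
        $in_paragraph = ! $blank;
    }
    fclose($fh);
    return $paragraphs;
}

// Illustrative usage with a throwaway file
$tmp = tempnam(sys_get_temp_dir(), 'par');
file_put_contents($tmp, "One.\n\n\n\nTwo.\nStill two.\r\n\r\nThree.");
print count_paragraphs($tmp);   // 3
unlink($tmp);
```

Because only a counter and a flag are kept between lines, memory use stays constant regardless of file size.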


