Overview

Accuracy calculation for data extraction involves comparing extracted data against expected data, taking into account both the content and structure of the data. This process is relatively straightforward for scalar values, but more complex for arrays due to potential mismatches in row ordering and count.

Array Accuracy Calculation

For array data (tables, lists, etc.), accuracy is calculated as:

Accuracy = (Number of Correct Cells / max(Total Expected Cells, Total Extracted Cells)) × 100%

The denominator is the greater of:

  • Total number of expected cells
  • Total number of extracted cells

This ensures that both over-extraction and under-extraction are properly penalized in the accuracy calculation.
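As a minimal sketch of the formula above, the function below takes the three cell counts and returns a percentage. The name array_accuracy and the handling of two empty arrays are assumptions made for illustration, not part of the documented method.

```python
def array_accuracy(correct_cells: int, expected_cells: int, extracted_cells: int) -> float:
    """Return accuracy as a percentage, dividing by the larger cell count."""
    denominator = max(expected_cells, extracted_cells)
    if denominator == 0:
        # Assumption: two empty arrays are treated as a perfect match.
        return 100.0
    return correct_cells / denominator * 100
```

Because the denominator is the larger count, extracting too few rows leaves expected cells unmatched, and extracting too many inflates the denominator; either way the result drops below 100%.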

Handling Row Mismatches

The accuracy calculation becomes more complex when the extracted and expected arrays differ in row order or row count. Consider the following scenario:

Example: Array Row Order Mismatch

Extracted Array:

Row 1: {a, b, c}
Row 2: {d, e, f}

Expected Array:

Row 1: {d, e, f}
Row 2: {a, b, c}
Row 3: {g, h, i}

In this case:

  1. The algorithm pairs each extracted row with the expected row it matches, regardless of position
  2. Missing or extra rows are counted as incorrect cells
  3. The total number of cells is the greater of the expected and extracted cell counts (here, the expected array is larger)

Row Pairing Process

The algorithm creates the following pairings:

  1. Extracted Row 1 ↔ Expected Row 2
  2. Extracted Row 2 ↔ Expected Row 1
  3. null ↔ Expected Row 3
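The source does not specify how rows are paired, so the following is only one plausible sketch: a greedy strategy that gives each extracted row the unused expected row sharing the most cell values. The names matching_cells and pair_rows are hypothetical, and indices are zero-based, with None standing in for a missing row.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[str]


def matching_cells(a: Row, b: Row) -> int:
    """Count the column positions where two rows hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def pair_rows(
    extracted: List[Row], expected: List[Row]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """Greedily pair rows; returns (extracted_index, expected_index) tuples."""
    unused = set(range(len(expected)))
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    for i, row in enumerate(extracted):
        if not unused:
            pairs.append((i, None))  # extra extracted row with no partner
            continue
        # Ties resolve to the lowest expected index because candidates are sorted.
        best = max(sorted(unused), key=lambda j: matching_cells(row, expected[j]))
        unused.remove(best)
        pairs.append((i, best))
    # Expected rows left over were never extracted.
    pairs.extend((None, j) for j in sorted(unused))
    return pairs


extracted = [("a", "b", "c"), ("d", "e", "f")]
expected = [("d", "e", "f"), ("a", "b", "c"), ("g", "h", "i")]
print(pair_rows(extracted, expected))  # [(0, 1), (1, 0), (None, 2)]
```

A greedy pass is simple but not guaranteed to maximize the total number of matching cells; an optimal assignment method could be substituted if that matters.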

Accuracy Calculation with Mismatches

When there are row count mismatches:

  • The total number of cells used in the denominator is the greater of:
    • Number of expected rows × number of columns
    • Number of extracted rows × number of columns
  • Missing or extra rows count as incorrect cells
  • This penalizes both over-extraction and under-extraction

Example Calculation

Using the previous example:

  • Expected cells: 9 (3 rows × 3 columns)
  • Extracted cells: 6 (2 rows × 3 columns)
  • Denominator: max(9, 6) = 9
  • Correctly extracted cells: 6
  • Accuracy: (6/9) × 100% = 66.67%
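The same arithmetic can be reproduced in a few lines. This sketch hard-codes the pairings listed earlier (as zero-based indices, with None for the unmatched side) rather than computing them, and the variable names are illustrative only.

```python
extracted = [("a", "b", "c"), ("d", "e", "f")]
expected = [("d", "e", "f"), ("a", "b", "c"), ("g", "h", "i")]

# (extracted_index, expected_index); None means no row on that side.
pairs = [(0, 1), (1, 0), (None, 2)]

# Only fully paired rows can contribute correct cells.
correct = sum(
    sum(x == y for x, y in zip(extracted[i], expected[j]))
    for i, j in pairs
    if i is not None and j is not None
)

expected_cells = sum(len(row) for row in expected)    # 9
extracted_cells = sum(len(row) for row in extracted)  # 6
accuracy = correct / max(expected_cells, extracted_cells) * 100
print(round(accuracy, 2))  # 66.67
```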

Real-World Example

Consider a case where:

  • 39 rows were extracted
  • 50 rows were expected
  • Each row has 8 columns
  • 3 cells were incorrect in the matched rows

The calculation would be:

  • Expected cells: 50 × 8 = 400
  • Extracted cells: 39 × 8 = 312
  • Denominator: max(400, 312) = 400
  • Correctly extracted cells: 309
  • Accuracy: (309/400) × 100% = 77.25%

This lower accuracy reflects both:

  1. The 3 incorrect cells in matched rows
  2. The 11 missing rows (11 × 8 = 88 cells, all of which count as incorrect)
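The breakdown above can be checked with a short calculation. The variable names are illustrative, and the sketch assumes every extracted row was paired with an expected row, which holds here because fewer rows were extracted than expected.

```python
rows_expected, rows_extracted, columns = 50, 39, 8
incorrect_in_matched = 3

expected_cells = rows_expected * columns            # 400
extracted_cells = rows_extracted * columns          # 312
denominator = max(expected_cells, extracted_cells)  # 400

# Every extracted cell is correct except the 3 errors in matched rows.
correct_cells = extracted_cells - incorrect_in_matched       # 309
missing_cells = (rows_expected - rows_extracted) * columns   # 88

accuracy = correct_cells / denominator * 100
print(f"{accuracy:.2f}%")  # 77.25%
```

The 91 cells that are not counted as correct are exactly the 3 errors plus the 88 cells in the missing rows.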

Key Points

  • This accuracy calculation method is specifically designed for array data
  • The denominator is always the larger of the expected or extracted cell count
  • Array row order mismatches are handled by pairing rows rather than comparing them by position
  • Missing or extra array rows count as incorrect cells, so both under-extraction and over-extraction lower the accuracy