Calculating array accuracy
How Extend calculates accuracy for array data
Overview
Accuracy calculation for data extraction involves comparing extracted data against expected data, taking into account both the content and structure of the data. This process is relatively straightforward for scalar values, but more complex for arrays due to potential mismatches in row ordering and count.
Array Accuracy Calculation
For array data (tables, lists, etc.), accuracy is calculated as:
Accuracy = (Correctly extracted cells / Denominator) × 100%
The denominator is the greater of:
- Total number of expected cells
- Total number of extracted cells
This ensures that both over-extraction and under-extraction are properly penalized in the accuracy calculation.
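As a minimal sketch of the rule above, the function below computes accuracy from three cell counts. The function name `array_accuracy` and its parameters are illustrative assumptions, not part of Extend's API, and raising an error for two empty arrays is likewise an assumption.

```python
def array_accuracy(correct_cells: int, expected_cells: int, extracted_cells: int) -> float:
    """Return accuracy as a percentage, using the larger cell count as the denominator."""
    denominator = max(expected_cells, extracted_cells)
    if denominator == 0:
        # Assumption: the source does not say how two empty arrays are scored.
        raise ValueError("no cells to compare")
    return 100 * correct_cells / denominator


# Under-extraction: 6 cells extracted (all correct) against 9 expected.
under = array_accuracy(correct_cells=6, expected_cells=9, extracted_cells=6)

# Over-extraction: 12 cells extracted against 9 expected, 9 of them correct.
over = array_accuracy(correct_cells=9, expected_cells=9, extracted_cells=12)

print(round(under, 2), round(over, 2))
```

Because the denominator is the larger count, the over-extraction case scores below 100% even though every expected cell was found.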
Handling Row Mismatches
The accuracy calculation becomes more complex when there are differences between the extracted and expected array structure. Consider the following scenarios:
Example: Array Row Order Mismatch
Extracted Array:
Expected Array:
In this case:
- The algorithm pairs extracted rows with expected rows regardless of their position in the array
- Missing or extra rows are counted as incorrect cells
- The total number of cells is based on the expected structure, because the expected array has more cells than the extracted array
Row Pairing Process
The algorithm creates the following pairings:
- Extracted Row 1 ↔ Expected Row 2
- Extracted Row 2 ↔ Expected Row 1
- null ↔ Expected Row 3
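The source does not describe how rows are paired, so the following is only one possible approach: a greedy match on the number of identical cells. The row data, the names `matching_cells` and `pair_rows`, and the rule that rows with no matching cells stay unpaired are all assumptions made for illustration. Indices are 0-based, so `(0, 1)` corresponds to Extracted Row 1 ↔ Expected Row 2.

```python
from typing import List, Optional, Tuple

Row = List[str]


def matching_cells(a: Row, b: Row) -> int:
    """Count the column positions where two rows hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def pair_rows(extracted: List[Row], expected: List[Row]) -> List[Tuple[Optional[int], Optional[int]]]:
    """Greedily pair rows by descending overlap; unpaired rows are paired with None."""
    candidates = sorted(
        (
            (matching_cells(ext, exp), i, j)
            for i, ext in enumerate(extracted)
            for j, exp in enumerate(expected)
        ),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_extracted, used_expected = set(), set()
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    for score, i, j in candidates:
        # Assumption: rows that share no cells are left unpaired.
        if score == 0 or i in used_extracted or j in used_expected:
            continue
        used_extracted.add(i)
        used_expected.add(j)
        pairs.append((i, j))
    # Extra extracted rows and missing expected rows are paired with None.
    pairs += [(i, None) for i in range(len(extracted)) if i not in used_extracted]
    pairs += [(None, j) for j in range(len(expected)) if j not in used_expected]
    return pairs


# Hypothetical data: two extracted rows in swapped order, three expected rows.
extracted = [["Widget B", "2", "20.00"], ["Widget A", "1", "10.00"]]
expected = [
    ["Widget A", "1", "10.00"],
    ["Widget B", "2", "20.00"],
    ["Widget C", "3", "30.00"],
]

print(pair_rows(extracted, expected))
```

A greedy match is simple but not guaranteed to be optimal when rows partially overlap; an assignment algorithm would be a more robust choice in that case.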
Accuracy Calculation with Mismatches
When there are row count mismatches:
- The total number of cells used in the denominator is the greater of:
  - Number of expected rows × number of columns
  - Number of extracted rows × number of columns
- Missing or extra rows count as incorrect cells
- This penalizes both over-extraction and under-extraction
Example Calculation
Using the previous example:
- Expected cells: 9 (3 rows × 3 columns)
- Extracted cells: 6 (2 rows × 3 columns)
- Denominator: max(9, 6) = 9
- Correctly extracted cells: 6 (every cell in the two paired rows matches)
- Accuracy: (6/9) × 100% ≈ 66.67%
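The calculation can also be sketched starting from paired rows rather than from counts. In this illustration, each pair holds an extracted row and an expected row, with `None` standing in for a missing or extra row. The name `score_pairs`, the sample rows, and the choice to reject empty input are assumptions, not documented behavior.

```python
from typing import List, Optional, Tuple

Row = List[str]
Pair = Tuple[Optional[Row], Optional[Row]]  # (extracted row, expected row)


def score_pairs(pairs: List[Pair], num_columns: int) -> float:
    """Score paired rows cell by cell and return accuracy as a percentage."""
    extracted_rows = sum(1 for ext, _ in pairs if ext is not None)
    expected_rows = sum(1 for _, exp in pairs if exp is not None)
    denominator = max(extracted_rows, expected_rows) * num_columns
    if denominator == 0:
        # Assumption: scoring two empty arrays is treated as an error here.
        raise ValueError("no cells to compare")
    # Only cells in fully paired rows can be correct; rows paired with None add nothing.
    correct = sum(
        sum(1 for a, b in zip(ext, exp) if a == b)
        for ext, exp in pairs
        if ext is not None and exp is not None
    )
    return 100 * correct / denominator


# Hypothetical rows: two matched pairs and one expected row with no counterpart.
pairs: List[Pair] = [
    (["Widget B", "2", "20.00"], ["Widget B", "2", "20.00"]),
    (["Widget A", "1", "10.00"], ["Widget A", "1", "10.00"]),
    (None, ["Widget C", "3", "30.00"]),
]

accuracy = score_pairs(pairs, num_columns=3)
print(f"{accuracy:.2f}%")
```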
Real-World Example
Consider a case where:
- 39 rows were extracted
- 50 rows were expected
- Each row has 8 columns
- 3 cells were incorrect in the matched rows
The calculation would be:
- Expected cells: 50 × 8 = 400
- Extracted cells: 39 × 8 = 312
- Denominator: max(400, 312) = 400
- Correctly extracted cells: 312 − 3 = 309
- Accuracy: (309/400) × 100% = 77.25%
This lower accuracy reflects both:
- The 3 incorrect cells in matched rows
- The 11 missing rows (11 × 8 = 88 cells, all counted as incorrect)
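To see how much each factor contributes, the arithmetic can be broken down as follows. This is a plain check of the numbers in this example; the variable names are illustrative.

```python
rows_extracted, rows_expected, columns = 39, 50, 8
incorrect_in_matched_rows = 3

expected_cells = rows_expected * columns    # 400
extracted_cells = rows_extracted * columns  # 312
denominator = max(expected_cells, extracted_cells)

# Cells lost to rows that were never extracted.
missing_cells = (rows_expected - rows_extracted) * columns  # 88

correct_cells = extracted_cells - incorrect_in_matched_rows  # 309
accuracy = 100 * correct_cells / denominator

# Percentage points lost to each cause.
loss_from_incorrect = 100 * incorrect_in_matched_rows / denominator
loss_from_missing = 100 * missing_cells / denominator

print(accuracy, loss_from_incorrect, loss_from_missing)
```

The missing rows account for 22 of the 22.75 percentage points lost, so under-extraction dominates the score in this case.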
Key Points
- This accuracy calculation method is specifically designed for array data
- Accuracy is the number of correctly extracted cells divided by the denominator, expressed as a percentage
- Array row order mismatches are handled by pairing rows regardless of position
- Missing or extra array rows are penalized in the accuracy calculation
- The denominator is always the larger of the expected or extracted cell count