
Enhance Keras Compatibility, Tokenization, and Performance Optimizations #28

Merged · 54 commits · Jan 6, 2025

Conversation

TimKoornstra (Collaborator) commented Oct 22, 2024

Pull Request Notes

This pull request introduces several significant updates aimed at enhancing compatibility, performance, and flexibility:

  • Keras 2 and 3 Compatibility: Forward and backward compatibility between Keras 2 and 3 has been introduced. This ensures smooth upgrades from TensorFlow 2.14.1 to 2.17.1 and intermediate versions. However, we have observed that TensorFlow versions >= 2.16 may run slower. As a workaround, we advise using the --use_float32 parameter.
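
  The PR does not show how the flag is wired up; as a minimal illustrative sketch (the parser setup below is an assumption, not the project's actual code), a --use_float32 style switch could be declared like this:

  ```python
  import argparse

  # Illustrative sketch only: how a --use_float32 style flag might be wired
  # into an argument parser. The parser layout is an assumption, not the
  # project's actual implementation.
  parser = argparse.ArgumentParser(description="Training options (sketch)")
  parser.add_argument(
      "--use_float32",
      action="store_true",
      help="Force float32 computation instead of mixed precision "
           "(workaround for slowdowns on TensorFlow >= 2.16)",
  )

  args = parser.parse_args(["--use_float32"])
  print(args.use_float32)  # True
  ```

  Downstream, the flag would typically select a float32 compute policy instead of mixed precision before the model is built.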

  • Broken Parameter: The --steps_per_epoch parameter is currently broken.

  • Tokenizer Update: The charlist.txt file has been replaced by a tokenizer.json file. This change introduces more flexible tokenization schemes and improves readability. Padding and OOV tokens have been updated to "[PAD]" and "[UNK]", respectively. Any existing charlist.txt files will be automatically converted to the new tokenizer.json format.
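
  A rough sketch of the conversion idea, assuming only what the PR states (reserved "[PAD]" and "[UNK]" tokens); the actual tokenizer.json schema and conversion code in the project may differ:

  ```python
  import json

  def charlist_to_tokenizer(chars):
      """Sketch: build a tokenizer-style vocabulary from a character list.
      Only the reserved-token convention ([PAD], [UNK]) is taken from the
      PR notes; the real tokenizer.json schema is not reproduced here."""
      # Reserve the padding and OOV tokens first, as described above.
      vocab = {"[PAD]": 0, "[UNK]": 1}
      for ch in chars:
          if ch not in vocab:
              vocab[ch] = len(vocab)
      return vocab

  vocab = charlist_to_tokenizer(["a", "b", "c"])
  print(json.dumps(vocab))
  # {"[PAD]": 0, "[UNK]": 1, "a": 2, "b": 3, "c": 4}
  ```

  An automatic migration step along these lines lets existing charlist.txt files keep working without manual intervention.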

  • VGSL Specification Update: The local implementation of the Variable-Size Graph Specification Language (VGSL) has been replaced by the vgslify package, which simplifies the codebase. Although the VGSL specifications have been slightly modified, the model library should function as before. Please refer to the VGSLify documentation for further details.

  • Training Log Enhancement: Learning rate values are now logged at each training step.
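
  The project presumably implements this as a Keras callback; the plain-Python stand-in below only illustrates the idea of recording the learning rate at every training step (the decay schedule is an arbitrary example, not the project's):

  ```python
  class LearningRateLogger:
      """Minimal stand-in for a per-step learning-rate logger.
      Illustrative only; the real implementation is likely a Keras
      callback hooked into the training loop."""

      def __init__(self, schedule):
          self.schedule = schedule  # maps step -> learning rate
          self.history = []

      def on_train_batch_end(self, step):
          lr = self.schedule(step)
          self.history.append((step, lr))
          print(f"step {step}: lr={lr:.6f}")

  # Example schedule: simple exponential decay (an assumption, for illustration).
  logger = LearningRateLogger(lambda step: 1e-3 * (0.95 ** step))
  for step in range(3):
      logger.on_train_batch_end(step)
  ```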

  • CTCLoss Update: CTCLoss has been refactored as a subclass of the Keras Loss class, instead of a simple function. It now also uses tf.function for performance improvement.

  • Augmentation Layers: All augmentation layers have been updated to use tf.function for faster execution.

  • Deprecated Arguments Removed: All arguments and configuration items marked for removal by May 2024 have now been removed.

  • Dataloader Optimization: All dataloader functions have been converted to tf.function to improve performance.

  • API Model Path: The environment variables for loading models have changed, making them similar to the way Laypa model loading works. Supply a LOGHI_BASE_MODEL_DIR and a LOGHI_MODEL_NAME that refer to the directory where your models are stored and the specific model directory, respectively.
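
  A sketch of how the API might resolve the model directory from these two variables (the join logic and the model name used in the example are assumptions):

  ```python
  import os

  def resolve_model_path(env=os.environ):
      """Sketch: combine the two environment variables described above
      into a model directory path. The exact resolution logic in the
      project may differ."""
      base = env["LOGHI_BASE_MODEL_DIR"]
      name = env["LOGHI_MODEL_NAME"]
      return os.path.join(base, name)

  # "htr-generic-2024" is a hypothetical model name, for illustration only.
  path = resolve_model_path({
      "LOGHI_BASE_MODEL_DIR": "/models",
      "LOGHI_MODEL_NAME": "htr-generic-2024",
  })
  print(path)  # /models/htr-generic-2024 (on POSIX)
  ```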

  • Enhanced API Processing: The image preparation worker has been removed; image preparation is now handled by a tf.data.Dataset generator function. This brings API processing closer to the non-API workflow.

  • Inference Time Improvement for Beam Search: Beam search decoding has been significantly optimized. Inference, validation, and test times have been dramatically reduced for beam search with higher thread counts.

  • New Argument: The introduction of the --decoding_threads parameter provides additional flexibility for decoding performance tuning.
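
  As an illustrative sketch of the concept (not the project's actual decoder), fanning per-sample decoding out over a thread pool could look like this; decode_fn stands in for a beam search decoder:

  ```python
  from concurrent.futures import ThreadPoolExecutor

  def decode_batch(predictions, decode_fn, decoding_threads=4):
      """Sketch: decode a batch of per-sample predictions in parallel,
      in the spirit of the --decoding_threads parameter. Threads help
      here when decode_fn releases the GIL (e.g. TF's C++ ops)."""
      with ThreadPoolExecutor(max_workers=decoding_threads) as pool:
          return list(pool.map(decode_fn, predictions))

  # Toy decoder for demonstration: pick the argmax "character" per timestep.
  toy = lambda pred: "".join(
      chr(ord("a") + max(range(len(p)), key=p.__getitem__)) for p in pred
  )
  results = decode_batch([[[0.1, 0.9], [0.8, 0.2]]], toy)
  print(results)  # ['ba']
  ```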

  • Unified Inference, Test, and Validation Functions: The inference, test, and validation functions have been refactored into a single, unified implementation with slight adaptations for specific use cases. This change improves code stability and maintainability.

@rvankoert rvankoert merged commit c2ea0bd into master Jan 6, 2025
5 checks passed