divyansh-phronetic committed on
Commit eb1db7e · verified · 1 Parent(s): 77e00ee

Update README.md

Files changed (1):
  1. README.md +211 -124
README.md CHANGED
@@ -2,199 +2,286 @@
  library_name: transformers
  tags:
  - llama-factory
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-

  ## Model Details

  ### Model Description

- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

  ## Uses
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
  ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

- ### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]

  ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]

- ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]

- ### Recommendations

- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

  ## How to Get Started with the Model
- Use the code below to get started with the model.
-
- [More Information Needed]

  ## Training Details

  ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]

  ### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]

  #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective

- [More Information Needed]

- ### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]

- #### Software

- [More Information Needed]

- ## Citation [optional]

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  **BibTeX:**
-
- [More Information Needed]

  **APA:**
 
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]

- [More Information Needed]

  ## Model Card Contact

- [More Information Needed]

  library_name: transformers
  tags:
  - llama-factory
+ - video-classification
+ - activity-recognition
+ - human-action-recognition
+ - qwen2.5-vl-3B
+ - vision-language-model
+ - fine-tuned
+ - hmdb51
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-3B-Instruct
+ datasets:
+ - hmdb51
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ pipeline_tag: video-classification
  ---

+ # Owlet-HAR-1: Human Activity Recognition Vision Language Model

+ Owlet-HAR-1 is a fine-tuned vision-language model specialized for human activity recognition in videos. Built on Qwen2.5-VL-3B, it achieves 68.19% accuracy on the HMDB51 dataset, a substantial improvement over the base model's 38.33%.

  ## Model Details

  ### Model Description

+ Owlet-HAR-1 is a specialized vision-language model fine-tuned for human activity recognition tasks. The model processes video input and classifies human activities across 51 different action categories, ranging from facial expressions to complex body movements and object interactions.

+ - **Developed by:** Phronetic AI
+ - **Model type:** Vision-Language Model (Video Classification)
+ - **Language(s):** English (text output), Visual (video input)
+ - **License:** Apache 2.0
+ - **Finetuned from model:** Qwen/Qwen2.5-3B-Instruct
+ - **Specialized for:** Human Activity Recognition in Videos

+ ### Model Sources

+ - **Repository:** https://huggingface.co/phronetic-ai/owlet-har-1
+ - **Base Model:** https://huggingface.co/Qwen/Qwen2.5-3B-Instruct
+ - **Research Blog:** [Enhancing Video Activity Recognition with Human Pose Data: A Vision Language Model Study](To be added)

  ## Uses

  ### Direct Use

+ The model is designed for direct video activity recognition: it takes video input and outputs a single word describing the primary human activity being performed. It can be used for:

+ - **Healthcare monitoring**: Identifying daily activities and movements
+ - **Human-computer interaction**: Understanding user actions in video interfaces
+ - **Security and surveillance**: Automated activity detection
+ - **Content analysis**: Categorizing video content by human activities

+ ### Downstream Use

+ The model can be integrated into larger systems for:
+ - Video content management systems
+ - Automated video tagging and indexing
+ - Real-time activity monitoring applications
+ - Educational platforms for movement analysis
+ - Assistive technologies for elderly or disabled individuals

  ### Out-of-Scope Use

+ - **Privacy-sensitive applications**: The model should not be used for unauthorized surveillance
+ - **High-stakes decision making**: Not suitable for critical applications without human oversight
+ - **Real-time safety systems**: May not be reliable enough for safety-critical applications
+ - **Non-human activity recognition**: Trained specifically on human activities
+ - **Complex scene understanding**: Focuses on single-person activities and may struggle with multi-person scenes

+ ## Performance

+ ### Key Metrics on HMDB51

+ - **Accuracy:** 68.19%
+ - **Precision:** 70.20%
+ - **Recall:** 67.93%

+ ### Best-Performing Activities

+ - Cartwheeling, drawing sword, falling, grooming, punching: 100% precision and recall
+ - Climbing stairs: 95.24% recall, 90.91% precision
+ - Drinking: 93.10% recall, 90.00% precision

+ ### Challenging Activities

+ - Biking: 0% precision and recall (a consistent failure case)
+ - Catching and shooting: lower performance across metrics

  ## How to Get Started with the Model

+ ```python
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+ import torch
+
+ # Load the model and processor
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     "phronetic-ai/owlet-har-1",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained("phronetic-ai/owlet-har-1", trust_remote_code=True)
+
+ # Video inference - multiple input methods are supported.
+ # Each method below reassigns `messages`; keep only the one you need.
+
+ # Method 1: Using video frames as an image list
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": [
+                     "file:///path/to/frame1.jpg",
+                     "file:///path/to/frame2.jpg",
+                     "file:///path/to/frame3.jpg",
+                     "file:///path/to/frame4.jpg",
+                 ],
+             },
+             {"type": "text", "text": "What's the activity the person is doing in this video? Answer in one word only."},
+         ],
+     }
+ ]
+
+ # Method 2: Using a local video file
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "file:///path/to/your_video.mp4",
+                 "max_pixels": 360 * 420,
+                 "fps": 1.0,
+             },
+             {"type": "text", "text": "What's the activity the person is doing in this video? Answer in one word only."},
+         ],
+     }
+ ]
+
+ # Method 3: Using a video URL
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "https://your-video-url.com/video.mp4",
+             },
+             {"type": "text", "text": "What's the activity the person is doing in this video? Answer in one word only."},
+         ],
+     }
+ ]
+
+ # Prepare inputs for inference
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+     **video_kwargs,
+ )
+ inputs = inputs.to("cuda")
+
+ # Generate, then decode only the newly generated tokens
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+
+ print(f"Detected activity: {output_text[0]}")
+ ```

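+ Note that the prompt's "Answer in one word only." instruction is what constrains the generative model to a single-word activity label; `max_new_tokens=128` simply leaves headroom for the short answer.
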
  ## Training Details

  ### Training Data

+ The model was fine-tuned on the HMDB51 dataset, which contains:
+ - **Total clips:** 6,849 video clips
+ - **Categories:** 51 distinct human action categories
+ - **Category groups:**
+   - General Facial Actions (smile, laugh, chew, talk)
+   - Facial Actions with Object Manipulation (smoke, eat, drink)
+   - General Body Movements (cartwheel, handstand, jump, run, walk)
+   - Body Movements with Object Interaction (brush hair, catch, golf, shoot ball)
+   - Body Movements for Human Interaction (fencing, hug, kiss, punch)

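+ As a rough illustration, a clip/label pair can be rendered into the same conversation format the model sees at inference time. This is a hypothetical sketch only; the exact LLaMA-Factory training schema is not published in this card:

+ ```python
+ def to_training_example(video_path: str, label: str) -> dict:
+     """Render one HMDB51 clip as a single-turn conversation (assumed schema)."""
+     return {
+         "messages": [
+             {"role": "user", "content": [
+                 {"type": "video", "video": f"file://{video_path}"},
+                 {"type": "text", "text": "What's the activity the person is doing in this video? Answer in one word only."},
+             ]},
+             {"role": "assistant", "content": [{"type": "text", "text": label}]},
+         ]
+     }
+
+ example = to_training_example("/data/hmdb51/drink/clip_0001.avi", "drink")
+ ```
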
  ### Training Procedure

  #### Training Hyperparameters

+ - **Learning rate:** 5e-05 with cosine annealing
+ - **Training epochs:** 3
+ - **Batch size:** 16 effective (2 per device × 8 gradient accumulation steps)
+ - **Fine-tuning method:** LoRA (Low-Rank Adaptation) with rank 8
+ - **Training regime:** BF16 mixed precision
+ - **Optimization:** AdamW optimizer
+ - **Framework:** LLaMA-Factory (an equivalent `peft`/`transformers` sketch follows below)
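
+ The following minimal sketch shows how these hyperparameters would map onto a `peft` + `transformers` setup. It is an illustration only: `lora_alpha`, `target_modules`, and the model/data wiring are assumptions, not values taken from the actual LLaMA-Factory config.

+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import TrainingArguments
+
+ # LoRA adapter: rank 8 as reported; alpha and target modules are assumptions
+ lora_config = LoraConfig(
+     r=8,
+     lora_alpha=16,                          # assumption
+     target_modules=["q_proj", "v_proj"],    # assumption
+     task_type="CAUSAL_LM",
+ )
+ # peft_model = get_peft_model(base_model, lora_config)  # base_model: the loaded VLM
+
+ # Optimizer and schedule settings as reported in this card
+ training_args = TrainingArguments(
+     output_dir="owlet-har-1-lora",
+     learning_rate=5e-5,
+     lr_scheduler_type="cosine",             # cosine annealing
+     num_train_epochs=3,
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=8,          # effective batch size 16
+     bf16=True,                              # BF16 mixed precision
+     optim="adamw_torch",                    # AdamW
+ )
+ ```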

+ #### Compute Infrastructure

+ - **Hardware:** AWS g5.2xlarge instances with NVIDIA A10G GPUs
+ - **Training duration:** Approximately 3 epochs over the full HMDB51 dataset
+ - **Memory optimization:** LoRA fine-tuning with BF16 precision for memory efficiency

  ## Evaluation

+ ### Testing Data

+ Evaluated on the HMDB51 test split using the same 51 activity categories.

+ ### Metrics

+ - **Accuracy:** Overall classification accuracy across all categories
+ - **Precision:** Per-category and macro-averaged precision
+ - **Recall:** Per-category and macro-averaged recall
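
+ A minimal sketch of how these metrics can be reproduced from per-clip predictions (illustrative only: `scikit-learn` is not referenced in this card, and the label lists are made up):

+ ```python
+ from sklearn.metrics import accuracy_score, precision_score, recall_score
+
+ # Hypothetical gold and predicted labels, one per evaluated clip
+ y_true = ["drink", "punch", "run", "ride_bike"]
+ y_pred = ["drink", "punch", "walk", "ride_horse"]
+
+ accuracy = accuracy_score(y_true, y_pred)
+ # Macro-averaging weights each of the 51 categories equally
+ precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
+ recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
+ print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
+ ```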
+ ### Results Summary

+ The model performs strongly on structured activities (gymnastics, specific movements) but struggles with activities involving rapid motion or complex object interactions. The 68.19% accuracy is a 29.86-percentage-point improvement over the base Qwen2.5-VL-3B model's 38.33%.

+ ## Bias, Risks, and Limitations

+ ### Known Limitations

+ - **Dataset bias:** Trained on HMDB51, which may not represent all human activities or demographics
+ - **Single-person focus:** Optimized for single-person activities; may struggle with multi-person scenes
+ - **Video quality dependency:** Performance may degrade with poor lighting, low resolution, or occluded subjects
+ - **Cultural bias:** Training data may not represent activities from all cultures equally
+ - **Temporal resolution:** May miss very brief or subtle activities

+ ### Risk Considerations

+ - **Privacy concerns:** Video analysis capabilities could be misused for surveillance
+ - **Misclassification impact:** Incorrect classifications could lead to inappropriate automated responses

+ ## Technical Specifications

+ ### Model Architecture

+ - **Base Architecture:** Qwen2.5-VL 3B vision-language model
+ - **Parameters:** ~3 billion
+ - **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
+ - **Input:** Video sequences
+ - **Output:** Text classification (single-word activity label)
+ - **Context Length:** Supports video sequences with multiple frames

+ ### Model Objective

+ The model is trained to classify human activities in videos with a text generation objective: it generates a single word naming the detected activity.
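
+ Because the head is generative rather than a fixed classifier, a thin post-processing step helps snap the generated word onto a known class. A hypothetical sketch (the label set shown is only a subset; the canonical names come from the 51 HMDB51 categories):

+ ```python
+ # Hypothetical normalizer: map free-form output onto an HMDB51 class name
+ HMDB51_LABELS = {
+     "brush_hair", "cartwheel", "catch", "drink", "punch", "run", "walk",
+     # ...remaining classes from the 51-category HMDB51 label set
+ }
+
+ def to_label(generated: str) -> str:
+     word = generated.strip().lower().replace(" ", "_")
+     return word if word in HMDB51_LABELS else "unknown"
+
+ print(to_label("Punch "))    # -> "punch"
+ print(to_label("dancing"))   # -> "unknown"
+ ```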
+ ## Citation

+ If you use this model in your research, please cite:

  **BibTeX:**
+ ```bibtex
+ @misc{owlet-har-1,
+   title={Owlet-HAR-1: Human Activity Recognition Vision Language Model},
+   author={Phronetic AI},
+   year={2025},
+   url={https://huggingface.co/phronetic-ai/owlet-har-1}
+ }
+ ```

  **APA:**

+ Phronetic AI. (2025). Owlet-HAR-1: Human Activity Recognition Vision Language Model. Hugging Face. https://huggingface.co/phronetic-ai/owlet-har-1

+ ## Model Card Authors

+ Phronetic AI Research Team

  ## Model Card Contact

+ For questions about this model, please contact divyansh.makkar@phronetic.ai.