which is pretty structured. OP's tool produces a fairly sparse result from that. By the way, why try to parse keywords and do a URL lookup instead of accepting a Wikipedia URL natively?
Thanks, good suggestion on accepting a Wikipedia URL natively. I actually started with that, but when I shared it with a friend, he thought having to get the Wikipedia URL first was an extra step that made it less user-friendly. I guess I should support both ways, though.
As for the sparse result, I am still experimenting with how granular the timeline should be. I now prompt the AI to include as many events as possible, but it's still a mystery to me how to control this precisely.
Thanks for trying it out! Yeah, since I prompt it to only include events where a date is available, it can't output many events if the article doesn't have much temporal info. Your attempt did reveal a hard issue I've been trying to fix, though: the AI sometimes outputs a negative year for an AD date, even though I ask it to do that only for BC/BCE dates. I thought I had fixed it, but obviously not for all cases.
Thanks! I actually planned to learn some prompt engineering best practices and kept iterating, but the performance kept getting worse, so I reverted to the first version. I ended up with a pretty straightforward prompt: find events with dates, use only the precision available (for example, if only the year is known, output just the year; otherwise the model makes something up like YYYY/1/1), and output each event as JSON with a headline and a more detailed description. My experience is that a better model matters much more than prompt engineering (though of course the prompt needs to be basically right). For cost reasons I first tried OpenAI's 4o-mini, and the performance was bad no matter how hard I worked on the prompt, but when I switched to 4o, even a basic prompt worked.
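To give a rough idea, here is a minimal sketch of that setup (the prompt wording, schema, and field names here are illustrative, not my exact ones):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # article_text: plain text of the Wikipedia article, fetched elsewhere
    article_text = open("article.txt").read()

    PROMPT = """Extract historical events from the article below.
    Only include events with an explicit date, and use only the precision
    the article gives (just the year if that is all it provides). Use
    negative years for BC/BCE dates. Return a JSON object with an "events"
    array; each event has "year", "headline", and "description" fields.

    Article:
    {article}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
        response_format={"type": "json_object"},  # forces valid JSON output
    )
    events = response.choices[0].message.content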
One issue it still struggles with: I want it to output a negative year for BC/BCE dates, but it sometimes messes this up, either using a positive year for BC/BCE or a negative year even when the date is AD. I changed the prompt to also include the date inside the detailed description, because I found the date in the description is usually right, and then ask the model to correct the sign of the structured year by checking it against the description. I feel this fixed the issue most of the time, but it still happens in some cases.
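For what it's worth, that kind of cross-check could also run as a post-processing step instead of inside the prompt; a hypothetical sketch (not what the site actually runs):

    import re

    def fix_year_sign(event):
        # Trust the free-text description over the structured year:
        # if it mentions BC/BCE, force the year negative, else positive.
        year = abs(event["year"])
        if re.search(r"\b(BC|BCE)\b", event["description"]):
            event["year"] = -year
        else:
            event["year"] = year
        return event

    # e.g. fix_year_sign({"year": 44, "description": "44 BC: Caesar is assassinated"})
    # leaves the description alone but sets "year" to -44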
Overall, the challenge is that I don't have a benchmark to evaluate timeline quality yet; I basically check quality myself by manually sampling random timelines. Future work (if I stay interested in this) would be building a benchmark that can evaluate quality automatically, so that I can iterate on the prompt and also try different LLMs.
https://wiki-timeline.com/timeline/History_of_Maxwell%27s_eq...
seems to be a limitation of the article though rather than the tool. Maybe such a tool will encourage more filled-out temporal content in articles. :)