At Socialgist, we wrangle a lot of unstructured textual data. One common challenge we've wrestled with is handling dates in a wide variety of formats. Dates may seem mundane, but interpreting them correctly is critical to understanding trends over time in social data and other datasets. Socialgist is now making our natural language processing (NLP) technology for normalizing dates across a multitude of formats and languages available on the AWS SageMaker marketplace.
We've released two models that perform the key steps of date handling: extracting a date from surrounding text, and normalizing dates into a standard format. Given an input date and time, the date transformer model returns a normalized date in YYYY-MM-DD 00:00:00 format. It can also transform relative dates, for example "3 years and 2 months ago". The model supports dates in 13 European languages and can interpret any date between 1975 and 2050.
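To make the target format concrete, here is a minimal sketch of what normalizing a relative date involves. It is not the Socialgist model: it uses only the Python standard library, handles only English phrases built from years and months, and the function names `shift_months` and `normalize_relative` are hypothetical.

```python
import calendar
import re
from datetime import datetime


def shift_months(moment, months_back):
    """Move a datetime back by whole months, clamping the day if needed."""
    total = moment.year * 12 + (moment.month - 1) - months_back
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that e.g. 31 March minus one month gives the end of February.
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def normalize_relative(text, now):
    """Illustrative only: supports phrases like '3 years and 2 months ago'."""
    months_per_unit = {"year": 12, "month": 1}
    months_back = 0
    for count, unit in re.findall(r"(\d+)\s+(year|month)s?", text.lower()):
        months_back += int(count) * months_per_unit[unit]
    shifted = shift_months(now, months_back)
    # The time component is zeroed, matching the YYYY-MM-DD 00:00:00 output format.
    return shifted.strftime("%Y-%m-%d 00:00:00")


reference = datetime(2020, 5, 31, 14, 30)
print(normalize_relative("3 years and 2 months ago", reference))
```

A relative phrase only has a meaning with respect to a reference time, which is why the sketch takes `now` as an explicit argument instead of reading the system clock.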
We tested our new models against popular open-source natural language processing solutions, considering both “recall”, or successful detection of dates in text, and “precision”, or correctness of date interpretation. In our tests, the precision of the model for date interpretation was 3.5 times higher than that of the commonly used Stanford CoreNLP framework:
| Metric | Socialgist CheckDate | Stanford CoreNLP |
|---|---|---|
| Date F1 score (Precision + Recall) | .98 | .30 |
| Date F1 score (Precision + Recall) | .97 | .93 |
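For readers unfamiliar with the metric, an F1 score conventionally combines precision and recall as their harmonic mean, not as a simple sum. The sketch below assumes that standard definition; the counts are made-up illustration values, not figures from our benchmark, and the function name `f1_report` is hypothetical.

```python
def f1_report(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw counts, using the standard definitions."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: a low value on either side pulls the score down.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 dates handled correctly, 10 wrong, 30 missed.
precision, recall, f1 = f1_report(90, 10, 30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the harmonic mean penalizes imbalance, a system cannot reach a high F1 by detecting many dates while interpreting them badly, or the reverse.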
If your dataset is likely to contain the US date format MM-DD-YY or other non-standard formats like YY-MM-DD, you can declare this using API parameters to help guide the interpretation of dates.
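The reason such a hint matters is that the same short string can be a valid date under several field orders. The following sketch illustrates the ambiguity with Python's standard library only; the `order` argument and its values are hypothetical and are not the actual parameter names of the Socialgist API.

```python
from datetime import datetime

# Hypothetical hint values mapped to strptime patterns (not the real API parameters).
FORMATS = {
    "day_first": "%d-%m-%y",
    "month_first": "%m-%d-%y",
    "year_first": "%y-%m-%d",
}


def parse_with_hint(text, order):
    """Interpret an ambiguous numeric date according to the given field order."""
    parsed = datetime.strptime(text, FORMATS[order])
    return parsed.strftime("%Y-%m-%d 00:00:00")


ambiguous = "03-04-21"
for order in FORMATS:
    print(order, "->", parse_with_hint(ambiguous, order))
```

Without a hint, nothing in "03-04-21" itself says which of the three readings is intended, so knowing how the dataset was produced is the only reliable way to choose.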
Interested in learning more? Have feedback or feature requests? Contact us