Introduction

The Babel NER (Named Entity Recognition) project uses 18 named entity labels: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART. These correspond to the labels documented for the spaCy model, available here: model card.

The model makes predictions on sentence-level data, but you do not need to segment your input into sentences yourself. Each line is split into sentences automatically during evaluation, and each sentence receives a unique identifier derived from the text identifier. In the output, each sentence is followed by two columns per recognized entity: the label, then the corresponding value from the sentence. The NER Babel Machine currently uses an English model and a Hungarian model.
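The output layout described above can be illustrated with a minimal Python sketch. This is not the Babel Machine's own code: the identifier scheme (text id plus a sentence counter), the function names, and the example sentences and entities are all made up for illustration, and the actual output file may name or order its columns differently.

```python
import csv
import io


def sentence_id(text_id, index):
    # Hypothetical scheme: the text id followed by a 1-based sentence counter.
    return f"{text_id}_{index}"


def build_row(text_id, index, sentence, entities):
    """Flatten one sentence and its (label, value) pairs into a single row."""
    row = [sentence_id(text_id, index), sentence]
    for label, value in entities:
        # Each recognized entity contributes two columns: label, then value.
        row.extend([label, value])
    return row


# Made-up example input: one text ("doc1") already split into two sentences.
rows = [
    build_row(
        "doc1", 1, "Anna moved to Budapest in 2019.",
        [("PERSON", "Anna"), ("GPE", "Budapest"), ("DATE", "2019")],
    ),
    build_row("doc1", 2, "She stayed.", []),
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

A sentence with three recognized entities therefore has six columns after the sentence text, while a sentence with none has no extra columns.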

You can upload your datasets here for automated NER coding. If you wish to submit multiple datasets one after another, please wait 5-10 minutes between submissions. Only non-coded datasets can be uploaded. An explanation of the form and the dataset requirements is available here.

The upload requires you to fill in the following form with metadata about the dataset.

The non-coded datasets should contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables to the dataset beyond the compulsory ones in the columns following them.

Automatic processing requires that you follow these rules.
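Before uploading, it can help to check a file locally against these rules. The following is a minimal Python sketch, not part of the Babel Machine itself; the function name check_dataset and the wording of the messages are hypothetical. It only checks what this page states: the first row holds the column names, the id and text columns are present, and the file decodes as UTF-8 (the encoding required by the upload form).

```python
import csv

REQUIRED_COLUMNS = ("id", "text")


def check_dataset(path):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    try:
        # Strict UTF-8 decoding raises UnicodeDecodeError on other encodings.
        with open(path, encoding="utf-8", newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)  # row 1 must hold the column names
            if header is None:
                return ["file is empty"]
            names = [name.strip() for name in header]
            for column in REQUIRED_COLUMNS:
                if column not in names:
                    problems.append(f"missing required column: {column}")
            # Read the remaining rows so encoding errors anywhere in the file surface.
            for _ in reader:
                pass
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

Calling check_dataset("my_data.csv") (a placeholder file name) returns an empty list when none of these problems are found. Supplementary columns are ignored by the check, since the rules allow them.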

After you upload your dataset and your file is successfully processed, you will receive the NER-coded dataset and a file (in CSV format) that includes the predictions of the spaCy model. If the files you would like to upload are larger than 1 GB, please send us a download link (for example, from Dropbox or Google Drive) using our contact form.

If you have any questions or feedback regarding the Babel Machine, please let us know using our contact form. Please remember that we can only get back to you on Hungarian business days.

Submit a dataset:


The non-coded datasets should contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables to the dataset beyond the compulsory ones in the columns following them. All datasets must be uploaded in CSV file format with UTF-8 encoding.

    Troubleshooting

    If you are experiencing problems with the upload form, or your submission returns an error message (particularly "Something unexpected happened during upload. Please try again later."), please try performing the following steps:

    • If you use an adblocker browser extension, please turn it off for our site. Adblockers may interfere with legitimate functionality, such as the dropdowns on the upload form. (We do not serve ads on the site.)
    • Try turning off your VPN.
    • Try submitting your data from another browser, preferably with default settings.

    If you are still receiving the "Something unexpected..." error message, please get in touch with us via our email address or the contact form. Please include as much information as possible, e.g., which browser you are using, any notable browser extensions, whether you are using a VPN, and exactly how you tried to submit the data (for example, you filled out everything but waited 10 minutes before pressing submit).

    The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 program under grant agreement no 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support.


    HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:
    Sebők M, Kacsuk Z. The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach. Political Analysis. 2021;29(2):236-249. doi:10.1017/pan.2020.27