To classify the authors and people mentioned in titles of articles based on their gender, we used the genderizeR package (Wais a) for R (R Core Team). The package guesses the gender of a person based on the first name and the data gathered in the genderize.io database. Created in August , the database has been regularly updated since by the continuous scanning of public profiles of social network users.
In April , the genderize.
Glossary
Authorship: the unique combination of the title of an article and the name of one of its authors. Note that the same author can publish more than one article, so the number of authorships will be greater than the number of authors.
Unisex first name: a first name that can be used by both men and women.
Gender database: a database used for gender classification; in our study, we used genderize.io.
Probability: given a first name, the probability that a person with this first name is a man or a woman, depending on the context. If the probability is 0.
Count: the number of people in the gender database with the same first name.
Gender classification
We used the methodology suggested in Wais (b) to guess the gender of (i) people mentioned in titles of biographical articles and (ii) authors of these articles.
The algorithm, available in the genderizeR package (Wais b), automatically parses all title words and checks them in the genderize.io database. In its third step, the algorithm takes into account that some first names are valid for both men and women, so classifying such names is always imprecise. Using the gender data from the database, we can estimate this uncertainty: given a first name, the probability of being a woman is estimated as the share of people with this first name who declared themselves as women.
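As a minimal sketch of the estimate described above, the share of women among people recorded with a given first name can be computed from two counts. The function name and the example counts are hypothetical; they are not taken from genderizeR or from the genderize.io data.

```python
def probability_female(female_count: int, male_count: int) -> float:
    """Estimate P(woman | first name) as the share of recorded people
    with this first name who declared themselves as women."""
    total = female_count + male_count
    if total == 0:
        # No one in the database carries this first name, so no estimate exists.
        raise ValueError("first name not present in the gender database")
    return female_count / total


# Hypothetical counts for an illustrative unisex first name.
p_female = probability_female(female_count=60, male_count=40)
print(f"Estimated probability of being a woman: {p_female:.2f}")
```

A value near 0.5 signals a unisex first name, for which any classification is uncertain.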
Validation of gender classifications
Validation datasets
We validated the algorithm with a random sample of unique biographical articles. This way, we coded the gender of persons in the titles.
Similarly, to validate how precisely the algorithm classified the gender of authors, we randomly sampled biographical articles and extracted author names.
We coded the gender of authorships.
Training the algorithm
The genderize.io database gives, for each first name, a probability that the name belongs to a given gender. We have to decide whether we wish to work only with names for which this probability is close to 1 or whether we also accept names for which this probability is closer to 0. Thus, to train the algorithm for classifying gender, we should check different threshold values of this probability and choose the best one.
The algorithm will not use first names with probabilities below this threshold; this way, we can decrease the uncertainty of our classifications at the cost of ignoring unisex first names. We should also be cautious when using rare unisex first names. To decide which names should be included in the algorithm and which ignored, we should test different threshold values for counts of how many times a first name was recorded in the gender database; the algorithm will use only those first names which occurred more often than the threshold.
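The two thresholds described above can be illustrated with a short sketch. It assumes a local dictionary standing in for the gender database; the record layout, the function name, and the toy records are hypothetical and do not reproduce the genderizeR interface.

```python
from typing import Dict, NamedTuple, Optional


class NameRecord(NamedTuple):
    gender: str         # gender most often declared for this first name
    probability: float  # share of people with that gender for this name
    count: int          # number of people recorded with this first name


def classify_first_name(
    name: str,
    gender_db: Dict[str, NameRecord],
    min_probability: float,
    min_count: int,
) -> Optional[str]:
    """Return the gender for a first name, or None if the name is ignored."""
    record = gender_db.get(name.lower())
    if record is None:
        return None  # name not found in the gender database
    if record.probability < min_probability:
        return None  # too ambiguous (unisex) for the chosen threshold
    if record.count <= min_count:
        return None  # too rare: must occur more often than the threshold
    return record.gender


# Hypothetical records, for illustration only.
toy_db = {
    "anna": NameRecord("female", 0.99, 5000),
    "kim": NameRecord("female", 0.60, 300),
    "alexie": NameRecord("male", 0.90, 2),
}

for first_name in ("Anna", "Kim", "Alexie"):
    print(first_name, classify_first_name(first_name, toy_db, 0.8, 10))
```

Raising either threshold leaves more names unclassified but makes the remaining classifications more reliable.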
So, we looked for the optimum values of these two parameters: the probability that a first name represents a particular gender and the count of how many times a first name was recorded in the database with gender data (Wais b). Based on a preliminary, exploratory analysis, we have decided that the optimum probability should be between 0. Note that the algorithm should be independently trained for the two datasets: titles and authorships. For both datasets, we checked all combinations of (i) probability between 0.
The best combination is the one that leads to the highest accuracy of gender classification, that is, the one for which the algorithm matches the manually coded data in the highest number of cases. For the validation dataset of titles, the algorithm worked best with the probability parameter set to 0. Using these values, we obtained a relatively small overall classification error rate 8.
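The search for the best combination can be sketched as a plain grid search. This is an illustration under stated assumptions, not the genderizeR training routine: the database is a hypothetical dictionary of (gender, probability, count) tuples, names the algorithm ignores are predicted as "unknown", and the manual coding is assumed to use the same "unknown" label.

```python
from itertools import product


def predict(name, gender_db, min_probability, min_count):
    """Classify one first name; ignored names are labelled 'unknown'."""
    record = gender_db.get(name)
    if record is None:
        return "unknown"
    gender, probability, count = record
    if probability < min_probability or count <= min_count:
        return "unknown"
    return gender


def grid_search(validation, gender_db, probability_grid, count_grid):
    """Return (accuracy, probability, count) for the best threshold pair.

    `validation` is a list of (first_name, manually_coded_gender) pairs.
    Ties are resolved in favour of the first combination tried.
    """
    best = None
    for min_probability, min_count in product(probability_grid, count_grid):
        matches = sum(
            predict(name, gender_db, min_probability, min_count) == coded
            for name, coded in validation
        )
        accuracy = matches / len(validation)
        if best is None or accuracy > best[0]:
            best = (accuracy, min_probability, min_count)
    return best


# Hypothetical database and manually coded sample, for illustration only.
toy_db = {
    "anna": ("female", 0.99, 5000),
    "kim": ("female", 0.60, 300),
    "alexie": ("male", 0.90, 2),
}
toy_validation = [("anna", "female"), ("kim", "male"), ("alexie", "unknown")]

print(grid_search(toy_validation, toy_db, [0.5, 0.7], [0, 10]))
```

Because titles and authorships are trained independently, such a search would be run once per validation dataset.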
The gender bias error rate in automatic gender classification was also low 4. Since we estimated the overall classification error rate 8. Thus, to get a more realistic indicator of the classification error rate, we also estimated a more robust bootstrapped error rate 8. For the validation dataset of authorships, the algorithm worked best with the probability parameter set to 0.
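The text does not say which bootstrap estimator was used, so the following is only a sketch of one common variant, the out-of-bag bootstrap: thresholds are re-tuned on each resample and the error is measured on the cases left out of that resample. All names, the data layout, and the toy data are hypothetical.

```python
import random
from itertools import product


def predict(name, gender_db, min_probability, min_count):
    """Classify one first name; ignored names are labelled 'unknown'."""
    record = gender_db.get(name)
    if record is None:
        return "unknown"
    gender, probability, count = record
    if probability < min_probability or count <= min_count:
        return "unknown"
    return gender


def error_rate(sample, gender_db, min_probability, min_count):
    """Share of cases where the prediction differs from the manual code."""
    wrong = sum(
        predict(name, gender_db, min_probability, min_count) != coded
        for name, coded in sample
    )
    return wrong / len(sample)


def train(sample, gender_db, probability_grid, count_grid):
    """Pick the threshold pair with the lowest error on `sample`."""
    return min(
        product(probability_grid, count_grid),
        key=lambda pair: error_rate(sample, gender_db, *pair),
    )


def bootstrap_error(validation, gender_db, probability_grid, count_grid,
                    n_boot=200, seed=0):
    """Average out-of-bag error over `n_boot` bootstrap resamples."""
    rng = random.Random(seed)
    n = len(validation)
    errors = []
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        drawn_set = set(drawn)
        out_of_bag = [validation[i] for i in range(n) if i not in drawn_set]
        if not out_of_bag:
            continue  # every case was drawn; nothing left to test on
        params = train([validation[i] for i in drawn], gender_db,
                       probability_grid, count_grid)
        errors.append(error_rate(out_of_bag, gender_db, *params))
    if not errors:
        raise ValueError("no resample left any out-of-bag cases")
    return sum(errors) / len(errors)


# Hypothetical database and manually coded sample, for illustration only.
toy_db = {
    "anna": ("female", 0.99, 5000), "kim": ("female", 0.60, 300),
    "alexie": ("male", 0.90, 2), "jan": ("male", 0.85, 900),
}
toy_validation = [("anna", "female"), ("kim", "male"), ("alexie", "unknown"),
                  ("jan", "male"), ("anna", "female"), ("zed", "unknown")]

print(bootstrap_error(toy_validation, toy_db, [0.5, 0.7, 0.9], [0, 10]))
```

An error rate measured on the same cases that were used to choose the thresholds tends to be optimistic, which is why an estimate based on held-out cases is usually the more cautious figure.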
Using these values for the authorships dataset, we obtained a small overall classification error rate 6.
Categories of biographical articles
Terminology
Web of Science defines biographical items and items about an individual (which we join into a single document type of biographical articles) as, generally put, articles focused on the lives of individuals, obituaries, tributes, and commemorations, as well as tributes to such people. The latter group represents articles that are not considered biographical in the traditional sense; these can be, for example, transcripts of lectures or review articles on a given topic, whose only relation to an individual is the dedication of the article.
Individual biographical articles can thus differ quite a lot. We therefore conducted an in-depth analysis of a sample of biographical articles to find out whether they can be classified into distinct categories. After a preliminary analysis, we divided the articles into those about living people and those about deceased people.
R software integrated with OpenPoland
It was a great motivator for us, as a new person believed in the idea and wanted to participate in the venture, calling OpenPoland a groundbreaking project. He offered to create a library for the R software environment used by analysts and statisticians, and so he did!
Building a library to connect R with OpenData
1. Coding
Just weeks after the event, Dr Kamil Wais developed a library for the R platform so that analysts can easily access selected data through a few functions that automate communication between the tool and the OpenPoland API. With this package you can easily access hundreds of millions of records in thousands of datasets generated and updated by institutions like the Central Statistical Office of Poland.
This was the first time a plenary took place in Asia. Now I can share my experience with you. Firstly, RDA is a truly global, international and intercontinental organization with a mission to build the social and technical bridges that enable open sharing of research data. Its global character can be seen on every level. One has an opportunity to have fascinating talks about science and data with leading world researchers and experts from different industries, disciplines and the most famous research centers.